CliMA / ClimaParams.jl

Contains all universal constants and physical parameters in CLIMA
Apache License 2.0

Internal representation #59

Open simonbyrne opened 2 years ago

simonbyrne commented 2 years ago

As distinct from the TOML format, it might be worth rethinking how we represent parameters internally. There are pros and cons to each approach, so it is worth spelling them out. As far as I can see, there are 3 broad options:

1. A package of constants

Conceptually this is the simplest: we get rid of the notion of a "parameter set" completely, and you simply have a package with a bunch of constants:

module CLIMAParameters
const param1 = 1.2
...

(note: this code would be generated by the toml file)

This is very easy for the user to make use of:

import CLIMAParameters: param1

f(x) = x*param1

The original reason we didn't use this is that, well, they're constants: you can't modify them. However, with the new Preferences system, we would change the definition above to

module CLIMAParameters
using Preferences
const param1 = @load_preference("param1", 1.2)
...

and then the user could modify it by the LocalPreferences.toml file in their current experiment directory

[CLIMAParameters]
param1 = 1.3

This would then correctly trigger recompilation when the value is modified, and the changes would only apply to the current experiment.

However there are a couple of downsides:

We might be able to work around these by defining them as Refs, but then user code would look like

f(x) = x*param1[]

which isn't particularly nice.
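For concreteness, a minimal runnable sketch of the Ref workaround; the module and function names here are illustrative, not the actual package:

```julia
module RefParams

# a Ref makes the "constant" mutable at runtime, at the cost of the
# extra [] dereference in user code
const param1 = Ref(1.2)

set_param1!(v) = (param1[] = v; nothing)

end # module

# user code has to unwrap the Ref with []
f(x) = x * RefParams.param1[]

RefParams.set_param1!(1.3)
```

Unlike the Preferences approach, changing the value this way takes effect immediately and triggers no recompilation, but the value is no longer a compile-time constant.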

2. Constant functions

This is the approach we use now: a function which returns a constant value:

module CLIMAParameters
param1(::ParameterSet) = 1.2
...

At the user level, this involves:

  1. passing around a "parameter set" object to the different functions
  2. evaluating the function in the scope in which you require the parameter. Since the usual symbol name is already taken by the function, this has led to an ugly and somewhat confusing convention of using underscores:

     function f(paramset, x)
         _param1 = param1(paramset)
         x * _param1
     end

In cases where you might be doing multiple runs with different values of a given parameter, the intention was that you would create a new parameter set type with fields for the modifiable parameter values:

struct CustomParameterSet
   param1::Float64
end
param1(ps::CustomParameterSet) = ps.param1

This is roughly what @charleskawczynski did in TurbulenceConvection.jl: https://github.com/CliMA/TurbulenceConvection.jl/blob/2a0cea8c184b5f8ffe13f9862aaf3197aa8eca56/driver/parameter_set.jl#L43
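The method-based approach above can be condensed into a runnable sketch (all names here are illustrative):

```julia
abstract type AbstractParameterSet end

# the default set: values live in methods, not fields
struct EarthParameterSet <: AbstractParameterSet end
param1(::EarthParameterSet) = 1.2

# a custom set overrides the value through a field
struct CustomParameterSet <: AbstractParameterSet
    param1::Float64
end
param1(ps::CustomParameterSet) = ps.param1

# user code passes the parameter set around
function f(paramset, x)
    _param1 = param1(paramset)
    x * _param1
end
```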

As a possible fix to the interface issue, in #57 I proposed a slight modification:

  1. a parameter is now a singleton type (i.e. a struct with no fields): the name of this type should be descriptive
  2. its value can be extracted by a value function
  3. we use getproperty overloading of the parameter sets to look up parameters by symbol.

Roughly, the (generated) module now becomes

module CLIMAParameters
struct DefaultParameterSet end
struct Parameter1 end
value(::DefaultParameterSet, ::Parameter1) = 1.2

@inline function Base.getproperty(ps::DefaultParameterSet, name::Symbol)
    if name == :param1
        value(ps, Parameter1())
...

and then the user code looks like:

f(ps, x) = x*ps.param1
# or you can define it as a local variable
function f(ps, x)
   param1 = ps.param1
   # or
   @unpack param1 = ps
   # or
   (;param1) = ps
   # or the long version
   param1 = CLIMAParameters.value(ps, CLIMAParameters.Parameter1())
   x * param1
end

After some consideration, I think something along these lines would be my personal preference.
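Putting the pieces above together, a self-contained sketch of the singleton-type proposal might look like the following (the generated module would be much larger; all names here are illustrative):

```julia
module ParamSketch

struct DefaultParameterSet end

# each parameter is a singleton type with a descriptive name
struct Parameter1 end

# its value is extracted by a `value` function
value(::DefaultParameterSet, ::Parameter1) = 1.2

# getproperty overloading lets users look parameters up by symbol
@inline function Base.getproperty(ps::DefaultParameterSet, name::Symbol)
    if name === :param1
        return value(ps, Parameter1())
    else
        return getfield(ps, name)  # fall back to actual struct fields
    end
end

end # module

ps = ParamSketch.DefaultParameterSet()
f(ps, x) = x * ps.param1
</imports>
```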

3. A dictionary

This is at the other end of the extreme: basically CLIMAParameters just provides a dictionary of values, and it is up to the model components to extract the necessary values at construction (to avoid the cost of dictionary lookups at runtime), e.g.

struct ComponentA
   param1::Float64
   ...
end
ComponentA(ps::ParameterSet) = ComponentA(ps[:param1], ...) # values are copied at construction

f(c::ComponentA, x) = x * c.param1

This would mean that components could be run completely independently of CLIMAParameters. However, it could mean that we end up duplicating a lot of code and constants and just pushing the problem down into individual components, e.g. Thermodynamics.jl would probably end up defining its own ThermodynamicsParameters object to capture all the parameters needed for thermodynamics.
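A minimal runnable sketch of this option, with a plain Dict standing in for the parameter set:

```julia
# the parameter "set" is just a dictionary of values
paramset = Dict(:param1 => 1.2, :param2 => 9.81)

struct ComponentA
    param1::Float64
end

# values are copied out of the dictionary once, at construction, so no
# dictionary lookup happens while the model is running
ComponentA(ps::Dict) = ComponentA(ps[:param1])

f(c::ComponentA, x) = x * c.param1

c = ComponentA(paramset)
```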

tapios commented 2 years ago

We definitely do not want to trigger a recompile for every parameter change. Once you have a full ESM, this will become unwieldy. We want parameters to be runtime constants of some form.

odunbar commented 2 years ago

In 3. The "extraction" can be automated by tagging parameters? Then do we still need duplicate code, or can we not write the boilerplate in ClimaParameters? e.g. ClimaParameters.create_parameter_struct(component, tags)

You are right, this leads to many structs, but they are well organized: one for each "component", and each struct contains only the parameters used within that component's source code. And in totality, are we not still talking (ignoring NN parameters) about 10-100 floats per component and maybe 10-20 components? This doesn't seem too crazy.

simonbyrne commented 2 years ago

In 3. The "extraction" can be automated by tagging parameters? Then do we still need duplicate code, or can we not write the boilerplate in ClimaParameters?

We could probably figure something out, e.g. a macro

@parameterstruct(paramdict, ComponentX)

which expands to

struct ComponentXParameters
   param1::Float64 # the set of parameters tagged "Component1"
   ...
end
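One hypothetical way to implement something like this is plain top-level code generation from a tag table; everything below (the table, the function name) is illustrative, not an actual ClimaParameters API:

```julia
# hypothetical tag table: value plus the components that use the parameter
const PARAM_TAGS = Dict(
    :param1 => (1.2, ["ComponentX"]),
    :param2 => (9.81, ["ComponentX", "ComponentY"]),
    :param3 => (0.5, ["ComponentY"]),
)

# generate a `<Component>Parameters` struct holding every parameter
# tagged with `component`, plus a zero-argument constructor that fills
# in the defaults from the table
function generate_parameter_struct(component::String)
    names    = sort([k for (k, v) in PARAM_TAGS if component in v[2]])
    defaults = [PARAM_TAGS[n][1] for n in names]
    fields   = [:($n::Float64) for n in names]
    sname    = Symbol(component, "Parameters")
    @eval begin
        struct $sname
            $(fields...)
        end
        $sname() = $sname($(defaults...))
    end
    return sname
end

generate_parameter_struct("ComponentX")
p = ComponentXParameters()
```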

A couple of things worth noting:

odunbar commented 2 years ago

RE the tag specificity. I think this is why I keep coming back to the organization of the code: for us, what do we mean by a component (does this relate to a git repo?) or a subcomponent (does this relate to a module?), etc.

The tag specificity is a trade-off: broad tags (e.g. a git repo) will produce fewer, larger structs; narrow tags (e.g. a module) will produce more, smaller structs.

simonbyrne commented 2 years ago

Modules are namespaces: i.e. a way to encapsulate things so that names are unique within a module. Packages are modules with a bit of extra metadata to resolve dependencies (and typically match 1-1 with a repo).

simonbyrne commented 2 years ago

After some further discussion with @odunbar, we agreed on the following approach (based on option 3, and similar to what is done in OceanTurbulenceParameterEstimation.jl):

  1. A parameter set is basically a dictionary, which contains the values and associated metadata.

  2. Model components which use any parameters would need to access them when they are constructed (i.e. not while the model is running), and load them into their own struct (or similar object) or subcomponents: e.g. suppose we define a model such as:

    struct AtmosModel
    # parameters required directly by AtmosModel
    gravitational_acceleration
    coriolis
    ...
    # subcomponents
    thermodynamics
    ...
    end

    Then the constructor

    AtmosModel(paramset; thermodynamics=MoistEquilibrium())

    would do something like

    # extract parameters required by model
    gravitational_acceleration = get_parameter(paramset, :gravitational_acceleration, "AtmosModel")
    coriolis = get_parameter(paramset, :coriolis, "AtmosModel")
    # recursively extract subcomponents
    thermodynamics = update(MoistEquilibrium(), paramset) # constructs a new MoistEquilibrium object with the relevant parameters
  3. Once the components are constructed, they would refer to their own object to get the parameter value (i.e. we shouldn't be looking up parameter sets once the model is running). For example, Thermodynamics.gas_constant_air:

https://github.com/CliMA/Thermodynamics.jl/blob/31231f74fd46268f9f6231a5f24aa7cfcdb000a2/src/relations.jl#L72-L77

would now refer to the specific MoistEquilibrium object instead of the parameter set:

function gas_constant_air(thermo::MoistEquilibrium, q::PhasePartition{FT}) where {FT}
    return thermo.R_d *
           (1 + (thermo.molmass_ratio - 1) * q.tot - thermo.molmass_ratio * (q.liq + q.ice))
end
  4. We can build some convenience functions for automatically mapping parameters to struct members; this would also allow us to log exactly which parameters were used in a given experiment, and by which components, e.g. by defining

     function get_parameter(paramset, name, component)
         # log that the parameter was used
         @info "parameter accessed" name component
         # get the value
         return paramset.dict[name]
     end

This is a significant change from our current approach: unfortunately, there isn't an easy way to switch over gradually, but it is probably worth prototyping this in a single package (e.g. TurbulenceConvection or Thermodynamics?).
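To make the agreed approach concrete, here is a runnable sketch of steps 1-4, with a small wrapper type standing in for the parameter set; all names (ParamSet, get_parameter, MoistEquilibrium, the AtmosModel fields) are illustrative:

```julia
struct ParamSet
    dict::Dict{Symbol,Float64}
    log::Vector{Tuple{Symbol,String}}  # records (parameter, component) accesses
end
ParamSet(d::Dict{Symbol,Float64}) = ParamSet(d, Tuple{Symbol,String}[])

# step 4: every lookup is logged, so we know exactly which parameters a
# given experiment used, and from which component
function get_parameter(ps::ParamSet, name::Symbol, component::String)
    push!(ps.log, (name, component))
    return ps.dict[name]
end

# step 2: a subcomponent copies its parameters out at construction
struct MoistEquilibrium
    R_d::Float64
    molmass_ratio::Float64
end
MoistEquilibrium(ps::ParamSet) = MoistEquilibrium(
    get_parameter(ps, :R_d, "MoistEquilibrium"),
    get_parameter(ps, :molmass_ratio, "MoistEquilibrium"),
)

struct AtmosModel
    gravitational_acceleration::Float64
    thermodynamics::MoistEquilibrium
end
AtmosModel(ps::ParamSet) = AtmosModel(
    get_parameter(ps, :gravitational_acceleration, "AtmosModel"),
    MoistEquilibrium(ps),  # recursively construct subcomponents
)

# step 3: once constructed, components only read their own fields
gas_constant_air(thermo::MoistEquilibrium) = thermo.R_d

ps = ParamSet(Dict(:gravitational_acceleration => 9.81,
                   :R_d => 287.0, :molmass_ratio => 1.61))
model = AtmosModel(ps)
```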

bischtob commented 2 years ago

I love this.

charleskawczynski commented 2 years ago

Sorry I'm late to the party.

I don't mean to open a can of worms, but @kpamnany and I have already voiced concerns about the distributed approach (option 3), as it does not guarantee the "single parameter" requirement that we were given in the very early stages of developing a design solution to CliMA's parameters.

If we go down this road of distributed parameters, then it sounds like we've relaxed this requirement. The existing approach (using methods, not types) ensures it by construction. The distributed approach, where each package / module has its own parameters via separate types, allows the same parameter to take different values, and the responsibility of ensuring a single value is used for all parameters is pushed to the outermost layer (e.g., tests/experiments/drivers for all other repos, i.e., the user). Here's a simple example:

atmos = AtmosModel(
thermo = ThermoParams(; grav = 1),          # override for grav in Thermodynamics.jl
microphys = MicrophysicsParams(; grav = 2), # override for grav in CloudMicrophysics.jl
)

Arguably, this is something that obviously looks wrong and "won't typically happen", but the point is that it can happen. Let's consider a slightly more complex situation:

atmos = AtmosModel(
thermo = ThermoParams(),                    # I can use Thermo's defaults, right? ¯\_(ツ)_/¯
microphys = MicrophysicsParams(; grav = 2), # override for grav in CloudMicrophysics.jl
)

And now we have inconsistent parameters.

Now, I'm not saying that we shouldn't take the distributed approach--our original design choice was sketched out based on the given requirements at the time. This distributed approach may look fine / easy / sensible when looking at a single package, like Thermodynamics or Oceananigans. But, our existing method-based design satisfies the single parameter requirement across all packages.

My preference would be to open two issues: 1) regarding the headaches with our existing design, and 2) regarding additional needs beyond our existing design. For 2), #58 looks great, and that can work with our existing design. For 1), I imagine that we will be able to alleviate the headaches with our existing design once they are laid out. I think I recall a preference for α = param_set.α over α = CP.α(param_set), but I don't see why the latter is problematic, and I've yet to see a concrete technical reason to change our existing design.

kpamnany commented 2 years ago

Perhaps it would be good to start by documenting the requirements of the parameters feature?

tapios commented 2 years ago

A distinction here may help w.r.t. when we do and don't need consistent parameters:

  1. All "planet" parameters should be the same across model components. This includes thermodynamic constants, planetary rotation rate, mean radius etc., and crucially, e.g., the latent heat of vaporization at the triple point (land, ocean, atmosphere, ice should use the same to ensure energy conservation for the coupled model). These should all live in some "planet" file shared by all model components.

  2. Many other parameters can plausibly be different for different model components. For example, the Smagorinsky coefficient for an atmosphere and ocean model may well be different. We do not need to enforce consistency here.

trontrytel commented 2 years ago
  2. Many other parameters can plausibly be different for different model components. For example, the Smagorinsky coefficient for an atmosphere and ocean model may well be different. We do not need to enforce consistency here.

Wouldn't it still be safer in that case to have Smagorinsky_ocean and Smagorinsky_atmos coefficients, so that we keep the names unique across the codebase?

trontrytel commented 2 years ago

I think a good usage example is when, for debugging purposes, we wanted to set the freezing temperature to something very low to test the TurbulenceConvection.jl package. This required overwriting T_freeze for Thermodynamics.jl, CloudMicrophysics.jl and TurbulenceConvection.jl at the same time. It would be nice to still be able to do this easily and in one place, without having to go through the individual packages.

tapios commented 2 years ago

2. Many other parameters can plausibly be different for different model components. For example, the Smagorinsky coefficient for an atmosphere and ocean model may well be different. We do not need to enforce consistency here.

Wouldn't it still be safer in that case to have Smagorinsky_ocean and Smagorinsky_atmos coefficient, so that we keep the names unique across the codebase?

Maybe. But if (one day) atmosphere and ocean use the same CFD code base, you couldn't easily do that. The broader point is: there are parameters that are not "universal" constants, and at least for experimentation, there may be situations where we want them to be different in different model components.

odunbar commented 2 years ago

Sorry, I just saw this thread; thanks for voicing these concerns, @charleskawczynski! To summarize, what our current setup does is remove the word "enforce" from the requirements and replace it with "make it difficult to change". As @kpamnany and @tapios note, I think this results in a softening of the need for universal parameters; in practice, though, it seems we would rather have such common-sense principles than strict enforcement, as it is likely we will require this flexibility down the road.

On @trontrytel and @tapios conversation. Yet another option is (if relationships are known between constants)

  1. you define smagorinsky only in the parameter file with a description "atmospheric smagorinsky"
  2. this gets distributed around to all components that want it.
  3. when you build ocean=OceanComponent you create a "derived parameter" with smagorinsky = smagorinsky + epsilon
  4. all ocean code will call ocean.smagorinsky

The difference from the previous suggestions is that only one constant is defined in the parameter file, and changes to that parameter occur at the component level of the use case (and no further). You can also guard against confusion by renaming the local smagorinsky (e.g. ocean_smagorinsky, to indicate it is different). If not, you still know where the parameter is defined: it is accessed as ocean.parameter, so it must be set in the OceanComponent constructor.
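A small sketch of this derived-parameter idea, with illustrative names:

```julia
# only one smagorinsky is defined in the parameter file,
# with description "atmospheric smagorinsky"
paramfile = Dict(:smagorinsky => 0.16)

struct OceanComponent
    smagorinsky::Float64  # a derived parameter, local to the ocean
end

# the derivation happens once, in the constructor, and no further
OceanComponent(params::Dict; epsilon = 0.02) =
    OceanComponent(params[:smagorinsky] + epsilon)

ocean = OceanComponent(paramfile)
# all ocean code calls ocean.smagorinsky
```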

glwagner commented 2 years ago

Rather than building support for "parameter sets" into the model constructors, it might be less intrusive to require individual components (e.g. a mutable struct AtmosModel) to define a function

set!(model, parameter_set)

that mutates model components according to the contents of parameter_set. We have prototype support for this functionality in Oceananigans / OceanTurbulenceParameterEstimation for the purpose of calibrating turbulence closures, and it's useful for running a few different models without recompiling or reallocating memory for a new simulation.

The main benefit of this pattern is that it simplifies the model constructors, which seem to end up being complicated pieces of code.
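A minimal sketch of the set! pattern, assuming a mutable model struct; the field names and signature here are illustrative, not the actual Oceananigans / OceanTurbulenceParameterEstimation API:

```julia
mutable struct AtmosModel
    grav::Float64
    smagorinsky::Float64
end

# mutate the model in place from the parameter set, so a new run needs
# no recompilation and no reallocation
function set!(model, parameter_set::Dict{Symbol,Float64})
    for (name, value) in parameter_set
        hasfield(typeof(model), name) && setfield!(model, name, value)
    end
    return model
end

model = AtmosModel(9.81, 0.16)
set!(model, Dict(:smagorinsky => 0.2))
```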

simonbyrne commented 2 years ago

Given the discussion here: https://github.com/CliMA/TurbulenceConvection.jl/pull/766#discussion_r795078598, options 1 and 2 aren't really going to be feasible once you get into arrays of hundreds or thousands of values.