esm-tools / pymorize

A Python based Tool to CMORize NetCDF Data
MIT License

Parsing of tables (data request) #18

Closed mandresm closed 1 month ago

mandresm commented 3 months ago

Builds the data request information from the tables, making sure that no variable is repeated and that the information spread across the different tables is merged correctly.

https://github.com/FESOM/seamore/blob/7725366f7b68ea3824ac6baa500ea49531722b72/lib/data_request.rb#L7

The file consists of four classes: DataRequest, DataRequestVariable, TableVarEntry, and DataRequestTable.

The overall design ensures that variables are merged correctly across multiple tables, providing a unified interface to query variables and their metadata.

Crappy ~UML

flowchart LR

    subgraph DataRequest
        subgraph DRinitialize[initialize]
            direction TB
            table_initialization
            --> var_initialization
            --> for_vars[for v in vars]
            --> if_v_exists[    if v exists]
            --> if_v_!exists[    else]
        end
    end
    subgraph DataRequestVariable
        new_from_table_var_entry
        --> DRVinitialize[initialize]
        merge_table_var_entry
    end
    subgraph TableVarEntry
        TVEinitialize[initialize]
    end
    subgraph DataRequestTable
        DRTinitialize[initialize]
        variable_entries
    end

    table_initialization --> DRTinitialize
    var_initialization --> variable_entries
    variable_entries --> TVEinitialize
    if_v_exists --> merge_table_var_entry
    if_v_!exists --> new_from_table_var_entry
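
Read as Python, the flow in the diagram could be sketched roughly as follows. This is only an illustration of the merge logic, not the seamore or pymorize implementation: the class and method names follow the diagram, while the attributes (`table_id`, `frequency`, `units`, ...), the input layout, and the choice to raise on conflicting units are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TableVarEntry:
    """One variable as it appears in one table (fields are assumed)."""
    table_id: str
    name: str
    frequency: str
    units: str


@dataclass
class DataRequestTable:
    """One table; `entries` maps variable name to its metadata dict."""
    table_id: str
    entries: dict

    def variable_entries(self):
        return [
            TableVarEntry(self.table_id, name, meta["frequency"], meta["units"])
            for name, meta in self.entries.items()
        ]


@dataclass
class DataRequestVariable:
    """A variable merged across every table that lists it."""
    name: str
    units: str
    tables: list = field(default_factory=list)
    frequencies: list = field(default_factory=list)

    @classmethod
    def new_from_table_var_entry(cls, entry):
        var = cls(name=entry.name, units=entry.units)
        var.merge_table_var_entry(entry)
        return var

    def merge_table_var_entry(self, entry):
        # Assumed policy: refuse to merge entries that disagree on units.
        if entry.units != self.units:
            raise ValueError(f"conflicting units for {self.name}")
        self.tables.append(entry.table_id)
        self.frequencies.append(entry.frequency)


class DataRequest:
    def __init__(self, tables):
        self.variables = {}
        for table in tables:
            for entry in table.variable_entries():
                if entry.name in self.variables:
                    # Variable already known: merge instead of repeating it.
                    self.variables[entry.name].merge_table_var_entry(entry)
                else:
                    self.variables[entry.name] = (
                        DataRequestVariable.new_from_table_var_entry(entry)
                    )


# Illustrative input only, not read from the real tables.
tables = [
    DataRequestTable("Omon", {"so": {"frequency": "mon", "units": "0.001"}}),
    DataRequestTable("Odec", {"so": {"frequency": "dec", "units": "0.001"}}),
]
request = DataRequest(tables)
so = request.variables["so"]
print(so.tables, so.frequencies)
```

With this layout a variable such as `so` appears once in `request.variables`, and the tables it came from are kept on the merged object.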

pgierz commented 2 months ago

I have some design questions about this, in particular how it will fit into the rest of our current config structure. Currently we have the following, and some of what I write down here will need to end up in the handbook (which I nominate @chrisdane and @christian-stepanek to improve upon once I draft it, since user-facing documentation is best written by early test users)

pymorize:
   # ... program metadata and settings for the program itself ...
general:  # (Or global, we haven't decided the name yet)
   # ... Information that will always be relevant
pipelines:
   # ... collections of steps to apply ...
rules:
   # ... list of rules (see below) ...

The rules section is the main part that users will need to deal with. It is a list of dictionaries, each of which maps a collection of user files to a CMOR variable. A typical rule might look like this:

rules:
    - cmor_variable: so
      model_variable: salt
      cmor_units: PSU
      model_units: PSU  # Can be omitted if NetCDF meta-data is complete. If given, value
                        # in the rules sections will win over what is in the NetCDF, always.
      file_patterns:
          - /a/pattern/with/fesom.salt.(?P<year>\d+).*nc  # Use Python-extended regex, **not**
                                                          # globbing!!!

The rules specification is a work-in-progress, and not set in stone (yet). Still to be considered are output files, variables that end up in multiple files (time aggregation), CMOR variables that depend on multiple inputs...
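
To make the regex-versus-globbing point concrete, here is a minimal sketch of how a `file_patterns` entry might be matched against candidate paths. It assumes the pattern in the example rule is meant to contain a named group such as `(?P<year>\d+)`; the helper `match_files` and the sample paths are hypothetical and not part of pymorize.

```python
import re

rule = {
    "cmor_variable": "so",
    "model_variable": "salt",
    # Assumed form of the pattern; dots are escaped here so they match literally.
    "file_patterns": [r"/a/pattern/with/fesom\.salt\.(?P<year>\d+).*nc"],
}


def match_files(rule, candidate_paths):
    """Return (path, captured groups) for every path matching any pattern."""
    compiled = [re.compile(p) for p in rule["file_patterns"]]
    matches = []
    for path in candidate_paths:
        for pattern in compiled:
            found = pattern.fullmatch(path)
            if found:
                matches.append((path, found.groupdict()))
                break  # one matching pattern per path is enough
    return matches


paths = [
    "/a/pattern/with/fesom.salt.1850.nc",
    "/a/pattern/with/fesom.salt.1851.nc",
    "/a/pattern/with/fesom.temp.1850.nc",
]
for path, groups in match_files(rule, paths):
    print(path, groups["year"])
```

A named group gives each matched file a `year` value for free, which a glob such as `fesom.salt.*.nc` could not provide.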

pgierz commented 2 months ago

Sorry, I forgot to actually ask my question: I guess a Rule will be responsible for one single DataRequestVariable and contain all the information needed to generate that variable. Question: does that make sense, or am I overlooking some edge case?

chrisdane commented 2 months ago

I am not quite sure, but I think yes.

In this particular example I don't understand why cmor_units should be defined in the rule, i.e. by the user (?). In my view this information should be retrieved from the cmip6-cmor-tables repo.

Also, in the rule the cmip table that defines the variable of interest must be given, i.e. in this example Omon or Odec:

cd cmip6-cmor-tables/Tables
grep "\"so\":" *
CMIP6_Odec.json:        "so": {
CMIP6_Omon.json:        "so": {
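
The same lookup can be done programmatically. The sketch below assumes the layout of the CMIP6 CMOR table files, where each `CMIP6_<table>.json` keeps its variables under a top-level `variable_entry` key; the function name `tables_defining` is made up for illustration.

```python
import json
from pathlib import Path


def tables_defining(variable, tables_dir):
    """Return the table file names whose variable_entry lists `variable`."""
    hits = []
    for table_file in sorted(Path(tables_dir).glob("CMIP6_*.json")):
        with open(table_file) as handle:
            table = json.load(handle)
        # Files without a variable_entry section are simply skipped.
        if variable in table.get("variable_entry", {}):
            hits.append(table_file.name)
    return hits


if __name__ == "__main__":
    print(tables_defining("so", "cmip6-cmor-tables/Tables"))
```

If more than one table is returned, the rule has to say which one is meant.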

pgierz commented 2 months ago

That is just for completeness, to show what kind of information will be in a rule. Not everything in the rule will be asked of the user, only ambiguous information. The actual CMOR unit value will be parsed from the table, along with possibly many other things.
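
One way to picture this division of labour: values parsed from the table fill in whatever the user-written rule leaves out. In the sketch below the function `complete_rule`, the mapping from table keys to rule keys, and the sample table entry are all assumptions made for illustration; whether a user-supplied value should ever take precedence over the table is left open.

```python
def complete_rule(rule, table_entry):
    """Return a copy of `rule` with missing CMOR metadata taken from the table."""
    # Hypothetical mapping: rule key -> key in the table's variable entry.
    table_keys = {"cmor_units": "units", "standard_name": "standard_name"}
    completed = dict(rule)  # leave the user's rule untouched
    for rule_key, table_key in table_keys.items():
        if rule_key not in completed and table_key in table_entry:
            completed[rule_key] = table_entry[table_key]
    return completed


user_rule = {"cmor_variable": "so", "model_variable": "salt"}
# Sample values for illustration only.
table_entry = {"units": "0.001", "standard_name": "sea_water_salinity"}
print(complete_rule(user_rule, table_entry))
```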

christian-stepanek commented 2 months ago

I think that the rules section depends only on two conditions: the variable and metadata definition as made for the MIP (CMIP7 FastTrack, for example), and the definition of the variable and metadata as made in the model. Once the rules are set up for a specific MIP and for a specific model, the hope would be that nobody will have to tamper with that section anymore.

christian-stepanek commented 2 months ago

@chrisdane @pgierz - maybe the definition of "user" is misleading here. The aforementioned user would be the person that defines the rules based on conditions in the model and data request demands for the MIP. From my point of view the "user" would not be the individual modeller. They would in most cases use the predefined rules as they are.

Does this answer more questions than it raises?

pgierz commented 2 months ago

@christian-stepanek: Correct! Well, sort of. One "user" (i.e. not someone developing the actual logic of the pymorize tool) still needs to sit down and write the mapping of CMOR to Model. That can then of course be shared. What we (the HPC team) would give is the framework for how to write down such rules. Filling them with useful values is of course up to you ;)

christian-stepanek commented 2 months ago

Re: "I guess a Rule will be responsible for one single DataRequestVariable and contain all the information needed to generate that variable. Question: does that make sense, or am I overlooking some edge case"

I think this is correct. Every variable will have one specific set of rules that define how it is to be computed and formatted, which sign conventions are applied, which metadata is to be included in the NetCDF file, etc.

Note that there will be various different instances of "the same" physical model output. As @chrisdane stated above, the same variable, e.g. SAT, may be present in different CMOR tables, and different CMOR rules may apply. For example, tas (SAT) is available in both Amon and Aday tables (and in some others as well). The most relevant difference between them is that the Amon version of the variable is to be computed as monthly means, whereas the Aday version is to be computed as daily means. Therefore, the rule for Amon.tas will be different from the rule for Aday.tas, at least with regard to the definition of the time mean. Whether other things differ must be deduced from a comparison of the respective CMOR tables.
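
Put differently, a rule would be identified by the pair of table and variable rather than by the variable alone. A minimal sketch, in which the `Rule` fields, the `time_mean` values, and the model variable name `temp2` are assumptions and not pymorize's actual schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    table: str
    cmor_variable: str
    model_variable: str
    time_mean: str  # e.g. "monthly" or "daily"


rules = [
    Rule("Amon", "tas", "temp2", "monthly"),
    Rule("Aday", "tas", "temp2", "daily"),
]

# Index by (table, variable) so that both versions of tas can coexist.
rules_by_key = {(r.table, r.cmor_variable): r for r in rules}

print(rules_by_key[("Amon", "tas")].time_mean)
print(rules_by_key[("Aday", "tas")].time_mean)
```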

I do not yet fully understand the diagram that you provide above. If there is something more detailed to understand and to discuss then maybe it is best to do that in our CMOR meeting.