manage domain dependent data requests

WCRP-CORDEX / data-request-table

Machine readable data request tables

MIT License

0 stars 0 forks source link

manage domain dependent data requests #36

Open larsbuntemeyer opened 1 month ago

larsbuntemeyer commented 1 month ago

We would like to be able to manage domain specific data requests. I think we could add a colum like domain_id to the csv table that's being empty by default since the main data request is valid for all domains. If a specific variable at a certain frequency is only requested by a certain domain/community in addition to the default data request, the domain_id column get's an entry. This way, we can easily sort out domain specific data requests later, e.g., split into domain specific lists of variables, etc... However, the cmor tables would always contain all required entries. We can also easily create, e.g., execl sheets or html tables sorted, e.g., by the domain_id or priority, etc...

larsbuntemeyer commented 1 month ago

my prototype code to create excel with 3 sheets (one per priority):

import pandas as pd

cols = ['out_name', 'standard_name', 'long_name', 'cell_methods']

def human_readable(df):
    return df.groupby(cols)['frequency'].apply(list).to_frame()#.reset_index()

#df.groupby('priority').apply(human_readable, include_groups=False)
df = pd.read_csv("CORDEX-CMIP6/data-request.csv")
#human_readable(df)
sheets = {k: human_readable(v) for k, v in df.groupby('priority')}

with pd.ExcelWriter('data-request.xlsx') as writer:
    for k, v in sheets.items():
        v.to_excel(writer, sheet_name=k, index=True)
        nlevels = v.index.nlevels + len(v.columns)
        worksheet = writer.sheets[k]  # pull worksheet object
        for i in range(nlevels):
            worksheet.set_column(i, i, 40)

This would, e.g., recreate more or less the original google docs spreadsheet we started with....

data-request.xlsx

jesusff commented 1 month ago

Hi @larsbuntemeyer With this setup (additional column for domain_id) it seems harder to account for different priorities for a variable across domains (e.g. a TIER1 required as CORE for a particular domain). Also, how to manage TIER1 variables in the main CORDEX request dropped by a particular domain?

Having the main CORDEX request as reference plus domain-specific requests might simplify the creation of a central data-request (dropping priorities) that can be used to generate the CMOR tables. To avoid having to cross-check repeated attributes, we could have a central table with all attributes (except priority) for each ['out_name', 'frequency'] and the actual data request tables contain only ['out_name', 'frequency', 'priority'] columns. This would be a kind of relational database with ['out_name', 'frequency'] as primary key.

larsbuntemeyer commented 1 month ago

Ok, i agree, the priorities are an issue. The relational thing sounds good, we could have the data requests containing only attributes you suggested (and humans are actually interested in)!