Open larsbuntemeyer opened 1 month ago
my prototype code to create excel with 3 sheets (one per priority):
import pandas as pd
cols = ['out_name', 'standard_name', 'long_name', 'cell_methods']
def human_readable(df):
return df.groupby(cols)['frequency'].apply(list).to_frame()#.reset_index()
#df.groupby('priority').apply(human_readable, include_groups=False)
df = pd.read_csv("CORDEX-CMIP6/data-request.csv")
#human_readable(df)
sheets = {k: human_readable(v) for k, v in df.groupby('priority')}
with pd.ExcelWriter('data-request.xlsx') as writer:
for k, v in sheets.items():
v.to_excel(writer, sheet_name=k, index=True)
nlevels = v.index.nlevels + len(v.columns)
worksheet = writer.sheets[k] # pull worksheet object
for i in range(nlevels):
worksheet.set_column(i, i, 40)
This would, e.g., recreate more or less the original google docs spreadsheet we started with....
Hi @larsbuntemeyer
With this setup (additional column for domain_id
) it seems harder to account for different priorities for a variable across domains (e.g. a TIER1 required as CORE for a particular domain). Also, how to manage TIER1 variables in the main CORDEX request dropped by a particular domain?
Having the main CORDEX request as reference plus domain-specific requests might simplify the creation of a central data-request (dropping priorities) that can be used to generate the CMOR tables. To avoid having to cross-check repeated attributes, we could have a central table with all attributes (except priority) for each ['out_name', 'frequency']
and the actual data request tables contain only ['out_name', 'frequency', 'priority']
columns. This would be a kind of relational database with ['out_name', 'frequency']
as primary key.
Ok, i agree, the priorities are an issue. The relational thing sounds good, we could have the data requests containing only attributes you suggested (and humans are actually interested in)!
We would like to be able to manage domain specific data requests. I think we could add a colum like
domain_id
to the csv table that's being empty by default since the main data request is valid for all domains. If a specific variable at a certain frequency is only requested by a certain domain/community in addition to the default data request, thedomain_id
column get's an entry. This way, we can easily sort out domain specific data requests later, e.g., split into domain specific lists of variables, etc... However, the cmor tables would always contain all required entries. We can also easily create, e.g., execl sheets or html tables sorted, e.g., by the domain_id or priority, etc...