Open ceblanton opened 6 months ago
cat = cat.search(variable_id="high_cld_amt")
dset_dict = cat.to_dataset_dict(cdf_kwargs={'chunks': {'time':5}, 'decode_times': False})
--> The keys in the returned dictionary of datasets are constructed as follows: 'source_id.experiment_id.frequency.modeling_realm.variable_id.chunk_freq'
████████████████████████████████████████████████████████████████████████████████████████| 100.00% [2/2 00:04<00:00]
dset_dict.keys()
dict_keys(['am5.c96L65_am5f7b11r0_amip.P1M.atmos_level.high_cld_amt.P1Y', 'am5.c96L65_am5f7b11r0_amip.P1M.atmos.high_cld_amt.P1Y'])
@ceblanton member_id is empty (""). When it's empty, perhaps the logic in Ray's script should drop it from the key name?
Or we enforce no nulls, which may be something we discussed before.
On May 9th, it was decided to use "na" as the default value for the aggregate columns rather than empty values, to help maintain a "key pattern" at this early stage of adoption. Down the line, we will provide examples of querying for the dataset/key names dynamically.
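A minimal sketch of how an "na" default keeps the key pattern stable. `build_key` is a hypothetical helper for illustration, not intake-esm's own implementation, and the column order is an assumption: it follows the key layout quoted at the top of this issue, with member_id placed after frequency to match the key shown later in the thread.

```python
# Hypothetical helper, not part of intake-esm.
# Assumed aggregation-column order; the position of member_id is inferred
# from the key reported as working later in this thread.
KEY_COLUMNS = [
    "source_id", "experiment_id", "frequency", "member_id",
    "modeling_realm", "variable_id", "chunk_freq",
]


def build_key(row, default="na", sep="."):
    """Join the key columns, substituting `default` for empty or missing values."""
    parts = []
    for col in KEY_COLUMNS:
        value = row.get(col)
        parts.append(str(value) if value not in (None, "") else default)
    return sep.join(parts)


row = {
    "source_id": "am5",
    "experiment_id": "c96L65_am5f7b11r0_amip",
    "frequency": "P1M",
    "member_id": "",  # empty in the catalog
    "modeling_realm": "atmos_level",
    "variable_id": "high_cld_amt",
    "chunk_freq": "P1Y",
}
print(build_key(row))
```

With a default in place, every key has the same number of fields, so a script can rely on field positions even when a column has no value.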
@ceblanton
The PR that makes member_id default to "na" is ready. But I realize Ray's key is still missing the chunk frequency, which is an aggregate column. I am not sure if leaving it in the key or using a default for chunk_freq is a good idea; we can't find unique datasets without it. This also circles back to not having to hard-code these key names.
This now works:
am5.c96L65_am5f7b11r0_amip.P1M.na.atmos_level.high_cld_amt.P1Y
You can test:
import intake
import intake_esm

# Path to the catalog JSON
cat = "/home/a1r/cat/canopy/am5f7b11r0/c96L65_am5f7b11r0_amipn0513.json"
cat_store = intake.open_esm_datastore(cat)
cat_subset = cat_store.search(variable_id="high_cld_amt")
dset_dict = cat_subset.to_dataset_dict(cdf_kwargs={'chunks': {'time':5}, 'decode_times': False})

# This gives the dataset names dynamically, based on the search and the existing catalog + spec.
for k in dset_dict.keys():
    print(k)

# Test the new key that is expected to work
dset_dict['am5.c96L65_am5f7b11r0_amip.P1M.na.atmos_level.high_cld_amt.P1Y']
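To avoid hard-coding the full key, a script could select keys by field value instead. The following is only a sketch: `find_keys` is a hypothetical helper, the field order is assumed from the working key above, and it assumes no field value itself contains a ".". The second sample key is illustrative.

```python
# Hypothetical helper: pick dataset keys by field instead of hard-coding them.
# Assumed field order, inferred from the working key in this thread.
FIELDS = (
    "source_id", "experiment_id", "frequency", "member_id",
    "modeling_realm", "variable_id", "chunk_freq",
)


def find_keys(keys, **wanted):
    """Return the keys whose dot-separated fields match every wanted value."""
    matches = []
    for key in keys:
        parts = key.split(".")
        if len(parts) != len(FIELDS):
            continue  # key does not follow the assumed pattern
        fields = dict(zip(FIELDS, parts))
        if all(fields.get(name) == value for name, value in wanted.items()):
            matches.append(key)
    return matches


# Stand-in for list(dset_dict.keys()); the second key is illustrative.
keys = [
    "am5.c96L65_am5f7b11r0_amip.P1M.na.atmos_level.high_cld_amt.P1Y",
    "am5.c96L65_am5f7b11r0_amip.P1M.na.atmos.high_cld_amt.P1Y",
]
print(find_keys(keys, modeling_realm="atmos_level"))
```

In an analysis script, the result could be checked for exactly one match before indexing into dset_dict, which would surface an ambiguous or missing dataset early.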
Figure generated: /nbhome/a1r/analysis-scripts/pngs/cloud-fraction.png
Script used: https://github.com/aradhakrishnanGFDL/analysis-scripts/blob/prototype1-a1r/raytest.py
The changes are in my fork, and they cover only one suite:
https://github.com/aradhakrishnanGFDL/analysis-scripts/tree/prototype1-a1r/freanalysis_clouds
To support this, we need to remove source_id from the aggregation columns. MDTF uses it, though, so let's discuss. @ceblanton
FRE Canopy is generating catalogs using:
module load fre/canopy
fre catalog build --overwrite -i $ppdir -o $ppdir/catalog
sed -i.bak -e 's/,P1M,/,monthly,/' $ppdir/catalog.csv
An example pp directory and catalog file are here:
The example analysis script usage (the Ray example) is:
That fails with this message
The mystery is that this very-similar catalog works:
/net2/rlm/analysis-scripts/example/catalog.json
We think the difference is "n/a" versus a missing value for the ensemble vocabulary.
Hopefully, the "fre catalog validate /path/to/schema.json /path/to/catalog-to-test.json" usage can detect this mismatch or inconsistency before we try to launch the script.
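Until then, a quick manual check of the catalog CSV can show which form a column uses. This is a rough sketch and not what `fre catalog validate` does; the function name is hypothetical, and `member_id` as the ensemble column is an assumption based on the discussion above.

```python
import csv
import io


def count_placeholder_values(csv_text, column):
    """Count how each row fills `column`: empty, "n/a", "na", or a real value."""
    counts = {"empty": 0, "n/a": 0, "na": 0, "other": 0}
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = (row.get(column) or "").strip()
        if value == "":
            counts["empty"] += 1
        elif value in ("n/a", "na"):
            counts[value] += 1
        else:
            counts["other"] += 1
    return counts


# Made-up sample rows; a real check would read the catalog CSV from disk.
sample = """variable_id,member_id,chunk_freq
high_cld_amt,,P1Y
high_cld_amt,n/a,P1Y
high_cld_amt,na,P1Y
"""
print(count_placeholder_values(sample, "member_id"))
```

Running the same count on the working and the failing catalog would confirm or rule out the "n/a" versus missing explanation.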