MetOffice / CSET

Toolkit for evaluation and investigation of numerical models for weather and climate applications.
https://metoffice.github.io/CSET/
Apache License 2.0

Handling multiple sets of data in cylc top level #322

Closed jfrost-mo closed 5 months ago

jfrost-mo commented 9 months ago

This might be model and observations, two different models, or even more complicated combinations: basically anything that could be needed for comparing two datasets.

We need to figure out how to pass this data around, and feed it into operators.

jfrost-mo commented 8 months ago

We will need to loop over some things in cylc land.

The top level looping will be over input data retrieved in one go.

jfrost-mo commented 8 months ago

Multi-model Processing

jfrost-mo commented 7 months ago

To be able to find the partially processed files, we might want to use SHA(recipe.yaml) rather than a UUID. This would be the recipe after any templating.

This would allow us to easily re-find the output folder for a recipe without needing a centralised database. We do need to check whether any distinct recipes could be unintentionally merged, though if they are exactly the same that is probably not a problem.

The meta.json file will probably be useful for synchronising metadata between steps.
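
A minimal sketch of that naming scheme, assuming the templated recipe is available as a parsed dictionary (the function name and directory layout here are illustrative, not the actual CSET API):

```python
import hashlib
import json
from pathlib import Path


def recipe_output_dir(base_dir: Path, templated_recipe: dict) -> Path:
    """Derive a deterministic output directory from a templated recipe.

    Hashing a canonical serialisation of the recipe means the same recipe
    always maps to the same directory, with no central database needed.
    """
    # Sort keys so that dictionary ordering cannot change the hash.
    serialised = json.dumps(templated_recipe, sort_keys=True)
    digest = hashlib.sha256(serialised.encode("utf-8")).hexdigest()
    return base_dir / digest[:16]


# e.g. recipe_output_dir(Path("output"), {"title": "Mean temperature", "steps": []})
# always returns the same path for the same templated recipe.
```

Using a canonical serialisation (sorted keys) keeps the hash stable regardless of key order, so any step that has the templated recipe can independently compute the same directory.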

jfrost-mo commented 7 months ago

- parallel_recipes
- multi-model
- 3_hourly

jfrost-mo commented 7 months ago

A second section will be needed in the recipe files for the collation steps. The initial processing steps will write out to an intermediate directory inside the output directory (or maybe separate from it, but with a derived name). The collation step will read all the files in the intermediate directory and make the final output. This will need the output directory to have a name deterministically derived from the recipe. (Hash the serialisation of the templated recipe?)

A second command to complement cset bake (or maybe an option to it) will be needed for running the collation step.

Cycling will be done in the include file to start with, on validity time.
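
For concreteness, a sketch of what such a two-section recipe could look like, shown as the Python dictionary a YAML loader would produce. The "steps"/"post-steps" key names follow the naming used later in this thread, and the operator names are purely illustrative:

```python
# Illustrative recipe shape with separate per-cycle processing and final
# collation sections, written as the dict a YAML loader would produce.
# Key names follow the "steps" / "post-steps" naming used later in this
# thread; operator names are made up for the example.
recipe = {
    "title": "Mean surface air temperature",
    # Run once per cycle of input data; output goes to an intermediate
    # directory inside the recipe's output directory.
    "steps": [
        {"operator": "read.read_cubes"},
        {"operator": "filters.filter_cubes"},
        {"operator": "write.write_cube_to_nc"},
    ],
    # Run once at the end, reading everything written to the intermediate
    # directory and producing the final output.
    "post-steps": [
        {"operator": "read.read_cubes"},
        {"operator": "collapse.collapse"},
        {"operator": "plot.spatial_contour_plot"},
    ],
}
```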

jfrost-mo commented 7 months ago

Regarding a separate working directory, or having it inside the output dir:

Advantages to merging

Disadvantages to merging

Other points

Conclusion

Based on the above points I think we want to stick with the single output directory, though with an explicit "intermediate" directory within it. This would retain the ease of cleanup and separation where useful, but also allow for direct IO on output files where useful. Care will have to be taken to ensure recipes don't conflict when outputting, however.

jfrost-mo commented 7 months ago

Regarding the output directory name, I don't think hashing the entire recipe will work, as we want the output of the same recipe at different validity times to end up in the same output directory. Now we have the capability to template inside the title, I think the title would be the best thing to differentiate on. This will also ensure every plot has a unique title, providing UX benefits.

We still need to consider that the templating happens inside cset bake, so we may need to create the output directory there.

jfrost-mo commented 7 months ago

Upon experimenting with adding the output creation to cset bake, it became clear that we need the output folder in a few different places. As such I've now added another command to get the recipe ID (derived from the recipe title). This means the recipe ID can be retrieved, and then existing commands can be used with it as a path.
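
A hypothetical sketch of how a recipe ID could be derived from the title; the exact rules used by the new command are not spelled out here, so the slugification below is an assumption:

```python
import re


def recipe_id_from_title(title: str) -> str:
    """Derive a filesystem-safe recipe ID from the (templated) recipe title.

    Hypothetical slugification: lower-case, with runs of characters that are
    awkward in paths collapsed into single underscores.
    """
    slug = re.sub(r"[^a-z0-9]+", "_", title.lower())
    return slug.strip("_")


# e.g. recipe_id_from_title("Mean surface air temperature")
# -> "mean_surface_air_temperature"
```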

jfrost-mo commented 7 months ago

Todo

Should also do

jfrost-mo commented 6 months ago

Other idea: extend bake so you can specify which key it grabs the steps from, then just run bake specifying the input and output directories as the same. Not doing this, as I think a separate command is probably clearer.

jfrost-mo commented 6 months ago

cset collate (name subject to change) is now implemented and can run the steps defined in the post-steps (also subject to change) key in the recipe. The next step is getting the cycling working.

jfrost-mo commented 6 months ago

To make cycling on validity time convenient, the CYLC_TASK_CYCLE_POINT environment variable will be automatically interpreted as a VALIDITY_TIME argument by the wrappers that run cset bake and collate.
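
A minimal sketch of what such a wrapper could do, assuming a hypothetical CLI option for passing recipe variables (the real flag name may differ):

```python
import os
import subprocess

# Cylc exports the current cycle point to every task; forward it to the
# recipe as the VALIDITY_TIME variable so recipes can template on it.
cycle_point = os.environ["CYLC_TASK_CYCLE_POINT"]

subprocess.run(
    [
        "cset",
        "bake",
        # Hypothetical option name for passing a recipe variable; the real
        # CLI flag may differ.
        "--recipe-variable",
        f"VALIDITY_TIME={cycle_point}",
    ],
    check=True,
)
```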

jfrost-mo commented 6 months ago

Progress!

Cylc graph of workflow

jfrost-mo commented 6 months ago

Rather than having a separate command, we could have cset bake run the post-steps. I was wondering what to do with the default case, however, and I think it would be good if it ran both the steps and the post-steps. This way you could run a whole recipe with cset bake, but would lose out on the cycling.

jfrost-mo commented 6 months ago

Housekeeping_full should not fail when the raw data has already been cleaned up.

jfrost-mo commented 6 months ago

Currently, cycling works if the final cycle point is an integer multiple of the collate step frequency.

If it isn't, and you have a long run-ahead limit, then the final cycle point runs an unconstrained copy of the collate step, which will fail because it runs before the conda environment has been set up or the data fetched.

If you have a short run-ahead limit, then the workflow mysteriously stalls after running the processing tasks in the final cycle point, and never runs the collate step or any later finishing steps.

Some discussion on fixing it is here: https://web.yammer.com/main/org/metoffice.gov.uk/threads/eyJfdHlwZSI6IlRocmVhZCIsImlkIjoiMjY5NTY1MDYyMzQ4ODAwMCJ9

The conclusion is that there is no easy way to get around it, so we will just constrain the initial and final cycle points to a three-hourly multiple for now.
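
A sketch of the kind of alignment check this implies, assuming ISO 8601 basic-format cycle points and the three-hourly collation period; the function and the cycle point format are assumptions for illustration:

```python
from datetime import datetime

COLLATION_PERIOD_HOURS = 3  # hard coded for now, see the next comment


def cycle_points_aligned(initial: str, final: str) -> bool:
    """Check that the final cycle point is a whole number of collation
    periods after the initial one, so the last collate task is constrained."""
    fmt = "%Y%m%dT%H%MZ"  # e.g. 20240101T0000Z
    start = datetime.strptime(initial, fmt)
    end = datetime.strptime(final, fmt)
    elapsed_hours = (end - start).total_seconds() / 3600
    return elapsed_hours % COLLATION_PERIOD_HOURS == 0


# e.g. cycle_points_aligned("20240101T0000Z", "20240101T0900Z") -> True
#      cycle_points_aligned("20240101T0000Z", "20240101T0800Z") -> False
```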

jfrost-mo commented 6 months ago

For now the collation frequency is hard coded to three hours. In future we could make this configurable for the whole workflow, but making it per-recipe seems difficult with my current understanding of cylc. To be fair, this might be a fine limitation, as the collation steps should be frequency insensitive anyway; the frequency is really only a performance hack to allow most of the processing to run in parallel.

jfrost-mo commented 6 months ago

File name examples

Key

| Symbol | Meaning |
| --- | --- |
| I... | Forecast initiation time. |
| L... | Forecast lead time. |
| V... | Forecast validity time. |
| E... | Ensemble member number. |

Examples

UKV

```
             I.......... L..
prods_op_ukv_20240101_00_000.pp
```

MOGREPS-UK

```
                    I.......... E. L..
prods_op_mogreps-uk_20240308_06_05_114.pp
```

Coupled hindcast

Note that the T00Z point for a given forecast lead time is not actually contained within the file; it is instead the last point of the previous file.

```
                     V.........
enuk_amm15_amm15-SMC_2023070600_pp2_2023060100_pverb000.pp
```

Coupled 5 day ensemble

```
                     V.........     I.........      L.. E..
enuk_amm15_amm15-SMC_2023062100_pp4_2023061900_pverd048_030.pp
```

Coupled 5 day deterministic

```
                     V.........     I.........      L..
enuk_amm15_amm15-SMC_2023062200_pp3_2023061900_pverc072.pp
```

Regional climate UKCP18 RCM

Appears to be monthly output.

```
                                     V......
moose:/crum/mi-au861/ap1.pp/au861a.p12022jan.pp
```
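
As an illustration of how one of these patterns might be picked apart in code, here is a regex for the UKV example; the pattern and group names are my own, not part of CSET:

```python
import re

# Illustrative pattern for the UKV example above: initiation date, initiation
# hour, and lead time in hours. Not part of CSET; the real fetch logic may
# match file names differently.
UKV_PATTERN = re.compile(
    r"prods_op_ukv_(?P<init_date>\d{8})_(?P<init_hour>\d{2})_(?P<lead_hours>\d{3})\.pp"
)

match = UKV_PATTERN.fullmatch("prods_op_ukv_20240101_00_000.pp")
assert match is not None
print(match.group("init_date"), match.group("init_hour"), match.group("lead_hours"))
# -> 20240101 00 000
```
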
jfrost-mo commented 6 months ago

fetch_fcst should now work for specific cycle times. Only the filesystem fetch has been implemented, but the complicated logic is in a separate module that could be reused by other codes quite simply. It still needs a bit more rigorous testing, and to be documented in Sphinx, but is otherwise there.
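
A rough sketch of the kind of logic such a module needs, using the UKV-style naming from the examples above; the function, its arguments, and the one-file-per-lead-time assumption are all illustrative, not the actual implementation:

```python
from datetime import datetime, timedelta


def files_for_cycle(init_time: datetime, cycle_start: datetime,
                    cycle_length: timedelta, file_period: timedelta) -> list[str]:
    """List the UKV-style file names whose lead times fall within one cycle.

    Assumes one file per `file_period` of lead time, named with the
    initiation time and the lead time in hours, as in the examples above.
    """
    names = []
    lead = cycle_start - init_time
    end = cycle_start + cycle_length - init_time
    while lead < end:
        lead_hours = int(lead.total_seconds() // 3600)
        names.append(f"prods_op_ukv_{init_time:%Y%m%d_%H}_{lead_hours:03d}.pp")
        lead += file_period
    return names


# e.g. files_for_cycle(datetime(2024, 1, 1, 0), datetime(2024, 1, 1, 3),
#                      timedelta(hours=3), timedelta(hours=1))
# -> ["prods_op_ukv_20240101_00_003.pp", "prods_op_ukv_20240101_00_004.pp",
#     "prods_op_ukv_20240101_00_005.pp"]
```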

jfrost-mo commented 6 months ago

There are different amounts of data in different files!!!

Specifically, the first file (for 2024-01-01T00) has three timesteps in it, rather than the usual two.

jfrost-mo commented 6 months ago

TODO: Figure out why the domain mean plots are not showing in the output, and why the spatial plots are not showing a time series.

jfrost-mo commented 6 months ago

I think the various bugs have now been fixed, and it just needs tests and documenting.

jfrost-mo commented 6 months ago

Housekeeping the intermediate data doesn't seem to work, however: the path is .../plot/... instead of .../plots/...

jfrost-mo commented 6 months ago

Regarding the cycling, the main issue is that the collation tasks need to be guaranteed to run after all of the pre-processing tasks. If we could treat the last cycle as special, and hold it until all previous cycles had completed, that would solve our issue.

Looks like this should be possible: https://web.yammer.com/main/threads/eyJfdHlwZSI6IlRocmVhZCIsImlkIjoiMjcxNDY0NDQ2MTQ5NDI3MiJ9

jfrost-mo commented 6 months ago

When we get to sub-km data there are some fields that are hourly and some that are five-minutely. These are not necessarily in different files. How do we set the combination of CSET_CYCLE_PERIOD and CSET_TIMES_PER_FILE to work for both?

E.g. 5 minute precip data, but 15 minute temperature data. (Probably research mode.)
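
For concreteness, a toy calculation showing how a single times-per-file setting cannot fit both frequencies; the three-hour cycle period is an assumed value, and the variable names only mirror the environment variables mentioned above:

```python
from datetime import timedelta

# Toy numbers based on the example above: a shared cycle period with fields
# output at different frequencies gives different numbers of times per file.
CSET_CYCLE_PERIOD = timedelta(hours=3)  # assumed value for illustration
field_frequencies = {
    "precipitation": timedelta(minutes=5),
    "temperature": timedelta(minutes=15),
}

for field, freq in field_frequencies.items():
    times_per_cycle = CSET_CYCLE_PERIOD // freq
    print(f"{field}: {times_per_cycle} times per {CSET_CYCLE_PERIOD} cycle")
# precipitation: 36 times per 3:00:00 cycle
# temperature: 12 times per 3:00:00 cycle
```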