ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Write settings.yml in original order #2339


enekomartinmartinez commented 4 months ago

This is not a bug or a real feature but a minimal change for convenience.

As a user, I usually need to go to the settings.yml to make small changes and relaunch a diagnostic.

Currently, this file is written here: https://github.com/ESMValGroup/ESMValCore/blob/bb7866e6a136f6e011bc6419c4e3ed9eec25721e/esmvalcore/_task.py#L425

which uses PyYAML's default behaviour of sorting the entries alphabetically, and also mixes user parameters with ESMValTool parameters (like workdir, plotdir, output_file_type, ...).

As a user, I would prefer the file to be written in the original order. This would make it easier to edit the settings and compare them with the full recipe, especially when the arguments have been deliberately ordered.

This should be easy to do with the argument sort_keys=False, as is already done when saving the full recipe.
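For reference, a minimal sketch of what the change amounts to (the settings dict here is illustrative, not the actual contents of _task.py):

```python
import yaml

# Settings in the order they appear in the recipe (illustrative data).
settings = {
    "script": "example_diagnostic.py",
    "custom_threshold": 0.5,
    "workdir": "/path/to/work",
    "plotdir": "/path/to/plots",
}

# sort_keys=False makes yaml.safe_dump preserve the dict's insertion
# order instead of sorting the top-level keys alphabetically.
text = yaml.safe_dump(settings, sort_keys=False)
print(text)
```

Since Python dicts preserve insertion order, whatever order the entries were collected in from the recipe survives into the written settings.yml.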

It is a minor change, but I could make it if you agree. I know this is more a preference than a bug or a feature. However, after discussing it with some colleagues, they also think it would be better to keep the same order from the recipe in the settings.yml file.

bouweandela commented 4 months ago

It should be fine to change that.

Are you aware of the --resume-from command-line option or the esmvalcore.config.CFG['resume_from'] option though? That seems a much more user-friendly way of re-running a diagnostic. It can be combined with the --diagnostics / esmvalcore.config.CFG['diagnostics'] option to run only specific diagnostic scripts, should that be needed. Maybe it would be nicer to make the check that the recipe has not changed since the previous run smarter, so it allows for different settings in the diagnostic script sections (only preprocessor data is re-used anyway, so this should be quite safe). The code for that check lives here: https://github.com/ESMValGroup/ESMValCore/blob/c9a59821b03481fe29ac2e0beb797f1300a938b6/esmvalcore/_main.py#L66-L72
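The "smarter" comparison suggested above could, for instance, ignore the scripts sections when deciding whether two recipes match. A hypothetical sketch (the function names and recipe structure are assumptions for illustration, not ESMValCore code):

```python
import copy


def strip_scripts(recipe: dict) -> dict:
    """Return a copy of a parsed recipe with the 'scripts' section of
    every diagnostic removed, so script settings do not affect the
    has-the-recipe-changed comparison."""
    stripped = copy.deepcopy(recipe)
    for diagnostic in stripped.get("diagnostics", {}).values():
        diagnostic.pop("scripts", None)
    return stripped


def recipes_match_except_scripts(old: dict, new: dict) -> bool:
    # Only preprocessor output is re-used on resume, so tolerating
    # different diagnostic script settings should be safe.
    return strip_scripts(old) == strip_scripts(new)
```

With such a check, a recipe whose only change is a tweaked script parameter would still be accepted by --resume-from, while any change to variables or preprocessors would not.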

enekomartinmartinez commented 4 months ago

Great!

Yes, we are aware of that! I usually modify the settings because I have several scripts chained together, some of which may take a long time. Something like:

diagnostics:
  diagnostic_name:
    variables:
      [...]
    scripts:
      script_1: [...]
      script_2: [...]
      script_3: [...]

If script_1 takes a long time, I prefer to adjust the settings.yml of script_3 (for example, changing some parameters to get a nicer plot). This way, I save computation time (I work with km-scale data 😅). That's why I prefer to work this way.

Something that may happen is that script_2 fails (due to an error in the script); you can easily correct and rerun it, but then you must create the run folder and settings.yml for script_3 yourself. So making --resume-from more flexible, allowing the user to indicate the script level within a diagnostic and thus creating the specific settings.yml for that script, would be very nice. I cannot work on that right now, but if you think it could be an interesting feature, maybe we can open another issue in case anyone else could pick it up.

bouweandela commented 4 months ago

I'm not sure if making resume-from work with diagnostic scripts is desirable. In general, we would like to make results reproducible and people tend to tinker a lot with those scripts, so if you get your run to go through like that, there is no guarantee that it will work if you would run things from scratch.

> If script_1 takes a long time, I prefer to adjust the settings.yml of script_3 (for example, changing some parameters to get a nicer plot). This way, I save computation time (I work with km-scale data 😅). That's why I prefer to work this way.

Out of curiosity: typically, the idea is that data size is already greatly reduced by using the preprocessor functions, so diagnostic scripts are mostly there for the final part of the analysis and for plotting. Are you missing certain preprocessor functions that make this impossible?

enekomartinmartinez commented 4 months ago

> I'm not sure if making resume-from work with diagnostic scripts is desirable. In general, we would like to make results reproducible and people tend to tinker a lot with those scripts, so if you get your run to go through like that, there is no guarantee that it will work if you would run things from scratch.

I'm okay with that, then.

> Out of curiosity: typically, the idea is that data size is already greatly reduced by using the preprocessor functions, so diagnostic scripts are mostly there for the final part of the analysis and for plotting. Are you missing certain preprocessor functions that make this impossible?

No, the current preprocessors are fine for my analysis. However, sometimes I have to keep 3D data and cannot reduce the dimensionality much, because I need to apply non-linear operations to the data; the spatial/volume averages have to be computed after the diagnostic that performs those operations.