ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

Run preprocessor without writing out a file #325

Open bascrezee opened 5 years ago

bascrezee commented 5 years ago

When working on https://github.com/ESMValGroup/ESMValTool-private/issues/208 with moderately high resolution data (6-hourly) over a long time period (40y), it takes a couple of minutes for the empty (!) preprocessor to write a file to disk (~180 GB). Is there an option to switch this off, i.e. to give the diagnostic direct access to the input files and only write the final results to disk? Several processing steps are done within the diagnostic (using the preprocessor functions) by design.

valeriupredoi commented 5 years ago

@bascrezee yes, not writing intermediary files to disk is exemplified here https://github.com/ESMValGroup/ESMValCore/pull/307 - but I don't understand your issue: if it's an empty (default) preprocessor, then the preprocessed files still need to be written to disk for the diagnostic (note that the empty, as you put it, default preprocessor actually does a bunch of things: time extraction, CMOR checks and fixes, so those files are needed nonetheless). If it's a more complex preprocessor, then you should have save_intermediary_cubes: false in your config, but otherwise you can't really avoid writing the final output of the preprocessor to disk, for the reasons I explained in the previous sentence :beer:
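For reference, a minimal sketch of the setting mentioned above; only the save_intermediary_cubes key is taken from the comment, the file name and surrounding context are assumptions:

```yaml
# config-user.yml (assumed location of this setting)
# Skip writing intermediary cubes to disk between preprocessor steps.
# The final preprocessor output is still written, since the diagnostic
# (possibly non-Python) reads it from disk.
save_intermediary_cubes: false
```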

EDIT: unless you pass the preprocessed files straight to the diagnostic in memory as cubes and only save results to disk after the diagnostic. This is possible only in Python (iris cubes), and it is something we may want to have, but I am afraid it may need quite a bit of work given the current, rather bespoke, infrastructure... @bouweandela ?

bascrezee commented 5 years ago

> @bascrezee yes, not writing to disk is exemplified here #307

This is cool, I hadn't seen it yet. It will help me a lot when testing https://github.com/ESMValGroup/ESMValTool/pull/1370 on CMIP6 data (since not every xxFrac is available for every model...).

> EDIT: unless you pass the preprocessed files straight to the diagnostic in memory as cubes and only save results to disk after the diagnostic. This is possible only in Python (iris cubes), and it is something we may want to have, but I am afraid it may need quite a bit of work given the current, rather bespoke, infrastructure... @bouweandela ?

Yes, that is exactly what I was thinking about. When you have a diagnostic where it is not possible to significantly reduce the data size within the preprocessor, this is exactly what one wants. This is the case for ESMValGroup/ESMValTool-private#208, although it would still be possible to restructure the code. That would just mean the recipe grows pretty large, with one preprocessor defined per extreme event.
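To illustrate the recipe growth being described, one preprocessor block per extreme event would look roughly like the sketch below (the preprocessor names and the extract_season settings are purely hypothetical, chosen only to show the repetition):

```yaml
preprocessors:
  # One near-identical block per extreme event; with many events
  # the recipe quickly becomes long and repetitive.
  heatwave_preproc:
    extract_season:
      season: JJA
  frost_days_preproc:
    extract_season:
      season: DJF
  # ... one more block for every additional extreme event
```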

Still, to me it seems a bit strange that, for OBS, one writes a specific reformatting script with the aim of having the data easily read into ESMValTool, yet even if one explicitly specifies no preprocessor, the data is still written to disk again, significantly slowing down performance. I think not writing to disk after the preprocessor will be a prerequisite for working with larger datasets (see e.g. https://github.com/ESMValGroup/ESMValTool/issues/1109).

valeriupredoi commented 5 years ago

> @bascrezee yes, not writing to disk is exemplified here #307

> This is cool, I hadn't seen it yet. It will help me a lot when testing ESMValGroup/ESMValTool#1370 on CMIP6 data (since not every xxFrac is available for every model...).

cool - maybe you can give it a test and provide some feedback (sorry, had to enlist you since you complimented it :grin:)

> EDIT: unless you pass the preprocessed files straight to the diagnostic in memory as cubes and only save results to disk after the diagnostic. This is possible only in Python (iris cubes), and it is something we may want to have, but I am afraid it may need quite a bit of work given the current, rather bespoke, infrastructure... @bouweandela ?

> Yes, that is exactly what I was thinking about. When you have a diagnostic where it is not possible to significantly reduce the data size within the preprocessor, this is exactly what one wants. This is the case for ESMValGroup/ESMValTool-private#208, although it would still be possible to restructure the code. That would just mean the recipe grows pretty large, with one preprocessor defined per extreme event.

> Still, to me it seems a bit strange that, for OBS, one writes a specific reformatting script with the aim of having the data easily read into ESMValTool.

well, this is for portability and reuse purposes - so one can just reuse that CMORized data again without re-CMORizing it

> However, even if one explicitly specifies no preprocessor, the data is still written to disk again, significantly slowing down performance. I think not writing to disk after the preprocessor will be a prerequisite for working with larger datasets (see e.g. ESMValGroup/ESMValTool#1109).

yeps, totally agree, lemme ping @bouweandela on this one and see what we can do :beer:

BenMGeo commented 5 years ago

Self-assigned to check this in future :)

bouweandela commented 5 years ago

Hi Bas,

The reasons for writing to disk are pretty much what V explained in his initial comment https://github.com/ESMValGroup/ESMValCore/issues/325#issuecomment-544553458: the ability to write diagnostics in languages other than Python, and the fact that there is a big difference between no preprocessing and the default preprocessor. I should probably add a note about this to the documentation sometime soon. For now, you can see the default settings here: https://github.com/ESMValGroup/ESMValCore/blob/30de8d6a99c96e485781153b32ba7c389a6bfdea/esmvalcore/_recipe.py#L277-L359. You can disable these steps by setting them to false in the recipe, e.g. extract_time: false, but at the moment it is not possible to skip saving the output.
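A sketch of what disabling a default step in a recipe might look like; only extract_time: false is taken from the comment above, the preprocessor name and any other step names are assumptions:

```yaml
preprocessors:
  pass_through:
    # Disable a default preprocessor step by setting it to false.
    # The preprocessor output file is still written to disk, since
    # skipping the save step is not currently possible.
    extract_time: false
```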

> That would just mean the recipe grows pretty large, with one preprocessor defined per extreme event.

This is fine; if we see that many recipes grow very long and difficult to read, we can think about improving the recipe format so things can be written down more compactly. However, to be able to do that, we first need example use cases.