bascrezee opened this issue 5 years ago
@bascrezee yes, not writing to disk is exemplified here https://github.com/ESMValGroup/ESMValCore/pull/307 - but I don't understand your issue: if it's an empty (`default`) preprocessor, then the preprocessed files still need to be written to disk for the diagnostic (note that the "empty" preprocessor, as you put it, is actually called `default` and does a bunch of things: time extraction, CMOR checks and fixes, so those files are needed nonetheless). If it's a more complex preprocessor then you must have `save_intermediary_cubes: false` in config, but otherwise you can't really avoid writing the final output of the preprocessor to disk, for the reasons I explained in the previous sentence :beer:
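For reference, a minimal sketch of where that option would sit in the user configuration file; the option name is taken verbatim from the comment above, and the surrounding keys are only illustrative:

```yaml
# config-user.yml (sketch only; keys other than the option discussed above are illustrative)
output_dir: ~/esmvaltool_output
# write only the final preprocessor output, not the cubes from every intermediate step
save_intermediary_cubes: false
```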
EDIT: unless you pass the preprocessed files straight to the diagnostic in memory as cubes and save them to disk only after the diagnostic. This is something possible only in Python (iris cubes) that we may want to have, but I am afraid it may need quite a bit of work given the current, rather bespoke, infrastructure... @bouweandela ?
> @bascrezee yes, not writing to disk is exemplified here #307
This is cool, didn't see it yet. It will help me a lot when testing https://github.com/ESMValGroup/ESMValTool/pull/1370 on CMIP6 data (since not every xxFrac is available for every model...).
> EDIT: unless you pass the preprocessed files straight to the diagnostic in memory as cubes and save them to disk only after the diagnostic. This is something possible only in Python (iris cubes) that we may want to have, but I am afraid it may need quite a bit of work given the current, rather bespoke, infrastructure... @bouweandela ?
Yes, that is exactly what I was thinking about. When you have a diagnostic where it is not possible to significantly reduce the data size within the preprocessor, this is exactly what one wants. This is the case for ESMValGroup/ESMValTool-private#208, although it would still be possible to restructure the code. This would only mean that the recipe grows pretty large, with one preprocessor defined per extreme event.
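To illustrate the point about the recipe growing, here is a rough sketch of what a preprocessor-per-extreme-event recipe section could look like; the preprocessor names and the steps inside them are hypothetical examples, not taken from any actual recipe:

```yaml
# sketch only: hypothetical preprocessor-per-extreme-event layout
preprocessors:
  heatwave_days:
    extract_season:
      season: JJA
  frost_days:
    extract_season:
      season: DJF
  # ... and one more block like this for every extreme event,
  #     which is how the recipe grows large
```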
Still, to me, it seems a bit strange that, for OBS, one writes a specific reformatting script with the aim of having the data easily read into ESMValTool. However, even if one then explicitly specifies no preprocessor, the data is still written to disk again, thereby significantly slowing down performance. I think not writing to disk after the preprocessor will be a prerequisite when working with larger datasets (see e.g. https://github.com/ESMValGroup/ESMValTool/issues/1109).
> > @bascrezee yes, not writing to disk is exemplified here #307
>
> This is cool, didn't see it yet. It will help me a lot when testing ESMValGroup/ESMValTool#1370 on CMIP6 data (since not every xxFrac is available for every model...).
cool - maybe you can give it a test and provide some feedback (sorry, had to enlist you since you complimented it :grin: )
> EDIT: unless you pass the preprocessed files straight to the diagnostic in memory as cubes and save them to disk only after the diagnostic. This is something possible only in Python (iris cubes) that we may want to have, but I am afraid it may need quite a bit of work given the current, rather bespoke, infrastructure... @bouweandela ?
> Yes, that is exactly what I was thinking about. When you have a diagnostic where it is not possible to significantly reduce the data size within the preprocessor, this is exactly what one wants. This is the case for ESMValGroup/ESMValTool-private#208, although it would still be possible to restructure the code. This would only mean that the recipe grows pretty large, with one preprocessor defined per extreme event.
> Still, to me, it seems a bit strange that, for OBS, one writes a specific reformatting script with the aim of having the data easily read into ESMValTool.
well, this is for portability and reuse purposes - so one just reuses that cmorized data again without re-cmorizing it
> However, even if one then explicitly specifies no preprocessor, the data is still written to disk again, thereby significantly slowing down performance. I think not writing to disk after the preprocessor will be a prerequisite when working with larger datasets (see e.g. ESMValGroup/ESMValTool#1109).
yeps, totally agree, lemme ping @bouweandela on this one and see what we can do :beer:
Self-assigned to check this in future :)
Hi Bas,
The reasons for writing to disk are pretty much what V explained in his initial comment https://github.com/ESMValGroup/ESMValCore/issues/325#issuecomment-544553458: the ability to write diagnostics in languages other than Python, and the fact that there is a big difference between no preprocessing and the default preprocessor. I should probably add a note about this to the documentation sometime soon. For now you can see the default settings here: https://github.com/ESMValGroup/ESMValCore/blob/30de8d6a99c96e485781153b32ba7c389a6bfdea/esmvalcore/_recipe.py#L277-L359
You can disable these steps by setting them to `false` in the recipe, e.g. `extract_time: false`, but at the moment it is not possible to skip saving.
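As a rough sketch of what disabling such steps in a recipe could look like: `extract_time: false` is taken from the comment above, while the other step names are illustrative and should be checked against the default settings linked earlier:

```yaml
# sketch only: switching off individual default preprocessor steps in a recipe
preprocessors:
  nearly_raw:
    extract_time: false         # named explicitly in the comment above
    cmor_check_metadata: false  # illustrative; check the linked _recipe.py defaults
    cmor_check_data: false      # illustrative; check the linked _recipe.py defaults
```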
> This would only mean that the recipe grows pretty large, with one preprocessor defined per extreme event.
This is fine. If we see that many recipes grow very long and difficult to read, we can think about improving the recipe format so things can be written down more compactly. However, to be able to do that, we first need example use cases.
When working on https://github.com/ESMValGroup/ESMValTool-private/issues/208 with moderately high-resolution data (6-hourly) over a long time period (40 years), it takes a couple of minutes for the empty (!) preprocessor to write a file to disk (~180 GB). Is there an option to switch this off, i.e. give the diagnostic direct access to the files and only write the final results to disk? Several processing steps are done within the diagnostic (using the preprocessor functions) by design.