ESMValGroup / ESMValCore

ESMValCore: A community tool for pre-processing data from Earth system models in CMIP and running analysis scripts.
https://www.esmvaltool.org
Apache License 2.0

`Dataset.load()` not only loads but also writes a lot of output to disk if `save_intermediary_cubes=True` #2165

Open valeriupredoi opened 1 year ago

valeriupredoi commented 1 year ago

To my understanding, using Dataset outside its intended scope (i.e. inside a preprocessing function, calling its methods directly) should not write anything to disk unless explicitly asked to via a save method. Instead, it saves a lot of data to disk if save_intermediary_cubes is set to True. This process (identical to the actual workflow process) is slow, memory-intensive, and eats up disk space. I had save_intermediary_cubes set to True in my user config; as a result, loading a Dataset object led to the unwanted creation of an esmvaltool_output dir, with a session hash, where intermediary files are stored:

(esmvaltool) valeriu@valeriu-PORTEGE-Z30-C:~$ ls -la esmvaltool_output/session-39cda3ae-7368-4e45-99ef-d7649845bfaf_20230804_121819/preproc/CMIP6_CESM2_fx_piControl_\*_areacella_gn/
total 1496
drwxrwxr-x 2 valeriu valeriu   4096 Aug  4 13:23 .
drwxrwxr-x 3 valeriu valeriu   4096 Aug  4 13:23 ..
-rw-rw-r-- 1 valeriu valeriu 250534 Aug  4 13:23 00_load.nc
-rw-rw-r-- 1 valeriu valeriu 250534 Aug  4 13:23 01_fix_metadata.nc
-rw-rw-r-- 1 valeriu valeriu 250534 Aug  4 13:23 02_concatenate.nc
-rw-rw-r-- 1 valeriu valeriu 250534 Aug  4 13:23 03_cmor_check_metadata.nc
-rw-rw-r-- 1 valeriu valeriu 250534 Aug  4 13:23 04_fix_data.nc
-rw-rw-r-- 1 valeriu valeriu 250534 Aug  4 13:23 05_cmor_check_data.nc

First off: I don't like this, and I am fairly sure users are not aware of this behaviour. It is under-the-hood behaviour that may lead to problems (it is undocumented AFAIK, and no such output is written unless intermediary saves are turned on). Second: if this is intended behaviour, we should document it :beer: First discovered in https://github.com/ESMValGroup/ESMValCore/issues/2162

bouweandela commented 1 year ago

This is the intended behaviour. Dataset.load runs all preprocessor steps required to load the data (download, fixes, cmor check, concatenate, clip timerange, add supplementary variables) and respects the settings in its session attribute. If the user of the Dataset class does not set the session attribute, an esmvalcore.config.Session is automatically started and used.
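
To illustrate the behaviour described here, below is a toy model of a load pipeline that writes one file per step when `save_intermediary_cubes` is enabled. It is only a sketch and not ESMValCore's implementation: `toy_load` and the plain-dict `session` are hypothetical stand-ins, and the step names are taken from the directory listing earlier in this issue.

```python
from pathlib import Path
import tempfile

# Step names copied from the directory listing in this issue.
STEPS = [
    "load",
    "fix_metadata",
    "concatenate",
    "cmor_check_metadata",
    "fix_data",
    "cmor_check_data",
]


def toy_load(session, preproc_dir):
    """Toy stand-in for a load pipeline; NOT ESMValCore's implementation."""
    data = b"placeholder"  # stands in for the cube being processed
    written = []
    for index, step in enumerate(STEPS):
        # A real step would transform the data here.
        if session.get("save_intermediary_cubes", False):
            # Every step dumps its result, so disk usage grows with each step.
            path = Path(preproc_dir) / f"{index:02d}_{step}.nc"
            path.write_bytes(data)
            written.append(path.name)
    return written


with tempfile.TemporaryDirectory() as tmp:
    files_on = toy_load({"save_intermediary_cubes": True}, tmp)
with tempfile.TemporaryDirectory() as tmp:
    files_off = toy_load({"save_intermediary_cubes": False}, tmp)

print(files_on)   # one file name per step
print(files_off)  # nothing written
```

In this model the flag alone decides whether anything is written, which mirrors the point that loading respects whatever the session settings say.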

valeriupredoi commented 1 year ago

Thanks @bouweandela - I understand that, and it's a gud boi Dataset. All I need to see is something that prints to screen "hello, I am going to fill up your local disk with junk bc you are silly and have forgotten to unset save_intermediary_cubes in CFG" :grin: I'll open a PR for that :+1:
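
A minimal sketch of what such an on-screen notice could look like, using only the standard library. The helper name `warn_if_saving_intermediary_cubes` is hypothetical, and `session` is assumed to be any mapping-like object exposing a `save_intermediary_cubes` key; this is not the content of the eventual PR.

```python
import warnings


def warn_if_saving_intermediary_cubes(session):
    """Hypothetical helper: warn when intermediary saves are enabled.

    `session` is assumed to be a mapping-like object with a
    `save_intermediary_cubes` key. Returns True if a warning was issued.
    """
    if session.get("save_intermediary_cubes", False):
        warnings.warn(
            "save_intermediary_cubes is True: loading this dataset will "
            "write the output of every preprocessor step to disk.",
            UserWarning,
            stacklevel=2,  # point the warning at the caller, not this helper
        )
        return True
    return False


# Usage: capture warnings so the behaviour can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    warned = warn_if_saving_intermediary_cubes({"save_intermediary_cubes": True})
    quiet = warn_if_saving_intermediary_cubes({"save_intermediary_cubes": False})
```

A warning rather than a bare print keeps the notice filterable for users who enable intermediary saves on purpose.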