SarahAlidoost closed this 9 months ago
I'm glad you were able to find a way to fix this!
I have also found that `open_mfdataset` can be quite slow. In cases where you have big datasets and know well how to concatenate/merge the data, opening the files separately and then defining the merging operations manually can lead to better performance. The code here is fine as is; it'll mostly be replaced anyway once we move to Zampy's output.
Thanks. I added other changes, see here; can you have another look?
Kudos, no new issues were introduced!
- 0 New issues
- 0 Security Hotspots
- 100.0% Coverage on New Code
- 0.0% Duplication on New Code
close #94
In this PR:

- `chunks` is set to `"auto"` to avoid memory issues in `xr.open_mfdataset`, because by default chunks are chosen to load entire input files into memory at once (see doc; a combined sketch of the settings in this PR follows this list).
- `"S"` is replaced with `"s"` to fix `pandas: FutureWarning: 'S' is deprecated and will be removed in a future version, please use 's' instead.` This also works for pandas < 2 (see source code).
- `dask.config.set({"array.slicing.split_large_chunks": True})` is added to avoid creating large chunks, which caused `PerformanceWarning: Slicing is producing a large chunk` (see doc).
- There is still another `PerformanceWarning: Increasing number of chunks by factor`. This is due to internal re-chunking and might be solved by Zampy (see dask source code).
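Taken together, the settings above can be combined roughly as in this sketch (the file pattern and the `date_range` usage are illustrative assumptions, not code from this PR's diff):

```python
import dask
import pandas as pd
import xarray as xr

# Let dask split the large chunks produced by slicing, instead of
# emitting "PerformanceWarning: Slicing is producing a large chunk".
dask.config.set({"array.slicing.split_large_chunks": True})

# chunks="auto" keeps xr.open_mfdataset from loading each entire
# input file into memory at once (the default behavior).
ds = xr.open_mfdataset("data/*.nc", chunks="auto")  # hypothetical file pattern

# Lowercase "s" is the non-deprecated seconds alias; it also works on pandas < 2.
index = pd.date_range("2020-01-01", periods=10, freq="s")
```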