alan-turing-institute / cloudcasting

Cloud modelling (Manchester Prize)

Expose dataloader params and fix for multiple datasets #57

Closed · dfulu closed this 1 month ago

dfulu commented 1 month ago

I have started using the dataloader to train a model.

I found that without setting the chunks="auto" param in xr.open_dataset(), I get a memory error when trying to load multiple zarr files.

I think this may be an edge case of using open_dataset alongside combine_nested. An alternative would be to replace xr.open_dataset(path, engine="zarr", chunks="auto") with xr.open_zarr(path), which might be cleaner.
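For reference, a minimal sketch of the two lazy-loading options (the store paths and the time concat dim here are illustrative, not the actual dataloader code):

```python
import xarray as xr

# Illustrative zarr store paths; the real dataloader assembles these itself
paths = ["2020.zarr", "2021.zarr"]

# Option 1: chunks="auto" keeps each dataset dask-backed, so it stays lazy
datasets = [xr.open_dataset(p, engine="zarr", chunks="auto") for p in paths]

# Option 2: xr.open_zarr defaults to chunked (lazy) loading
# datasets = [xr.open_zarr(p) for p in paths]

# Combining lazy datasets stays lazy; nothing is read until compute time
ds = xr.combine_nested(datasets, concat_dim="time")
```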

I've also exposed a couple of the dataloader settings.
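For illustration, a minimal sketch of what exposing those settings could look like, assuming a torch-style DataLoader (the function and parameter defaults here are hypothetical, not the actual cloudcasting API):

```python
from torch.utils.data import DataLoader

def make_dataloader(dataset, batch_size=16, num_workers=0, shuffle=True):
    # Settings that were previously hard-coded are now caller-controlled
    return DataLoader(
        dataset,
        batch_size=batch_size,
        num_workers=num_workers,
        shuffle=shuffle,
    )
```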

codecov-commenter commented 1 month ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Please upload report for BASE (main@b652aca). Learn more about missing BASE report.

Additional details and impacted files

```diff
@@           Coverage Diff           @@
##             main      #57   +/-   ##
=======================================
  Coverage        ?   94.88%
=======================================
  Files           ?       11
  Lines           ?      469
  Branches        ?        0
=======================================
  Hits            ?      445
  Misses          ?       24
  Partials        ?        0
```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

phinate commented 1 month ago

Out of curiosity: what's the specific error here? Nice catch though, and I'm always in favour of exposing more parameters.

dfulu commented 1 month ago

I was getting a memory error with the previous settings. It seems the combination of settings we used caused the whole array to be loaded eagerly into memory.

phinate commented 1 month ago

Yes, I would concur — I've faced this exact issue (chunks=None means you don't use Dask, so you will load it eagerly as you say). Just wanted to make sure I understood!
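A quick way to check this (sketch; the path stands in for one of the zarr stores):

```python
import xarray as xr

path = "2020.zarr"  # illustrative

ds_default = xr.open_dataset(path, engine="zarr")                 # chunks=None
ds_chunked = xr.open_dataset(path, engine="zarr", chunks="auto")

print(ds_default.chunks)  # Frozen({}): no dask backing, data is pulled into memory on access
print(ds_chunked.chunks)  # per-dimension chunk sizes: dask-backed, stays lazy
```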