OpenDrift / trajan

Trajectory analysis package for simulated and observed trajectories
https://opendrift.github.io/trajan/
GNU General Public License v2.0

should we recommend / illustrate / discuss the use of .nc compression, if not in trajan core, at least in an example? #144

Open jerabaul29 opened 7 hours ago

jerabaul29 commented 7 hours ago

This issue is motivated by the following comparison: the .nc file was written by .to_netcdf(), and the .zip file is the same .nc file zipped from my file explorer:

$ ls -lrth dataset_trajectories_to_use.*
-rw-rw-r-- 1 jeanr jeanr 8,5M nov.  14 15:54 dataset_trajectories_to_use.zip
-rw-rw-r-- 1 jeanr jeanr 100M nov.  14 16:00 dataset_trajectories_to_use.nc

Clearly, the .nc file I had was effectively not compressed at all...

Should this be discussed in an example, and/or should we provide a .to_netcdf() wrapper that applies reasonable compression for the typical needs encountered in trajan? Or do you think this is outside the scope of trajan?

I guess, for example, that in our trajectory-focused case it could be realistic to compress each variable of each trajectory independently. That way we get a good compression factor, and accessing any variable for a single trajectory would still be fast (i.e. we only need to read and decompress the chunk that contains this variable for the corresponding trajectory).
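To make the "one compressed chunk per variable and trajectory" idea concrete, here is a minimal sketch. It is not existing trajan code: the helper name `per_trajectory_encoding`, the dimension names `trajectory` and `obs`, and the compression level are assumptions for illustration. It only relies on the `zlib`, `complevel` and `chunksizes` encoding keys that xarray passes on to netCDF4-capable backends.

```python
import numpy as np
import xarray as xr


def per_trajectory_encoding(ds, traj_dim="trajectory", complevel=4):
    """Build a to_netcdf() encoding dict with one compressed chunk per trajectory.

    Hypothetical helper: the name and the default dimension name are assumptions.
    """
    encoding = {}
    for name, var in ds.data_vars.items():
        # Skip variables without the trajectory dimension (including scalars)
        # and string/object variables, which this sketch does not try to compress.
        if traj_dim not in var.dims or var.dtype.kind in "OSU":
            continue
        # Chunk length 1 along the trajectory dimension, full length elsewhere.
        chunks = tuple(
            1 if dim == traj_dim else max(1, size)
            for dim, size in zip(var.dims, var.shape)
        )
        encoding[name] = {"zlib": True, "complevel": complevel, "chunksizes": chunks}
    return encoding


# Small synthetic dataset: 3 trajectories with 5 observations each.
demo = xr.Dataset(
    {
        "lon": (("trajectory", "obs"), np.zeros((3, 5))),
        "lat": (("trajectory", "obs"), np.zeros((3, 5))),
    }
)
encoding = per_trajectory_encoding(demo)
# Writing needs a netCDF4-capable backend (netCDF4 or h5netcdf), e.g.:
# demo.to_netcdf("trajectories_compressed.nc", encoding=encoding)
```

With this layout, reading one variable for one trajectory only touches a single chunk. How much the file shrinks depends on the data, so the compression level is worth tuning on a real dataset.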

gauteh commented 6 hours ago

Yes, this is an annoying thing with xarray. I like examples and maybe there is a good way to do it, and probably there is xarray documentation. I personally use this:

https://github.com/gauteh/plz/blob/15300e4237c7071a670b8b7e8e6b101b01cab9b6/plz/xr.py#L72

Then I can do:

da.to_netcdf(encoding=plz.xr.nc_cmp(da))

jerabaul29 commented 6 hours ago

nice, yes this is exactly what I had in mind regarding the way to compress :) I can add an example about this! :)

The question is: do we want this "just as an example", or as a default in trajan, given that trajan is trajectory-focused, which fits naturally (I would be surprised if anyone complained about "per trajectory" variable compression)? What do you think?

gauteh commented 5 hours ago

It is a bit tricky to make general, and it will not be the right choice if Trajan is used to generate model output, at least intermediate output. Can it be made generic?

knutfrode commented 4 hours ago

What about making a wrapper of to_netcdf() (i.e. ds.traj.to_netcdf()) that makes typical (e.g. "per trajectory") chunking/compression by default?

Btw, sometimes (e.g. for simulated datasets) selecting a subset of time could be as relevant as selecting subsets of trajectories. Thus we could have simple options to set the chunk size per dimension.
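A rough sketch of what such a wrapper with per-dimension chunk options could look like. Nothing here exists in trajan: `chunked_encoding`, `to_netcdf_compressed`, the `chunks` argument and the dimension names are hypothetical, and dimensions not listed in `chunks` are assumed to be kept whole. Because the caller chooses the chunking explicitly, the same helper can favour per-trajectory or per-time access.

```python
import numpy as np
import xarray as xr


def chunked_encoding(ds, chunks=None, complevel=4):
    """Build a compressed to_netcdf() encoding from {dimension name: chunk size}.

    Hypothetical helper; dimensions missing from `chunks` use their full length.
    """
    chunks = chunks or {}
    encoding = {}
    for name, var in ds.data_vars.items():
        # Scalars and string/object variables are left to xarray's defaults.
        if var.ndim == 0 or var.dtype.kind in "OSU":
            continue
        # Clamp each requested chunk size to the range [1, dimension length].
        sizes = tuple(
            max(1, min(size, chunks.get(dim, size)))
            for dim, size in zip(var.dims, var.shape)
        )
        encoding[name] = {"zlib": True, "complevel": complevel, "chunksizes": sizes}
    return encoding


def to_netcdf_compressed(ds, path, chunks=None, complevel=4, **kwargs):
    """Hypothetical wrapper: Dataset.to_netcdf() with the encoding built above."""
    encoding = chunked_encoding(ds, chunks=chunks, complevel=complevel)
    return ds.to_netcdf(path, encoding=encoding, **kwargs)


# Synthetic dataset: 4 trajectories on a shared axis of 100 time steps.
demo = xr.Dataset({"lon": (("trajectory", "time"), np.zeros((4, 100)))})

by_trajectory = chunked_encoding(demo, {"trajectory": 1})  # fast per-trajectory reads
by_time = chunked_encoding(demo, {"time": 24})  # fast reads of time subsets
```

Keeping this as an explicit, opt-in call rather than changing the default behaviour would leave plain `to_netcdf()` untouched for cases such as intermediate model output.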