Open coxipi opened 10 months ago
to_dataset_dict() does more than just open the files. It groups together the files associated with a given dataset based on aggregation controls specified in the JSON (by default: id, processing_level, domain, frequency). There's also a semi-custom call to open_dataset --> combine_by_coords, instead of open_mfdataset, although I don't quite remember their reasoning behind it.
For very big catalogs, I could thus see a substantial difference in speed compared to simply opening the files.
That being said, we could see if there are speedups to be accomplished.
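To make the grouping step concrete, here is a minimal sketch in plain Python of how catalog rows could be bucketed by the aggregation columns before anything is opened. It is not the actual intake-esm code: the function name group_paths, the dotted key format, and the example rows and paths are all assumptions made for illustration.

```python
from collections import defaultdict

# Default aggregation columns as listed in this thread.
AGG_COLUMNS = ("id", "processing_level", "domain", "frequency")


def group_paths(rows, columns=AGG_COLUMNS):
    """Group catalog rows by the aggregation columns (hypothetical helper).

    Each row is a dict with the aggregation columns plus a "path" entry.
    Returns {dataset_key: [paths belonging to that dataset]}.
    """
    groups = defaultdict(list)
    for row in rows:
        # The dotted key format is an assumption for this sketch.
        key = ".".join(str(row[c]) for c in columns)
        groups[key].append(row["path"])
    return dict(groups)


# Hypothetical catalog content: simA is split over two stores, simB is not.
rows = [
    {"id": "simA", "processing_level": "raw", "domain": "QC",
     "frequency": "day", "path": "/data/simA_tas.zarr"},
    {"id": "simA", "processing_level": "raw", "domain": "QC",
     "frequency": "day", "path": "/data/simA_pr.zarr"},
    {"id": "simB", "processing_level": "raw", "domain": "QC",
     "frequency": "day", "path": "/data/simB.zarr"},
]

grouped = group_paths(rows)
```

With one path per key, as in the catalog described in this issue, every group holds a single entry and there is nothing to combine afterwards.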
@coxipi Is your catalog supposed to have aggregation, or is it indeed just a list of independent datasets?
The aggregation can often be sped up by passing this to to_dataset_dict:
xarray_combine_by_coords_kwargs={'data_vars': 'minimal', 'coords': 'minimal', 'compat': 'override'}
assuming all the elements to be aggregated are well behaved (no overlap between files, all variables of the same name have the same dimensions and the exact same coordinates on the non-appended dims, etc).
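As a rough illustration of what those options do, the sketch below calls xarray's combine_by_coords directly on two small in-memory datasets that are split along time. The variable name tas, the coordinate values, and the data are made up for the example; only the three keyword arguments come from the suggestion above.

```python
import numpy as np
import xarray as xr

# Two well-behaved pieces: same variable, same "lat" coordinate,
# non-overlapping "time" ranges. All values are made up.
lat = [10.0, 20.0]
ds1 = xr.Dataset(
    {"tas": (("time", "lat"), np.zeros((2, 2)))},
    coords={"time": [0, 1], "lat": lat},
)
ds2 = xr.Dataset(
    {"tas": (("time", "lat"), np.ones((2, 2)))},
    coords={"time": [2, 3], "lat": lat},
)

# Same options as suggested for xarray_combine_by_coords_kwargs.
# "minimal" only concatenates variables that already have the concat dim,
# and "override" skips the equality checks on the other variables.
fast_kwargs = {"data_vars": "minimal", "coords": "minimal", "compat": "override"}

combined = xr.combine_by_coords([ds1, ds2], **fast_kwargs)
```

Because compat="override" takes the non-concatenated variables from the first dataset without comparing them, any mismatch between files goes undetected, which is why the well-behaved assumption matters.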
Not sure what you mean by "independent datasets". Each key in the dataset dict represents a different simulation (each with its own single path to a zarr) as created in previous steps of the xscen workflow.
I meant that they are not meant to be unified into a single dataset the way open_mfdataset would.
In that case, I'm not sure why to_dataset_dict would be dramatically slower than your function...
There's also a semi-custom call to open_dataset --> combine_by_coords, instead of open_mfdataset, although I don't quite remember their reasoning behind it.
@RondeauG, in to_dataset_dict the aggregation is entirely driven by the catalog columns and configuration. In open_mfdataset, the aggregation is guessed by xarray by analyzing the coordinates.
Note: if the path column contains a *, open_mfdataset will be used, so one can combine both methods.
Setup Information
Context
I store my files on "jarre", which is considered a slow disk AFAIK.
Sometimes, pcat.search(...).to_dataset_dict() will take forever to access my files (bad behaviour), while this homemade function has a speed similar to the good, expected behaviour of pcat.search(...).to_dataset_dict(). I can't tell what conditions on the server could be related to this problem. The problem sometimes appears, stays for a bit, and then stops.
Is this issue known?