pcat.search(...).to_dataset_dict() sometimes slower than it should

Ouranosinc / xscen

A climate change scenario-building analysis framework.

https://xscen.readthedocs.io/

Apache License 2.0


pcat.search(...).to_dataset_dict() sometimes slower than it should #253

Open coxipi opened 10 months ago

coxipi commented 10 months ago

Setup Information

xscen version: 0.6.12-beta
Python version: 3.11.4
Operating System: CentOS 7 (Doris)

Context

I store my files on "jarre", which is considered a slow disk AFAIK.

Sometimes, pcat.search(...).to_dataset_dict() will take forever to access my files (bad behaviour), while this homemade function:

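import xarray as xr

def my_search(kwargs):
    # Open each matching zarr store directly, skipping to_dataset_dict's aggregation.
    paths = list(pcat.search(**kwargs).df.path)
    return {p: xr.open_zarr(p) for p in paths}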

runs about as fast as pcat.search(...).to_dataset_dict() does when it behaves as expected.

I can't tell what conditions on the server could be related to this problem. The slowdown appears intermittently, lasts for a while, and then goes away.
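One way to document when the slowdown occurs would be to time both approaches back to back. The helper below is only a sketch: `time_call` and `fake_open` are hypothetical names, and `fake_open` stands in for the real calls so the snippet runs without a catalog.

```python
import time

def time_call(func, *args, repeats=3, **kwargs):
    """Call func several times and return the wall-clock duration of each call."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations

def fake_open(delay):
    # Stand-in for a slow file access; replace with the real call when testing.
    time.sleep(delay)

timings = time_call(fake_open, 0.01, repeats=2)
print(timings)

# With a real catalog, the same helper could wrap both approaches, e.g.:
#   time_call(my_search, kwargs)
#   time_call(lambda: pcat.search(**kwargs).to_dataset_dict())
```

Logging these timings alongside the time of day might help correlate the slow episodes with load on the disk.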

Is this issue known?

RondeauG commented 10 months ago

to_dataset_dict() does more than just open the files. It groups the files associated with a given dataset according to the aggregation controls specified in the JSON (by default: id, processing_level, domain, frequency). There's also a semi-custom call to open_dataset --> combine_by_coords, instead of open_mfdataset, although I don't quite remember their reasoning behind it.
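The grouping step can be pictured with a pure-Python sketch. This is not the actual intake-esm or xscen code; the rows, paths, and the `group_paths` helper are hypothetical, and only the four default grouping columns come from the description above.

```python
from collections import defaultdict

# Default aggregation columns mentioned above.
GROUPBY = ("id", "processing_level", "domain", "frequency")

# Hypothetical catalog rows; a real catalog has more columns.
rows = [
    {"id": "sim_A", "processing_level": "raw", "domain": "QC",
     "frequency": "day", "path": "/hypothetical/sim_A_part1.zarr"},
    {"id": "sim_A", "processing_level": "raw", "domain": "QC",
     "frequency": "day", "path": "/hypothetical/sim_A_part2.zarr"},
    {"id": "sim_B", "processing_level": "raw", "domain": "QC",
     "frequency": "day", "path": "/hypothetical/sim_B.zarr"},
]

def group_paths(rows, keys=GROUPBY):
    """Collect the paths that share the same values in the grouping columns."""
    groups = defaultdict(list)
    for row in rows:
        key = ".".join(row[k] for k in keys)
        groups[key].append(row["path"])
    return dict(groups)

groups = group_paths(rows)
for key, paths in groups.items():
    print(key, paths)
```

Each group with more than one path would then need to be opened and combined, which is where the extra cost compared to opening files one by one can come from.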

For very big catalogs, I could thus see a substantial difference in speed compared to simply opening the files.

That being said, we could see if there are speedups to be accomplished.

aulemahal commented 10 months ago

@coxipi Is your catalog supposed to have aggregation, or is it indeed just a list of independent datasets?

The aggregation can often be sped up by passing this to to_dataset_dict:

xarray_combine_by_coords_kwargs={'data_vars': 'minimal', 'coords': 'minimal', 'compat': 'override'}

assuming all the elements to be aggregated are well-behaved (no overlap between files, all variables of the same name have the same dimensions and the exact same coordinates on the non-appended dims, etc.).
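To see what these options do on their own, here is a small sketch that calls xarray's combine_by_coords directly on two in-memory datasets. The `make_chunk` helper and the toy data are made up for illustration; how to_dataset_dict forwards the kwargs internally is not shown here.

```python
import numpy as np
import xarray as xr

def make_chunk(start):
    """Build a toy dataset covering three time steps, beginning at `start`."""
    return xr.Dataset(
        {"tas": (("time", "lat"), np.full((3, 2), float(start)))},
        coords={"time": np.arange(start, start + 3), "lat": [45.0, 46.0]},
    )

# Two non-overlapping pieces with identical 'lat' coordinates.
chunks = [make_chunk(0), make_chunk(3)]

# Same options as suggested above: only variables that carry the concatenated
# dimension are joined, and the remaining ones are taken from the first dataset
# without being compared.
combined = xr.combine_by_coords(
    chunks, data_vars="minimal", coords="minimal", compat="override"
)
print(combined.sizes)
```

Because 'override' skips equality checks, this is only safe when the pieces really are consistent, as noted above.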

coxipi commented 10 months ago

Not sure what you mean by "independent datasets". Each key in the dataset dict represents a different simulation (each with its own single path to a zarr) as created in previous steps of the xscen workflow.

aulemahal commented 10 months ago

I meant that they are not meant to be unified into a single dataset the way open_mfdataset would do it.

In that case, I'm not sure why to_dataset_dict would be dramatically slower than your function...

aulemahal commented 10 months ago

> There's also a semi-custom call to open_dataset --> combine_by_coords, instead of open_mfdataset, although I don't quite remember their reasoning behind it.

@RondeauG, in to_dataset_dict the aggregation is entirely driven by the catalog columns and configuration. In open_mfdataset, the aggregation is guessed by xarray by analyzing the coordinates.

Note: if the path column contains a *, open_mfdataset will be used, so one can combine both methods.
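The rule in the note can be summarized with a tiny sketch. `choose_opener` is a hypothetical helper written for illustration, not a function from intake-esm or xscen, and the paths are invented.

```python
def choose_opener(path: str) -> str:
    """Return the name of the xarray opener implied by the note above."""
    # A '*' in the path means several files are matched by a glob pattern.
    return "open_mfdataset" if "*" in path else "open_dataset"

single = choose_opener("/hypothetical/sim_A.zarr")
globbed = choose_opener("/hypothetical/sim_A_*.nc")
print(single, globbed)
```

So a catalog row with a glob pattern lets xarray infer the aggregation within that row, while the catalog columns still drive the grouping across rows.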