ecmwf / earthkit-data

A format-agnostic Python interface for geospatial data
Apache License 2.0
47 stars 9 forks source link

`from_source` is broken when readers are imported #375

Closed malmans2 closed 1 month ago

malmans2 commented 1 month ago

What happened?

I need to check in my code the object instantiated by from_source. For example, I need to do this: isinstance(ds, GRIBReader) However, when I import the reader from_source breaks.

What are the steps to reproduce the bug?

import earthkit.data
from earthkit.data.readers.grib.reader import GRIBReader

collection_id = "reanalysis-era5-single-levels"
request = {
    "variable": "2t",
    "product_type": "reanalysis",
    "date": "2012-12-01",
    "time": "12:00",
}
ds = earthkit.data.from_source("cds", collection_id, **request, format="grib") 
# TypeError: 'module' object is not callable 

Version

0.7.0

Platform (OS and architecture)

Linux eqc-quality-tools.eqc.compute.cci1.ecmwf.int 5.14.0-362.8.1.el9_3.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Nov 8 17:36:32 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Relevant log output

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[1], line 11
      4 collection_id = "reanalysis-era5-single-levels"
      5 request = {
      6     "variable": "2t",
      7     "product_type": "reanalysis",
      8     "date": "2012-12-01",
      9     "time": "12:00",
     10 }
---> 11 ds = earthkit.data.from_source("cds", collection_id, **request, format="grib")

File /data/common/miniforge3/envs/wp3/lib/python3.11/site-packages/earthkit/data/sources/__init__.py:155, in from_source(name, lazily, *args, **kwargs)
    153 while src is not prev:
    154     prev = src
--> 155     src = src.mutate()
    156 return src

File /data/common/miniforge3/envs/wp3/lib/python3.11/site-packages/earthkit/data/sources/file.py:77, in FileSource.mutate(self)
     73     return FileIndexedSource(self.path, filter=filter, merger=self.merger, **kw)
     75 # Give a chance to directories and zip files
     76 # to return a multi-source
---> 77 source = self._reader.mutate_source()
     78 if source not in (None, self):
     79     source._parent = self

File /data/common/miniforge3/envs/wp3/lib/python3.11/site-packages/earthkit/data/sources/file.py:96, in FileSource._reader(self)
     93 @property
     94 def _reader(self):
     95     if self._reader_ is None:
---> 96         self._reader_ = reader(
     97             self, self.path, content_type=self.content_type, parts=self.parts
     98         )
     99     return self._reader_

File /data/common/miniforge3/envs/wp3/lib/python3.11/site-packages/earthkit/data/readers/__init__.py:184, in reader(source, path, **kwargs)
    180     magic = f.read(n_bytes)
    182 LOG.debug("Looking for a reader for %s (%s)", path, magic)
--> 184 return _find_reader(
    185     "reader",
    186     source,
    187     path,
    188     magic=magic,
    189     **kwargs,
    190 )

File /data/common/miniforge3/envs/wp3/lib/python3.11/site-packages/earthkit/data/readers/__init__.py:130, in _find_reader(method_name, source, path_or_data, **kwargs)
    126 for deeper_check in (False, True):
    127     # We do two passes, the second one
    128     # allow the plugin to look deeper in the buffer
    129     for name, r in _readers(method_name).items():
--> 130         reader = r(source, path_or_data, deeper_check=deeper_check, **kwargs)
    131         if reader is not None:
    132             return reader.mutate()

TypeError: 'module' object is not callable

Accompanying data

No response

Organisation

B-Open / CADS-EQC

sandorkertesz commented 1 month ago

Thanks for reporting this issue. This should not crash, however, the reader is an internal object and should not be used like that, since the underlying code can be refactored any time. To overcome this I'd rather add a method to the object returned from from_source() to perform the checks you need.

malmans2 commented 1 month ago

OK, thanks!

For context, the kwargs accepted by to_xarray are not fixed (e.g., csv doesn't accept open_mfdataset kwargs but accepts pandas read_csv kwargs). Furthermore, earthkit changes the default of xarray's backend-specific kwargs (e.g., the use of squeeze=False for cfgrib is not compatible with our code).

We need a way to discriminate which kwargs we can pass to to_xarray.

sandorkertesz commented 1 month ago

OK, thanks!

For context, the kwargs accepted by to_xarray are not fixed (e.g., csv doesn't accept open_mfdataset kwargs but accepts pandas read_csv kwargs). Furthermore, earthkit changes the default of xarray's backend-specific kwargs (e.g., the use of squeeze=False for cfgrib is not compatible with our code).

We need a way to discriminate which kwargs we can pass to to_xarray.

I am not sure this comment belongs to this issue.

malmans2 commented 1 month ago

the reader is an internal object and should not be used like that, since the underlying code can be refactored any time. To overcome this I'd rather add a method to the object returned from from_source() to perform the checks you need.

I just wanted to provide more details about the use we are doing as you mentioned that we should not import the reader class and a new method will be added:

if isinstance(earthkit_ds, GRIBReader):
    xr_ds = earthkit_ds.to_xarray(xarray_open_dataset_kwargs={"squeeze": False, "chunks": {}})
elif isinstance(earthkit_ds, CSVReader):
    xr_ds = ds.to_xarray(pandas_read_csv_kwargs=...)
elif ...:
    ...
else:
    xr_ds = earthkit_ds.to_xarray(xarray_open_mfdataset_kwargs=...)