ecmwf / cfgrib

A Python interface to map GRIB files to the NetCDF Common Data Model following the CF Convention using ecCodes
Apache License 2.0

If opening with xarray open_mfdataset and parallel=True it will fail unless you have previously opened it with parallel=False #110

Open scottcha opened 4 years ago

scottcha commented 4 years ago

Minimal repro:

import xarray as xr
ds = xr.open_mfdataset('gfs.0p25.201511*00.f0*.grib2', engine='cfgrib', combine='nested', concat_dim=['step'], parallel=True, chunks=24, backend_kwargs={'filter_by_keys': {'typeOfLevel': 'surface'}, 'indexpath': ''})

Expected result: returns an xarray Dataset.

Actual result:

ECCODES ERROR   :  grib_handle_create: cannot create handle, no definitions found
ecCodes assertion failed: `h' in /home/conda/feedstock_root/build_artifacts/eccodes_1570714279314/work/src/grib_query.c:529

Note if in the same session/kernel you have previously opened with parallel=False the above will pass. The repro needs to happen in a new session. This was executed on a local dask cluster.

alexamici commented 4 years ago

I confirm this bug report with a different dataset and different error messages.

With parallel=False, open_mfdataset always works:

>>> import cfgrib
>>> import xarray as xr
>>> print(xr.__version__, cfgrib.__version__)
0.13.0 0.9.7.4.dev0
>>> ds = xr.open_mfdataset('step*.grib', engine='cfgrib', concat_dim=['step'], combine='nested', parallel=False)
>>> ds
<xarray.Dataset>
Dimensions:     (latitude: 1801, longitude: 3600, step: 3)
Coordinates:
    time        datetime64[ns] 2019-04-01
    number      int64 0
    surface     int64 0
  * latitude    (latitude) float64 90.0 89.9 89.8 89.7 ... -89.8 -89.9 -90.0
  * longitude   (longitude) float64 0.0 0.1 0.2 0.3 ... 359.6 359.7 359.8 359.9
  * step        (step) timedelta64[ns] 01:00:00 02:00:00 03:00:00
    valid_time  (step) datetime64[ns] 2019-04-01T01:00:00 ... 2019-04-01T03:00:00
Data variables:
    t2m         (step, latitude, longitude) float32 dask.array<chunksize=(1, 1801, 3600), meta=np.ndarray>
Attributes:
    GRIB_edition:            1
    GRIB_centre:             ecmf
    GRIB_centreDescription:  European Centre for Medium-Range Weather Forecasts
    GRIB_subCentre:          0
    Conventions:             CF-1.7
    institution:             European Centre for Medium-Range Weather Forecasts
    history:                 2019-11-11T19:19:05 GRIB to CDM+CF via cfgrib-0....

Restarting the kernel and running with parallel=True always crashes Python inside ecCodes, but it returns a few different error messages. I observed at least:

ECCODES ERROR   :  Unable to find boot.def. Context path=/Users/amici/.conda/envs/ECM/share/eccodes/definitions

Possible causes:
- The software is not correctly installed
- The environment variable ECCODES_DEFINITION_PATH is defined but incorrect

ecCodes assertion failed: `0' in /usr/local/miniconda/conda-bld/eccodes_1566402639979/work/src/grib_context.c:205
ECCODES ERROR   :  grib_handle_create: cannot create handle, no definitions found
ecCodes assertion failed: `h' in /usr/local/miniconda/conda-bld/eccodes_1566402639979/work/src/grib_query.c:458
ECCODES ERROR   :  grib_parser: syntax error at line 34 of /Users/amici/.conda/envs/ECM/share/eccodes/definitions/boot.def
ECCODES ERROR   :  ecCodes Version: 2.13.1

and

ECCODES ERROR   :  ecCodes Version: 2.13.1
ecCodes Version:       2.13.1
Definition files path: /Users/amici/.conda/envs/ECM/share/eccodes/definitions
ECCODES ERROR   :  grib_parser_include: Could not resolve 'ECCODES_USE_' (included in /Users/amici/.conda/envs/ECM/share/eccodes/definitions/boot.def)
ecCodes assertion failed: `0' in /usr/local/miniconda/conda-bld/eccodes_1566402639979/work/src/grib_context.c:205

It looks like a locking/threading problem, @shahramn do you have any hint?

marcowurth commented 4 years ago

Any update on this, @shahramn @alexamici, or any idea how deep the problem goes? I just updated cfgrib, eccodes, python-eccodes, dask and xarray through conda-forge and retried the minimal code above, with the same result:

>>> import cfgrib
>>> import xarray as xr
>>> import eccodes
>>> import dask
>>> print(cfgrib.__version__, xr.__version__, eccodes.__version__, dask.__version__)
0.9.8.1 0.15.1 2.17.0 2.14.0
>>> ds = xr.open_mfdataset('icon-eu-eps_europe_icosahedral_single-level_2019121918_*_t_2m.grib2',
                           engine='cfgrib', combine='nested', concat_dim=['step'], parallel=True,
                           backend_kwargs={'indexpath': ''})
ECCODES ERROR   :  grib_handle_create: cannot create handle, no definitions found
ecCodes assertion failed: `h' in /home/conda/feedstock_root/build_artifacts/eccodes_1583917083369/work/src/grib_query.c:568
Aborted (core dumped)

MatthewLennie commented 3 years ago

I am also reproducing this error, while using:

blah = dask.delayed(cfgrib.open_datasets)(file_name, backend_kwargs={'indexpath': ''}, cache=False, chunks={})
blah = client.compute(blah)

Expected: a list of xarray Datasets. Result: KilledWorker (Dask). The log files list the following:

ECCODES ERROR   :  grib_handle_create: cannot create handle, no definitions found
ecCodes assertion failed: `h' in /home/conda/feedstock_root/build_artifacts/eccodes_1593014857650/work/src/grib_query.c:572

The files open fine when run eagerly, i.e. without dask.delayed.

Any workarounds?

I tried some additional checks. Opening the files straight into memory, i.e.

blah = dask.delayed(cfgrib.open_datasets)(file, backend_kwargs={'indexpath': ''})

works. The problem seems to be specifically with opening the data as a dask array rather than loading it into memory; the parallelization itself doesn't seem to be the problem. I hope this extra information helps narrow it down.

guidocioni commented 3 years ago

I can confirm this is still here on xarray 0.16.1 and cfgrib 0.9.8.4. For now I'm using parallel=False, but it takes about 3 times longer than with parallel=True. The problem is that when opening the files for the first time with parallel=True, eccodes throws an error to cfgrib, which is then unable to write the idx files. The error you then see in Python is due to the empty idx files.

MatthewLennie commented 3 years ago

That's interesting. Do you happen to have a theory of why this error would appear in parallel but not in serial?

MatthewLennie commented 3 years ago

Could it be that eccodes isn't thread-safe somehow? When I manually open multiple files with cfgrib.open_datasets via multiple processes, I don't get the error.

I do this by adding a resource spec of 1 process per task, meaning that a single task runs per worker regardless of the number of threads.

Tentatively a workaround?
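
A related way to avoid concurrent calls into a library that may not be thread-safe is to serialize them behind a lock. This is a different technique from the Dask resource spec described above, shown only as a minimal stdlib sketch. The names `serialized`, `open_all` and `_GRIB_LOCK` are hypothetical, and `opener` stands for whatever function opens a single file. Note that the lock only protects calls routed through the wrapper; reads triggered later on lazily loaded data are not covered.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Hypothetical process-wide lock: every guarded call has to acquire it first.
_GRIB_LOCK = threading.Lock()


def serialized(func):
    """Wrap func so that only one thread at a time can be inside it."""
    def wrapper(*args, **kwargs):
        with _GRIB_LOCK:
            return func(*args, **kwargs)
    return wrapper


def open_all(paths, opener, max_workers=4):
    """Call opener(path) for every path from a thread pool, one call at a time."""
    guarded = serialized(opener)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input paths.
        return list(pool.map(guarded, paths))
```

With cfgrib this might be called as open_all(files, lambda p: cfgrib.open_datasets(p, backend_kwargs={'indexpath': ''})). Serializing the calls gives up the speed-up from parallel opening, so it is only a stopgap.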

shahramn commented 3 years ago

The ecCodes library has to be built with thread safety enabled. See https://confluence.ecmwf.int/display/UDOC/Is+ecCodes+thread-safe+-+ecCodes+FAQ

MatthewLennie commented 3 years ago

Thanks for the information. I am stumped, then. Can you think of another reason why I (and others) would see this behavior?

shahramn commented 3 years ago

Looks like the conda recipe does NOT enable the thread safety flags. I will look into this

MatthewLennie commented 3 years ago

Awesome. Thanks for looking into it. Not all heroes wear capes :)

guidocioni commented 3 years ago

Sounds like you're on the right path. A few years ago, when cfgrib was still a baby :), I was getting an error while trying to read compressed grib files as the recipe for eccodes on conda was not including the compression library because of a license issue. So in the end the problem was on the eccodes side on conda.

shahramn commented 3 years ago

I have submitted a pull-request on conda... which has now been merged

shahramn commented 3 years ago

Dear Guido, please try again and re-install ecCodes. Let me know if the issue is now fixed.

guidocioni commented 3 years ago

I've seen the update on github but cannot force an update of eccodes with the new recipe. Do I need to wait for a new version or is there a way to test this?

iainrussell commented 3 years ago

Can you try to update your conda eccodes to version "eccodes-2.18.0-hf05d9b7_0" ?

guidocioni commented 3 years ago

It still does not find it in my current channels:

(nwp-py3) g@c:~/$ conda install -c conda-forge eccodes=2.18.0=hf05d9b7_0
Collecting package metadata (current_repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.
Collecting package metadata (repodata.json): done
Solving environment: failed with initial frozen solve. Retrying with flexible solve.

PackagesNotFoundError: The following packages are not available from current channels:

  - eccodes==2.18.0=hf05d9b7_0

Current channels:

  - https://conda.anaconda.org/conda-forge/osx-64
  - https://conda.anaconda.org/conda-forge/noarch
  - https://repo.anaconda.com/pkgs/main/osx-64
  - https://repo.anaconda.com/pkgs/main/noarch
  - https://repo.anaconda.com/pkgs/r/osx-64
  - https://repo.anaconda.com/pkgs/r/noarch

iainrussell commented 3 years ago

I think you're right - it looks like it built but somehow is not available. We'll investigate

iainrussell commented 3 years ago

Hi @guidocioni , I'm not sure if you're on macos or Linux, but we've managed to update the conda version. Could you do the following: conda search eccodes -c conda-forge and if you see a 2.18.0 version with _1 at the end, install that version please. It takes the conda servers a little while to update their indexes, but it's appeared now at least on macos.
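
To check programmatically whether an installed build is recent enough, the version and build string reported by conda can be compared. The following is a minimal sketch, assuming the spec has the form version-build and that the build number is the integer after the last underscore. The function names are hypothetical, and the hash part of the build string differs between platforms, so only the version and build number are compared.

```python
def parse_conda_spec(spec):
    """Split 'version-build' (e.g. '2.18.0-hf05d9b7_0') into its parts."""
    version, _, build = spec.partition("-")
    build_hash, _, build_number = build.rpartition("_")
    return version, build_hash, int(build_number)


def is_at_least(spec, min_version, min_build_number):
    """True if spec is at or above the given version and build number."""
    version, _, build_number = parse_conda_spec(spec)

    def as_tuple(text):
        # Compare versions numerically, so that 2.9.0 < 2.18.0.
        return tuple(int(part) for part in text.split("."))

    return (as_tuple(version), build_number) >= (as_tuple(min_version), min_build_number)
```

For example, is_at_least('2.18.0-hf05d9b7_0', '2.18.0', 1) is False, because that build has build number 0 rather than 1.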

guidocioni commented 3 years ago

Yep, it always takes a little bit of time. I will test it tomorrow and let you know. In any case, you can use one of the MWEs in this thread with some downloaded data; I think you should be able to reproduce the error.

guidocioni commented 3 years ago

I can confirm this issue is resolved on eccodes 2.18.0-hc7b4307_1!

I just tried to read 6 files with parallel=False and parallel=True while taking care of removing the idx files every time and both methods worked. Before the update it used to fail with parallel=True as described in the posts before.
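
Removing the idx files between runs can be scripted. The following is a minimal sketch, assuming the index files sit next to the GRIB files and end in .idx (the default cfgrib naming, as far as I know); the helper name remove_index_files is hypothetical.

```python
from pathlib import Path


def remove_index_files(directory, pattern="*.idx", dry_run=False):
    """Delete index files in `directory` and return the paths that matched."""
    matches = sorted(Path(directory).glob(pattern))
    if not dry_run:
        for path in matches:
            path.unlink()
    return matches
```

Calling it with dry_run=True first lists what would be deleted without touching anything. Passing 'indexpath': '' in backend_kwargs, as in the examples earlier in this thread, avoids writing index files in the first place.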

Thank you all for the input :)

@alexamici I think you can close this

MatthewLennie commented 3 years ago

I can also confirm. I just ran a test using:

delays = []
for file in files:
    delays.append(dask.delayed(cfgrib.open_datasets)(file, backend_kwargs={"indexpath": ""}))

client.persist(delays)

It previously resulted in killed workers, as described. Now the issue is resolved on eccodes 2.18.0-hc7b4307_1.

Thanks for reacting to this so quickly. :)

meteoDaniel commented 1 year ago

Dear friends, I need to tell you that I never knew my issue with the latest updates of cfgrib was related to parallel=True and thread safety not being enabled at installation. It would be great to see a website listing some common pitfalls. By the way, my system ran on an old version from 2021.

dhah229 commented 10 months ago

Is there a way to enable multi-threading without conda? I've installed cfgrib using

pip install ecmwflibs eccodes cfgrib

Versions: ecmwflibs==0.5.6, eccodes==1.6.1, cfgrib==0.9.10.4, on Python 3.8.16 in a Docker image.