Is this a bug, or user error: NotImplementedError: Dataset is not picklable

NCAS-CMS / cf-python

A CF-compliant Earth Science data analysis library

http://ncas-cms.github.io/cf-python

MIT License

Is this a bug, or user error: NotImplementedError: Dataset is not picklable #735

Open bnlawrence opened 6 months ago

bnlawrence commented 6 months ago

python
Python 3.11.8 | packaged by conda-forge | (main, Feb 16 2024, 20:53:32) [GCC 12.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import cf
>>> cf.__version__
'3.16.1'

I'm attempting to use cf-python to read PP files and write netCDF. The code is:

import cf
import dask

dask.config.set(scheduler='processes',num_workers=12)

def convert(glob):
    ff = cf.read(glob)
    cf.write(ff,'all_year.nc',mode='w')

if __name__ == "__main__":
    convert('*.pp')

Platform is jasmin sci6, data is N1280 pp output.

Error log here

davidhassell commented 6 months ago

Hi Bryan,

A bit of digging suggests that this is a bug (https://github.com/pydata/xarray/issues/1464 has the details). However, the writing is locked anyway (a netCDF4-python restriction), so there shouldn't be any benefit in this case from running on 12 workers.

If you remove the dask.config.set(...) line, I suspect that it will work.

I shall make the fix, though, so that your original code doesn't fail.

davidhassell commented 6 months ago

I shall make the fix, though, so that your original code doesn't fail.

Looking into how xarray deals with this (which I haven't wholly understood yet), it's probably not the five-minute fix I dreamt of, but I'll keep at it ...
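
For readers following along, here is a minimal sketch of why this kind of error appears and of one common workaround. It uses only the standard library. It is not cf-python's or xarray's actual code: FakeDataset and ReopeningHandle are hypothetical names made up for illustration. The assumption is that a process-based scheduler has to pickle whatever it sends to its workers, and that an open dataset handle refuses to be pickled. The wrapper avoids this by pickling only the file path and reopening the file lazily on the other side.

```python
import pickle


class FakeDataset:
    """Hypothetical stand-in for an open file handle that refuses to be pickled."""

    def __init__(self, path):
        self.path = path

    def __reduce__(self):
        # Mimics the error message reported in this issue.
        raise NotImplementedError("Dataset is not picklable")


class ReopeningHandle:
    """Hypothetical wrapper that pickles only the path and reopens on demand."""

    def __init__(self, path):
        self.path = path
        self._dataset = None  # opened lazily, never pickled

    @property
    def dataset(self):
        if self._dataset is None:
            self._dataset = FakeDataset(self.path)
        return self._dataset

    def __getstate__(self):
        # Drop the live handle; keep only what is needed to reopen it.
        return {"path": self.path}

    def __setstate__(self, state):
        self.path = state["path"]
        self._dataset = None


def is_picklable(obj):
    """Return False if pickling obj raises NotImplementedError."""
    try:
        pickle.dumps(obj)
    except NotImplementedError:
        return False
    return True


handle = ReopeningHandle("all_year.nc")
handle.dataset  # force the underlying handle open
restored = pickle.loads(pickle.dumps(handle))  # succeeds despite the open handle
```

The round trip works because the live handle never enters the pickled state. The cost is that each worker process reopens the file for itself.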

bnlawrence commented 6 months ago

(Sorry, I was hoping that I would get benefit from the workers on the read, since the pp bit is slow)

davidhassell commented 6 months ago

OK - we can read PP/FF files in parallel, so if you did (ff[0] + 2).array the reads would be parallelised over Dask chunks. Writing, however, is limited to one Dask chunk at a time, and a Dask chunk equates to one 2-d UM field, so there is no benefit from parallelism in the writing case :(
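
The behaviour described above can be sketched with the standard library alone. This is an illustrative toy, not cf-python or Dask code: read_chunk, write_chunk and the chunk contents are hypothetical stand-ins. It assumes each chunk can be read independently, while every write must hold a single shared lock.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

write_lock = threading.Lock()  # stands in for the lock around file writes
active_writers = 0             # writers currently inside the locked region
max_active_writers = 0         # highest concurrency observed while writing
written = []                   # chunks "written" so far


def read_chunk(index):
    """Hypothetical read of one chunk; these calls may run concurrently."""
    return [index] * 4


def write_chunk(chunk):
    """Hypothetical write of one chunk; only one may run at a time."""
    global active_writers, max_active_writers
    with write_lock:
        active_writers += 1
        max_active_writers = max(max_active_writers, active_writers)
        written.append(chunk)
        active_writers -= 1


def process(index):
    write_chunk(read_chunk(index))


# Several workers read in parallel, but the lock serialises the writes.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(process, range(8)))
```

However many workers are used, at most one is ever inside write_chunk's locked region. Extra workers therefore help only when the work is dominated by reading or computing.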