RuntimeError: NetCDF: Not a valid ID

ArcticSnow / TopoPyScale

TopoPyScale: a Python library to perform simplistic climate downscaling at the hillslope scale

https://topopyscale.readthedocs.io

MIT License

RuntimeError: NetCDF: Not a valid ID #80

Open joelfiddes opened 1 year ago

joelfiddes commented 1 year ago

This is a strange and somewhat random error that is not always reproducible. From what I have read, and as is often the case with strange random errors, it may be related to multiple threads accessing the same file at the same time. Here is a discussion:

https://forum.access-hive.org.au/t/netcdf-not-a-valid-id-errors/389

Nice find! To summarise in this thread, it looks like a work-around in netcdf4-python to deal with netcdf-c not being thread safe was removed in 1.6.1. The solution (for now) is to [make sure your cluster only uses 1 thread per worker](https://forum.access-hive.org.au/t/netcdf-not-a-valid-id-errors/389/14).

joelfiddes commented 1 year ago

I think we only have 1 thread per worker anyway with this?

https://github.com/ArcticSnow/TopoPyScale/blob/e0a79e882904d90596cd278ba9f273a8a11dbb3b/TopoPyScale/topo_scale.py#L210

joelfiddes commented 1 year ago

I understand 1 worker = 1 core?

joelfiddes commented 1 year ago

Changed

ds_ = xr.open_mfdataset(flist, parallel=True)

to

ds_ = xr.open_mfdataset(flist, parallel=False)

and it ran fine with no errors. I don't fully understand why, so I can't confidently claim it is a fix. Will need to run it a bunch more times to see if it really is one.
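For anyone wanting to try the same change in one place, here is a minimal sketch of a wrapper that opens the files serially by default. The function name `open_climate_files` and the `opener` hook are hypothetical (they are not part of TopoPyScale or xarray); the hook only exists so the sketch can be exercised without NetCDF files on disk. This is not a confirmed fix, just the change above made explicit.

```python
def open_climate_files(flist, parallel=False, opener=None):
    """Open several NetCDF files as one dataset, serially by default.

    `opener` is a hypothetical hook for testing; when it is not given,
    the function falls back to xarray's open_mfdataset.
    """
    if opener is None:
        import xarray as xr  # imported lazily so the sketch loads without xarray
        opener = xr.open_mfdataset
    # parallel=False means the files are opened one after another
    # instead of being dispatched as parallel dask tasks.
    return opener(list(flist), parallel=parallel)
```

Calling `open_climate_files(flist)` then behaves like `xr.open_mfdataset(flist, parallel=False)`, and the parallel path can still be requested explicitly for comparison runs.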

joelfiddes commented 1 year ago

This is on a branch "slurm" where I am developing an embarrassingly parallelisable way of dealing with the time dimension, as the current method only works if the script is run on a multicore machine, NOT under a SLURM scheduler as on many HPC machines. This problem may be unique to that use case (many workers accessing the climate data NetCDF files simultaneously). But I think @ArcticSnow mentioned seeing this issue and, as the discussion above shows, it seems to happen with multi-threaded access to nc files.
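To make the idea of splitting the time dimension across SLURM workers concrete, here is a rough sketch. It is not the code on the "slurm" branch: the function names are hypothetical, and it assumes the job is submitted as a SLURM job array with indices 0 to n_chunks - 1, so that each task can read its index from the `SLURM_ARRAY_TASK_ID` environment variable.

```python
import os
from datetime import date, timedelta


def split_time_range(start, end, n_chunks):
    """Split the inclusive date range [start, end] into contiguous chunks."""
    n_days = (end - start).days + 1
    if n_chunks < 1 or n_chunks > n_days:
        raise ValueError("n_chunks must be between 1 and the number of days")
    base, extra = divmod(n_days, n_chunks)
    chunks = []
    chunk_start = start
    for i in range(n_chunks):
        # The first `extra` chunks get one additional day each.
        length = base + (1 if i < extra else 0)
        chunk_end = chunk_start + timedelta(days=length - 1)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end + timedelta(days=1)
    return chunks


def chunk_for_this_task(chunks, env=os.environ):
    """Pick the chunk for the current array task (task 0 outside SLURM)."""
    task_id = int(env.get("SLURM_ARRAY_TASK_ID", "0"))
    return chunks[task_id]
```

Each array task would then downscale only its own `(start, end)` period, so no communication between tasks is needed. The tasks still read the same climate files at the same time, which is the situation where the error was seen.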

ArcticSnow commented 1 year ago

The multiprocessing library supports both multithreading and multicore processing. One core can handle multiple threads, which is very convenient, for instance, for sending and handling the download requests (which require little computation). Maybe in the config file we should separate the two and have one n_cores and one n_threads setting to clarify things a bit.

Also, note that v0.2.2 does not parallelise in the time dimension. Parallelisation only happens in space. The time splits are run sequentially, each one starting when the previous one is done.
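As an illustration of the n_cores / n_threads distinction, here is a small sketch using the standard library only. The `config` keys and the `fetch` / `downscale` stand-ins are hypothetical and do not reflect the actual TopoPyScale config or functions; the point is only that a thread pool suits I/O-bound requests while a process pool suits CPU-bound work.

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

# Hypothetical config entries, named after the suggestion above.
config = {"n_threads": 4, "n_cores": 2}


def fetch(request):
    """Stand-in for an I/O-bound download request."""
    return f"downloaded {request}"


def downscale(point_id):
    """Stand-in for CPU-bound work on one spatial point."""
    return point_id ** 2


def download_all(requests, n_threads):
    # Threads share one process; fine for tasks that mostly wait on I/O.
    with ThreadPool(n_threads) as pool:
        return pool.map(fetch, requests)


def compute_all(point_ids, n_cores):
    # Separate processes, typically one per core, for heavy computation.
    with Pool(n_cores) as pool:
        return pool.map(downscale, point_ids)
```

With separate settings, something like `download_all(requests, config["n_threads"])` and `compute_all(points, config["n_cores"])` could be tuned independently. On platforms that spawn new processes, `compute_all` should be called from under an `if __name__ == "__main__":` guard.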

joelfiddes commented 1 year ago

of course - so actually this is a more general contribution - will write up the approach in discussions and link back here

joelfiddes commented 1 year ago

https://github.com/ArcticSnow/TopoPyScale/discussions/83#discussion-5231360

joelfiddes commented 1 year ago

Some more info that I think is related to this issue:

https://github.com/ecmwf/cfgrib/issues/110

Basically it seems safer to use parallel=False with open_mfdataset, otherwise there is a chance of conflict between threads doing "stuff" on the nc file at the same time. There used to be "lock" and "autoclose" args to the function, but no longer. Maybe these are somehow implicit in parallel=False (this is also the default setting).
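To show what a lock buys in this situation, here is a toy sketch using only the standard library. It does not reproduce how xarray or netCDF4 handle locking internally; `CountingResource` is a hypothetical stand-in for a file handle that is not thread safe, and it simply records how many threads were inside `read()` at once.

```python
import threading
import time


class CountingResource:
    """Stand-in for a file handle that must not be used concurrently."""

    def __init__(self):
        self.active = 0
        self.max_active = 0
        self._meta = threading.Lock()  # protects the counters only

    def read(self):
        with self._meta:
            self.active += 1
            self.max_active = max(self.max_active, self.active)
        time.sleep(0.01)  # pretend to do some I/O
        with self._meta:
            self.active -= 1


def read_with_lock(resource, lock):
    # Only one thread at a time may hold the lock, so reads are serialised.
    with lock:
        resource.read()


def run_readers(n_threads):
    """Start n_threads readers and return the peak number of concurrent reads."""
    resource = CountingResource()
    lock = threading.Lock()
    threads = [
        threading.Thread(target=read_with_lock, args=(resource, lock))
        for _ in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return resource.max_active
```

With the shared lock, the peak number of concurrent reads stays at 1 however many threads are started, which is the kind of guarantee a non-thread-safe library needs from its callers.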