JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

'OSError: [Errno 121]' while running ultranest #98

Closed ikhebgeenaccount closed 1 year ago

ikhebgeenaccount commented 1 year ago

Description

I am fitting a model using ultranest, using a ReactiveNestedSampler.

What I Did

This is the code I run:

self.sampler = ultranest.ReactiveNestedSampler(self.varied_params, self.likelihood, self.prior,
                                                log_dir=f'ultranest_output/{run_name}/', ndraw_max=500)
self.results = self.sampler.run(max_ncalls=100000)

I have verified that the likelihood and prior give the right values. The max_ncalls is set because I run this for several different measurements, some of which I know will not converge (or at least not in any reasonable time).

ultranest does run for a while, showing the standard output (these are the last two log messages before the crash):

Mono-modal Volume: ~exp(-5.54)   Expected Volume: exp(-3.15) Quality: ok

   negative degeneracy between h2 and cdmol_SO: rho=-0.82
tkin     :  +6.0e+01|*********  ** ******** * * **** ******** ******** ******************* *****  ***   ***** ***** ****  ******** ************************ ********************** ** ********* * *** ** *********** *** ********* | +3.0e+02
cdmol_SO :      +9.0|                                                                                                             +13.0  ******************************************* ******* ************* ************************|    +16.0
h2       :      +2.0|       *  **  ***** ****** *** * ******** ****** ***** ******** **** ********** * ********************* ***** *************** ************* *** *************  **** *** ***** ***** ***  **** ** **** ** *****|     +8.0
cdmol_SiO:      +9.0|*** ********   *************** ************** **** *** *******  **   ************ *********************** ****** ********** ***** ********* ******  ** **** ************* ********************************** *|    +16.0

Z=-9.1(0.71%) | Like=-3.96..-0.00 [-10.9493..-3.4133] | it/evals=1349/4632 eff=31.8762% N=400  

Mono-modal Volume: ~exp(-6.35) * Expected Volume: exp(-3.37) Quality: ok

   negative degeneracy between h2 and cdmol_SO: rho=-0.80
tkin     :  +6.0e+01|********** *********** * * **** ******** **************** *********** *****   *  * *** ************ ********* ************************ *** ******************  * *  ****** ***** ** ******** ** *** ********* | +3.0e+02
cdmol_SO :      +9.0|                                                                                                              +13.0  ************************************************** ************* ********************  **|    +16.0
h2       :      +2.0|       *  *   ****************** ******** ** *** ********* **** **** *** * ****** ***** ************  * ***** * *************************** ******* ********** ************** ***** ***  **** ** **** ** *****|     +8.0
cdmol_SiO:      +9.0|*** ***** ** * ** ************ ************** ******** * *********   ************* ******************** * ****** ********** ***** ********* ******* ** ******** **** **** ********************** *************|    +16.0

But at some point I get the following error:

Traceback (most recent call last):
  File "data_analysis.py", line 395, in <module>
    find_conditions(detected_lines_df, create_rot_diagram=create_rot_diagram, run_radex=run_mcmc)
  File "data_analysis.py", line 336, in find_conditions
    param_names, median, errup, errlo, tau_dict = param_sampler.find_parameters_line_intensities(run_name=component_name)
  File "/project/analysis/spectral_radex.py", line 224, in find_parameters_line_intensities
    self.results = self.sampler.run(max_ncalls=100000)
  File "/project/venv/lib64/python3.8/site-packages/ultranest/integrator.py", line 2362, in run
    for result in self.run_iter(
  File "/project/venv/lib64/python3.8/site-packages/ultranest/integrator.py", line 2633, in run_iter
    u, p, L = self._create_point(Lmin=Lmin, ndraw=ndraw, active_u=active_u, active_values=active_values)
  File "/project/venv/lib64/python3.8/site-packages/ultranest/integrator.py", line 1911, in _create_point
    self.pointstore.add(
  File "/project/venv/lib64/python3.8/site-packages/ultranest/store.py", line 214, in add
    self.fileobj['points'][self.nrows,:] = row
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/project/venv/lib64/python3.8/site-packages/h5py/_hl/dataset.py", line 999, in __setitem__
    self.id.write(mspace, fspace, val, mtype, dxpl=self._dxpl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5d.pyx", line 283, in h5py.h5d.DatasetID.write
  File "h5py/_proxy.pyx", line 114, in h5py._proxy.dset_rw
OSError: [Errno 121] Can't write data (file read failed: time = Thu Jul 13 16:28:52 2023
, filename = 'ultranest_output/K_-33.5/run3/results/points.hdf5', file descriptor = 11, errno = 121, error message = 'Remote I/O error', buf = 0x46edd18, total read size = 6144, bytes this sub-read = 6144, bytes actually read = 18446744073709551615, offset = 0)
Error in sys.excepthook:

Original exception was:
Segmentation fault (core dumped)
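For reference, errno 121 on Linux is EREMOTEIO ("Remote I/O error"), the same message that appears in the HDF5 error above; it is raised by the kernel when a remote or network-backed device fails mid-operation. A quick stdlib check of the mapping:

```python
import errno
import os

# Errno 121 on Linux maps to the symbolic name EREMOTEIO,
# with the human-readable description "Remote I/O error".
print(errno.errorcode[121])
print(os.strerror(errno.EREMOTEIO))
```
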

The time it takes to crash varies between 1 and 4 minutes.

I have updated h5py to 3.9.0 (its latest version), but the problem persists. I run the Python script in a virtual environment on a virtual desktop server. Anything that might help me fix this is much appreciated, thanks!

JohannesBuchner commented 1 year ago

What ultranest does to store points to HDF5 is essentially equivalent to something like:

import h5py
import numpy as np
import time

ncols = 10

filepath = 'ultranest_output/K_-33.5/run3/results/points.hdf5'
fileobj = h5py.File(filepath, mode='w')

# resizable dataset, one row per stored point
fileobj.create_dataset(
    'points', dtype=float,
    shape=(0, ncols), maxshape=(None, ncols))

nrows = 1
while True:
    fileobj['points'].resize(nrows + 1, axis=0)
    fileobj['points'][nrows, :] = np.random.uniform(size=ncols)
    fileobj.attrs['ncalls'] = nrows
    nrows += 1
    time.sleep(1)

It looks like the (virtual) file system is unstable, or there is some issue with the HDF5 reads. Maybe you can play a bit with the script above, varying the write frequency and the row size (ncols), and see if you can reproduce the bug outside ultranest?
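A parameterized version of that stress test (my sketch, not ultranest code; the function name and knobs are made up, and it writes to a temporary directory rather than the original path) could look like:

```python
import os
import random
import tempfile
import time

import h5py

def stress_hdf5(ncols=10, nrows_total=100, sleep_s=0.0, filepath=None):
    """Append rows to a resizable HDF5 dataset, mimicking ultranest's
    point store, to probe file-system stability under frequent writes."""
    if filepath is None:
        filepath = os.path.join(tempfile.mkdtemp(), 'points.hdf5')
    with h5py.File(filepath, mode='w') as fileobj:
        fileobj.create_dataset(
            'points', dtype=float,
            shape=(0, ncols), maxshape=(None, ncols))
        for nrows in range(nrows_total):
            # grow by one row, then write it -- same pattern as ultranest
            fileobj['points'].resize(nrows + 1, axis=0)
            fileobj['points'][nrows, :] = [random.random() for _ in range(ncols)]
            fileobj.attrs['ncalls'] = nrows
            time.sleep(sleep_s)
    return filepath

path = stress_hdf5(ncols=50, nrows_total=200, sleep_s=0.0)
```

Sweeping `ncols` up and `sleep_s` down increases the write pressure on the file system.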

You get a segfault in addition to the OSError; I am not sure which comes first?

When you run ultranest, do you run just one process? I want to make sure multiple processes are not working on the same file. If you use MPI, you need to install mpi4py.

ikhebgeenaccount commented 1 year ago

Thanks for your quick response.

I played around with the script you provided and went up to ncols=100000 and time.sleep(.0001), and I was unable to reproduce the bug. My initial guess was the same as yours: it is probably an issue with the server. I went to the IT staff, but they could not find an issue with the server (unfortunately for me).

When I run ultranest, I do run just one process. No other processes are using the points.hdf5 file.

Regarding the segfault, it seems like the segfault happens first, causing the OSError to be thrown. So perhaps it is an h5py issue rather than an ultranest one.
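If the remote file system turns out to be the culprit, one possible workaround (my sketch; the directory names are placeholders, and the commented lines stand in for the real sampler calls) is to point log_dir at local scratch and copy the finished run to the remote directory afterwards:

```python
import os
import shutil
import tempfile

# Write to fast, reliable local scratch first ...
local_dir = tempfile.mkdtemp(prefix='ultranest_')
# sampler = ultranest.ReactiveNestedSampler(..., log_dir=local_dir)
# results = sampler.run(max_ncalls=100000)

# Stand-in for the sampler's output, so the copy step below is runnable:
os.makedirs(os.path.join(local_dir, 'results'), exist_ok=True)
open(os.path.join(local_dir, 'results', 'points.hdf5'), 'wb').close()

# ... then copy the whole run directory to the (remote) target location.
remote_dir = os.path.join(tempfile.mkdtemp(), 'ultranest_output', 'run3')
shutil.copytree(local_dir, remote_dir)
```

This keeps every HDF5 write on a local disk, at the cost of losing the point store if the machine dies before the copy.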

ikhebgeenaccount commented 1 year ago

I have tried running it on a different machine, and it works there. So I suspect it is an issue with the server.