@slochower Thanks for the feedback!
If I'm understanding this correctly, this is only happening in the AWS instance, but locally things work fine, correct?
Ah, yes, I should have been more explicit. This is happening with a Docker image on AWS or running the x86-64 Docker image locally. I've tried `FROM --platform=linux/amd64 nvidia/cuda:12.5.1-base-ubuntu24.04` and `FROM --platform=linux/amd64 nvidia/cuda:12.2.0-base-ubuntu20.04`, followed by installation of either `Miniconda3-latest-Linux-x86_64.sh` or `Miniforge3-Linux-x86_64.sh`, and then creating the environment with either `conda` or `mamba`. We are really stuck figuring out the right dependency stack. This environment is used for running & analyzing the edges.

Locally I have an Arm environment on a Mac, built from the same `environment.yaml` file, that doesn't have issues running the analysis.
Here is some more information. I can now reproduce the issue (in more severe form) with this more minimal Dockerfile:
FROM --platform=linux/amd64 nvidia/cuda:12.5.1-base-ubuntu24.04
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive TZ="America/New_York" \
    apt-get install -y \
        build-essential \
        graphviz \
        graphviz-dev \
        groff \
        libxrender1 \
        wget \
    && rm -rf /var/lib/apt/lists/*
RUN wget \
    "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh" \
    && bash Miniforge3-Linux-x86_64.sh -b -p "/root/conda" \
    && rm -f Miniforge3-Linux-x86_64.sh
ENV PATH /root/conda/bin:$PATH
# The key change here is to get the environment not from the main (public)
# GitHub repository, but rather, from the conda-forge build recipe.
# This is similar but a little more restrictive:
# https://github.com/conda-forge/perses-feedstock/blob/fee858d534528566a60f8e8a9273870cae6f93a7/recipe/meta.yaml
RUN mamba create -n perses -c conda-forge -c openeye perses==0.10.3 openeye-toolkits --yes
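For reference, building and entering the image from this Dockerfile looks roughly like this (the image tag and bind-mounted data directory are placeholders):

```bash
# Build the x86-64 image and open a shell in it.
# "perses-debug" and the mounted data directory are placeholders.
docker build --platform=linux/amd64 -t perses-debug .
docker run --rm -it --platform=linux/amd64 \
    -v /path/to/edges:/data perses-debug /bin/bash
```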
If I try to read a trajectory with DEBUG logging, I see:
[...]
Both vacuum and solvent legs need to be run for hydration free energies
WARNING:openmmtools.multistate.multistatereporter:Warning: The openmmtools.multistate API is experimental and may change in future releases
DEBUG:openmmtools.multistate.multistatereporter:Initial checkpoint file automatically chosen as lig0to2/nf-tower-solvent_checkpoint.nc
WARNING:openmmtools.multistate.multistateanalyzer:Warning: The openmmtools.multistate API is experimental and may change in future releases
Segmentation fault (core dumped)
Debug symbols are missing from the core file, but I think it's coming from the underlying C library. I'm going to keep troubleshooting the environment, but I'm really, really curious why this is happening in a somewhat controlled environment. Would it be possible for you to send a `conda env export` from a Linux x86-64 host that you know is working?
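Something along these lines should capture the exact package builds (the environment name `perses` is an assumption):

```bash
# Export the working environment, including exact builds, for comparison.
conda env export -n perses > perses-linux-x86_64.yml
```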
What is bizarre is that while I've been making small tweaks to the Dockerfile, I've been using the `test.py` file (above), and in the current container version, that works fine. I can read that trajectory (`lig0to8`), but if I try another one (`lig12to15`) -- again, all written from the same environment in the same container -- now I get an HDF5 error for this trajectory (not a seg fault, though!). Which trajectories a given container can read does not seem deterministic from build to build, but within a given container (i.e., a given set of netcdf-related library versions), the failure to load a particular trajectory is reproducible.
Here is the `conda list` for the environment that's not working.
Modifying the build to hard-code the versions suggested does not help:
RUN mamba create -n perses -c conda-forge perses==0.10.3 libnetcdf==4.8.1 netcdf4==1.5.8 hdf5==1.12.1 --yes
The root of the problem is this line in `multistatereporter.py`:
energy_thermodynamic_states = np.array(self._storage_analysis.variables['energies'][iteration, :, :], np.float64)
I've done some experimentation:
print(f"{self._storage_analysis.variables['energies']}")
print(f"{iteration=}")
print(f"{self._storage_analysis.variables['energies'][5000, :, :]}")
I can access the variables, the iteration array goes from 0 to 5000, and I can access random slices (tried 0, 2500, and 5000), but I just can't pass `iteration` to `self._storage_analysis.variables['energies'][iteration, :, :]`.
<class 'netCDF4._netCDF4.Variable'>
float64 energies(iteration, replica, state)
units: kT
long_name: energies[iteration][replica][state] is the reduced (unitless) energy of replica 'replica' from iteration 'iteration' evaluated at the thermodynamic state 'state'.
unlimited dimensions: iteration
current shape = (5001, 11, 11)
filling on, default _FillValue of 9.969209968386869e+36 used
iteration=array([ 0, 1, 2, ..., 4998, 4999, 5000])
[[-1.29720880e+02 -1.06412644e+02 -8.68876065e+01 -7.10937978e+01
-5.89524040e+01 -5.02994688e+01 -4.74773121e+01 -4.29213739e+01
-3.75215299e+01 -3.16007895e+01 -2.52852537e+01]
[-1.09876480e+02 -8.90292592e+01 -7.18319638e+01 -5.82453564e+01
-4.82243584e+01 -4.17104405e+01 -3.90224553e+01 -3.48557246e+01
-2.99873280e+01 -2.46807542e+01 -1.90315799e+01]
[-1.42022761e+02 -1.17760430e+02 -9.72918725e+01 -8.05799502e+01
-6.75841531e+01 -5.82571869e+01 -5.25607059e+01 -4.61449104e+01
-3.92171379e+01 -3.18633149e+01 -2.41136809e+01]
[-1.40570162e+02 -1.18864381e+02 -1.01053946e+02 -8.71030389e+01
-7.69738141e+01 -7.06250332e+01 -6.66762546e+01 -6.17124068e+01
-5.61005116e+01 -5.00026371e+01 -4.34873603e+01]
[-1.08551139e+02 -8.98487577e+01 -7.47651773e+01 -6.32538257e+01
-5.52512046e+01 -5.06455714e+01 -4.56226063e+01 -4.00905612e+01
-3.41697477e+01 -2.79082226e+01 -2.13180199e+01]
[ 4.79983354e+04 4.80219330e+04 4.80417857e+04 4.80579305e+04
4.80704071e+04 4.80792603e+04 -2.26568036e+00 -4.40644015e+01
-4.85315298e+01 -4.61389940e+01 -4.14915649e+01]
[-1.16649336e+02 -9.49373962e+01 -7.69730368e+01 -6.27225845e+01
-5.21520677e+01 -4.52275301e+01 -3.95081255e+01 -3.35468081e+01
-2.73311390e+01 -2.08408950e+01 -1.40510407e+01]
[-9.19894701e+01 -7.05452549e+01 -5.30271371e+01 -3.94004518e+01
-2.96297210e+01 -2.36783436e+01 -3.79856474e+01 -3.77295656e+01
-3.40887281e+01 -2.91130498e+01 -2.34002189e+01]
[ 1.15717840e+03 1.17760782e+03 1.19431318e+03 1.20733092e+03
1.21669995e+03 1.22246368e+03 -1.41898879e+01 -3.74575376e+01
-3.90860804e+01 -3.61485943e+01 -3.15906226e+01]
[ 1.83468715e+04 1.83686944e+04 1.83868478e+04 1.84013669e+04
1.84122880e+04 1.84196487e+04 1.27506641e+01 -2.91744569e+01
-3.41745931e+01 -3.23648010e+01 -2.83426114e+01]
[-1.01017342e+02 -8.20892791e+01 -6.69641395e+01 -5.56041778e+01
-4.79679676e+01 -4.40059951e+01 -4.02857014e+01 -3.57717347e+01
-3.07568839e+01 -2.53445060e+01 -1.95663898e+01]]
It seems there is a size limit to indexing into the NetCDF file?
# Retrieve energies at all thermodynamic states
print(f"{self._storage_analysis.variables['energies']}")
print(f"{iteration=}")
for max_iteration in range(100, 5000, 100):
    _iteration = np.arange(0, max_iteration, 100)
    try:
        self._storage_analysis.variables['energies'][_iteration, :, :]
        print(f"{max_iteration=}\t Success")
    except RuntimeError:
        print(f"{max_iteration=}\t Failed")
        break
max_iteration=100 Success
max_iteration=200 Success
max_iteration=300 Success
max_iteration=400 Success
max_iteration=500 Success
max_iteration=600 Failed
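The same probe can be run against the file with plain netCDF4, outside of openmmtools entirely; here's a sketch (the file path is a placeholder for one of the failing analysis files):

```python
# Sketch: fancy-index the 'energies' variable with progressively larger
# index arrays using only netCDF4 and numpy. The path is a placeholder.
import numpy as np
import netCDF4

ds = netCDF4.Dataset("path/to/failing_analysis.nc", "r")
energies = ds.variables["energies"]
print(energies.shape)  # expect something like (5001, 11, 11)

for max_iteration in range(100, energies.shape[0], 100):
    idx = np.arange(0, max_iteration, 100)
    try:
        energies[idx, :, :]
        print(f"{max_iteration=}\t Success")
    except RuntimeError:
        print(f"{max_iteration=}\t Failed")
        break
```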
That is really weird...
> It seems there is a size limit to indexing into the NetCDF file?
Maybe this is a silly question, but is there space on the disk? Maybe add an `os.system("df -h .")` call (assuming `.` is on the same file system where the NetCDF file is written) just to make sure there is plenty of disk space.
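A Python-only variant of that check, if it's easier to drop into the script (the directory is whichever one holds the NetCDF files):

```python
# Sketch: report free space on the filesystem holding the NetCDF files,
# without shelling out to df. The directory path is a placeholder.
import shutil

usage = shutil.disk_usage("lig0to2")
print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")
```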
Doing some experiments to see if this could be disk space related... I'll update this comment.
Good thought @mikemhenry, but I don't think it is disk space-related. I'm now running in a container that reports 703 GB available to `/`. I see the same behavior:
max_iteration=100 Success
max_iteration=200 Success
max_iteration=300 Success
max_iteration=400 Success
max_iteration=500 Success
max_iteration=600 Failed
If I modify the lines slightly:
for max_iteration in range(100, 5000, 10):
    _iteration = np.arange(0, max_iteration, 1)
    try:
        self._storage_analysis.variables['energies'][_iteration, :, :]
        print(f"{max_iteration=}\t Success")
    except RuntimeError:
        print(f"{max_iteration=}\t Failed")
        break
Now it stops working around ~460.
max_iteration=100 Success
max_iteration=110 Success
max_iteration=120 Success
max_iteration=130 Success
max_iteration=140 Success
max_iteration=150 Success
max_iteration=160 Success
max_iteration=170 Success
max_iteration=180 Success
max_iteration=190 Success
max_iteration=200 Success
max_iteration=210 Success
max_iteration=220 Success
max_iteration=230 Success
max_iteration=240 Success
max_iteration=250 Success
max_iteration=260 Success
max_iteration=270 Success
max_iteration=280 Success
max_iteration=290 Success
max_iteration=300 Success
max_iteration=310 Success
max_iteration=320 Success
max_iteration=330 Success
max_iteration=340 Success
max_iteration=350 Success
max_iteration=360 Success
max_iteration=370 Success
max_iteration=380 Success
max_iteration=390 Success
max_iteration=400 Success
max_iteration=410 Success
max_iteration=420 Success
max_iteration=430 Success
max_iteration=440 Success
max_iteration=450 Success
max_iteration=460 Failed
I'm not even sure what to think about this -- at some point it just stops being able to index into the file? If I remove the `break` and let it keep going, it never works after that. Do you think I need to compile the NetCDF libraries manually...?
I can get it to fail on multiple edges all at 460 (if I change the for loop to increment by 1, it stops at 456). Could it be that something is happening at this point for a bunch of edges? Some memory threshold? Peak memory usage is, I think, just around 256 MB:
/usr/bin/time -v /root/conda/envs/perses/bin/python test.py
[...]
Command exited with non-zero status 1
Command being timed: "/root/conda/envs/perses/bin/python test.py"
User time (seconds): 6.72
System time (seconds): 0.15
Percent of CPU this job got: 578%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.18
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 203932
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 33261
Voluntary context switches: 82
Involuntary context switches: 407
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 1
It always fails on the solvent trajectory, too, but I'm not sure if that's just because the solvent trajectory is read first.
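To rule out a process-level memory cap inside the container, a quick check of the resource limits might help (just a sketch; which limits matter here is a guess):

```python
# Sketch: print the soft/hard resource limits visible to the Python process,
# to see whether the container imposes an address-space or data-segment cap.
import resource

for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_STACK", "RLIMIT_RSS"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f"{name}: soft={soft} hard={hard}")
```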
Confirmed that it happens in the `libhdf5` C code.
(gdb) where
#0 0x00007b17a60cdf14 in H5F_addr_decode () from /root/miniconda3/envs/perses/lib/python3.11/site-packages/netCDF4/../../.././libhdf5.so.310
#1 0x00007b17a62bcfb8 in H5VL__native_blob_specific () from /root/miniconda3/envs/perses/lib/python3.11/site-packages/netCDF4/../../.././libhdf5.so.310
#2 0x00007b17a62a6b34 in H5VL__blob_specific.isra.0 () from /root/miniconda3/envs/perses/lib/python3.11/site-packages/netCDF4/../../.././libhdf5.so.310
#3 0x00007b17a62b5b03 in H5VL_blob_specific () from /root/miniconda3/envs/perses/lib/python3.11/site-packages/netCDF4/../../.././libhdf5.so.310
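For anyone retracing this, a backtrace like the one above can be obtained by enabling core dumps, rerunning the failing script, and opening the core in gdb (paths are illustrative, and the core file name/location depends on the container's core_pattern):

```bash
# Enable core dumps, reproduce the crash, and inspect the core file.
ulimit -c unlimited
/root/conda/envs/perses/bin/python test.py      # segfaults and dumps core
gdb /root/conda/envs/perses/bin/python core     # then run: where
```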
I suspect it has something to do with threading (https://forum.hdfgroup.org/t/segmentation-fault-in-h5dopen2-call-after-several-runs/10465/8, https://github.com/Unidata/netcdf4-python/issues/261, https://forum.hdfgroup.org/t/heap-use-after-free-by-the-call-h5aread-when-running-with-multiple-threads-in-hdf5-thread-safe-version/6405), but it still happens even if I set `OMP_NUM_THREADS=1`.
Do you have a working Dockerfile I can copy to see if this can be isolated? I don't think the Python library is dynamically linked to an OS-packaged netCDF4 or HDF5, but I'd like to verify (I tried playing with `ldd` on those `.so` files but couldn't quite figure out the right incantation).
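For what it's worth, something like this should show what the extension module actually links against (the exact `.so` filename is an assumption and will vary by Python version):

```bash
# List the shared libraries the netCDF4 extension resolves to.
cd /root/miniconda3/envs/perses/lib/python3.11/site-packages/netCDF4
ldd _netCDF4.cpython-311-x86_64-linux-gnu.so | grep -Ei 'netcdf|hdf5'
```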
@slochower What do you need in the Dockerfile? We have a few images running in production that might have everything you need, where we haven't seen this issue. This Dockerfile https://github.com/openforcefield/alchemiscale/blob/v0.4.0/docker/alchemiscale-compute/Dockerfile with this environment YAML https://github.com/openforcefield/alchemiscale/blob/v0.4.0/devtools/conda-envs/alchemiscale-compute.yml might do everything you need (you can also pull the image with `docker pull ghcr.io/openforcefield/alchemiscale-compute:v0.4.0`). We have had various versions and builds of the `netCDF4` stack give us problems.
Ah! I solved it! I am not entirely clear on why this happens only inside Docker, but I can report what triggered it and how to avoid it.
For a long time, when analyzing Perses `Simulation` objects, my workflow was to load the `Simulation`, test that free energies had been computed, and then pickle the object to analyze later. That would look something like this:
try:
    _sim = Simulation(path)
except Exception as e:
    if "Non-global MBAR" in str(e):
        logger.warning(f"There is an issue with the energies "
                       f"in {path=}...")
    return
# Test that we can actually compute free energies...
_sim.historic_fes(500)
if _sim.bindingdg:
    logger.info(f"Making pickle file {pickle_file=}...")
    pickle.dump(_sim, open(pickle_file, "wb"))
I was using `500` to look for 500 historic free energies, which... well, depending on the input configuration, might be too many. Trying to index into the NetCDF file beyond the length of the array should probably have raised an `IndexError`, instead of a segmentation fault in the C library. Adjusting this value or just removing the check (I haven't seen `historic_fes` not be populated in years; this was from 2020) resolves the issue.
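If the check is worth keeping, one way to avoid over-indexing is to clamp the request to the number of iterations actually stored in the analysis file; a rough sketch, where `analysis_path` is a placeholder and `Simulation`/`historic_fes` refer to the wrapper described above:

```python
# Sketch: clamp the number of requested historic free energies to what the
# analysis NetCDF file actually contains, instead of hard-coding 500.
import netCDF4

analysis_path = "path/to/analysis.nc"  # placeholder for the per-edge file
with netCDF4.Dataset(analysis_path, "r") as ds:
    n_iterations = ds.variables["energies"].shape[0]

_sim.historic_fes(min(500, n_iterations))
```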
Hi, we're seeing HDF errors when trying to load Perses simulations with `openmmtools` 0.23.0 (from `conda`) or 0.23.1 (from a `pip install` of this repo). I think it is very similar to https://github.com/choderalab/openmmtools/issues/666, except we have tried the suggestions and they are not working for us. Specifically, we're installing a Perses environment from this file (with `mamba`), followed by a local `pip install` of Perses itself. Even if we force `libnetcdf==4.8.1`, `netcdf4==1.5.8`, and `hdf5==1.12.1` (in the environment YAML or after activating the environment), this is what we see when trying to read a trajectory:
[...]
There are a few observations I find confusing:
[...] the same `conda` environment -- and therefore the same NetCDF libraries. So somehow `netcdf`/`hdf5` are writing trajectories that they later can't read in.