@slochower Thanks for the feedback!
If I'm understanding this correctly, this is only happening in the AWS instance, but locally things work fine, correct?
Ah, yes, I should have been more explicit. This is happening with a Docker image on AWS or running the x86-64 Docker image locally. I've tried `FROM --platform=linux/amd64 nvidia/cuda:12.5.1-base-ubuntu24.04` and `FROM --platform=linux/amd64 nvidia/cuda:12.2.0-base-ubuntu20.04`, followed by installation of either `Miniconda3-latest-Linux-x86_64.sh` or `Miniforge3-Linux-x86_64.sh`, and then creating the environment with either `conda` or `mamba`. We are really stuck figuring out the right dependency stack. This environment is used for running & analyzing the edges.

Locally I have an Arm environment on a Mac, built from the same `environment.yaml` file, that doesn't have issues running the analysis.
Here is some more information. I can now reproduce the issue (in more severe form) with this more minimal Dockerfile:
FROM --platform=linux/amd64 nvidia/cuda:12.5.1-base-ubuntu24.04
RUN apt-get update && \
    DEBIAN_FRONTEND=noninteractive TZ="America/New_York" \
    apt-get install -y \
        build-essential \
        graphviz \
        graphviz-dev \
        groff \
        libxrender1 \
        wget \
    && rm -rf /var/lib/apt/lists/*
RUN wget \
    "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-x86_64.sh" \
    && bash Miniforge3-Linux-x86_64.sh -b -p "/root/conda" \
    && rm -f Miniforge3-Linux-x86_64.sh
ENV PATH /root/conda/bin:$PATH
# The key change here is to get the environment not from the main (public)
# GitHub repository, but rather, from the conda-forge build recipe.
# This is similar but a little more restrictive:
# https://github.com/conda-forge/perses-feedstock/blob/fee858d534528566a60f8e8a9273870cae6f93a7/recipe/meta.yaml
RUN mamba create -n perses -c conda-forge -c openeye perses==0.10.3 openeye-toolkits --yes
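For reference, building and entering the image from this Dockerfile looks roughly like this (the image tag and bind-mounted data directory are placeholders):

```bash
# Build the x86-64 image and open a shell in it.
# "perses-debug" and the mounted data directory are placeholders.
docker build --platform=linux/amd64 -t perses-debug .
docker run --rm -it --platform=linux/amd64 \
    -v /path/to/edges:/data perses-debug /bin/bash
```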
If I try to read a trajectory with DEBUG logging, I see:
[...]
Both vacuum and solvent legs need to be run for hydration free energies
WARNING:openmmtools.multistate.multistatereporter:Warning: The openmmtools.multistate API is experimental and may change in future releases
DEBUG:openmmtools.multistate.multistatereporter:Initial checkpoint file automatically chosen as lig0to2/nf-tower-solvent_checkpoint.nc
WARNING:openmmtools.multistate.multistateanalyzer:Warning: The openmmtools.multistate API is experimental and may change in future releases
Segmentation fault (core dumped)
Debug symbols are missing from the core file, but I think it's coming from the underlying C library. I'm going to keep troubleshooting the environment, but I'm really, really curious why this is happening in a somewhat controlled environment. Would it be possible for you to send a `conda env export` from a Linux x86-64 host that you know is working?
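Something along these lines should capture the exact package builds (the environment name `perses` is an assumption):

```bash
# Export the working environment, including exact builds, for comparison.
conda env export -n perses > perses-linux-x86_64.yml
```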
What is bizarre is that while I've been making small tweaks to the Dockerfile, I've been using the `test.py` file (above), and in the current container version, that works fine. I can read that trajectory (`lig0to8`), but if I try another one (`lig12to15`) -- again, all written from the same environment in the same container -- now I get an HDF5 error for this trajectory (not a seg fault, though!). Which trajectories a given container can read does not seem deterministic from build to build, but within a given container (i.e., a given set of netcdf-related library versions), the failure to load a particular trajectory is reproducible.
Here is the `conda list` for the environment that's not working.
Modifying the build to hard-code the versions suggested does not help:
RUN mamba create -n perses -c conda-forge perses==0.10.3 libnetcdf==4.8.1 netcdf4==1.5.8 hdf5==1.12.1 --yes
The root of the problem is this line in `multistatereporter.py`:
energy_thermodynamic_states = np.array(self._storage_analysis.variables['energies'][iteration, :, :], np.float64)
I've done some experimentation:
print(f"{self._storage_analysis.variables['energies']}")
print(f"{iteration=}")
print(f"{self._storage_analysis.variables['energies'][5000, :, :]}")
I can access the variables, the iteration array goes from 0 to 5000, and I can access random slices (tried 0, 2500, and 5000), but I just can't pass `iteration` to `self._storage_analysis.variables['energies'][iteration, :, :]`.
<class 'netCDF4._netCDF4.Variable'>
float64 energies(iteration, replica, state)
units: kT
long_name: energies[iteration][replica][state] is the reduced (unitless) energy of replica 'replica' from iteration 'iteration' evaluated at the thermodynamic state 'state'.
unlimited dimensions: iteration
current shape = (5001, 11, 11)
filling on, default _FillValue of 9.969209968386869e+36 used
iteration=array([ 0, 1, 2, ..., 4998, 4999, 5000])
[[-1.29720880e+02 -1.06412644e+02 -8.68876065e+01 -7.10937978e+01
-5.89524040e+01 -5.02994688e+01 -4.74773121e+01 -4.29213739e+01
-3.75215299e+01 -3.16007895e+01 -2.52852537e+01]
[-1.09876480e+02 -8.90292592e+01 -7.18319638e+01 -5.82453564e+01
-4.82243584e+01 -4.17104405e+01 -3.90224553e+01 -3.48557246e+01
-2.99873280e+01 -2.46807542e+01 -1.90315799e+01]
[-1.42022761e+02 -1.17760430e+02 -9.72918725e+01 -8.05799502e+01
-6.75841531e+01 -5.82571869e+01 -5.25607059e+01 -4.61449104e+01
-3.92171379e+01 -3.18633149e+01 -2.41136809e+01]
[-1.40570162e+02 -1.18864381e+02 -1.01053946e+02 -8.71030389e+01
-7.69738141e+01 -7.06250332e+01 -6.66762546e+01 -6.17124068e+01
-5.61005116e+01 -5.00026371e+01 -4.34873603e+01]
[-1.08551139e+02 -8.98487577e+01 -7.47651773e+01 -6.32538257e+01
-5.52512046e+01 -5.06455714e+01 -4.56226063e+01 -4.00905612e+01
-3.41697477e+01 -2.79082226e+01 -2.13180199e+01]
[ 4.79983354e+04 4.80219330e+04 4.80417857e+04 4.80579305e+04
4.80704071e+04 4.80792603e+04 -2.26568036e+00 -4.40644015e+01
-4.85315298e+01 -4.61389940e+01 -4.14915649e+01]
[-1.16649336e+02 -9.49373962e+01 -7.69730368e+01 -6.27225845e+01
-5.21520677e+01 -4.52275301e+01 -3.95081255e+01 -3.35468081e+01
-2.73311390e+01 -2.08408950e+01 -1.40510407e+01]
[-9.19894701e+01 -7.05452549e+01 -5.30271371e+01 -3.94004518e+01
-2.96297210e+01 -2.36783436e+01 -3.79856474e+01 -3.77295656e+01
-3.40887281e+01 -2.91130498e+01 -2.34002189e+01]
[ 1.15717840e+03 1.17760782e+03 1.19431318e+03 1.20733092e+03
1.21669995e+03 1.22246368e+03 -1.41898879e+01 -3.74575376e+01
-3.90860804e+01 -3.61485943e+01 -3.15906226e+01]
[ 1.83468715e+04 1.83686944e+04 1.83868478e+04 1.84013669e+04
1.84122880e+04 1.84196487e+04 1.27506641e+01 -2.91744569e+01
-3.41745931e+01 -3.23648010e+01 -2.83426114e+01]
[-1.01017342e+02 -8.20892791e+01 -6.69641395e+01 -5.56041778e+01
-4.79679676e+01 -4.40059951e+01 -4.02857014e+01 -3.57717347e+01
-3.07568839e+01 -2.53445060e+01 -1.95663898e+01]]
It seems there is a size limit to indexing into the NetCDF file?
# Retrieve energies at all thermodynamic states
print(f"{self._storage_analysis.variables['energies']}")
print(f"{iteration=}")
for max_iteration in range(100, 5000, 100):
    _iteration = np.arange(0, max_iteration, 100)
    try:
        self._storage_analysis.variables['energies'][_iteration, :, :]
        print(f"{max_iteration=}\t Success")
    except RuntimeError:
        print(f"{max_iteration=}\t Failed")
        break
max_iteration=100 Success
max_iteration=200 Success
max_iteration=300 Success
max_iteration=400 Success
max_iteration=500 Success
max_iteration=600 Failed
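The same probe can be run against the file with plain netCDF4, outside of openmmtools entirely; here's a sketch (the file path is a placeholder for one of the failing analysis files):

```python
# Sketch: fancy-index the 'energies' variable with progressively larger
# index arrays using only netCDF4 and numpy. The path is a placeholder.
import numpy as np
import netCDF4

ds = netCDF4.Dataset("path/to/failing_analysis.nc", "r")
energies = ds.variables["energies"]
print(energies.shape)  # expect something like (5001, 11, 11)

for max_iteration in range(100, energies.shape[0], 100):
    idx = np.arange(0, max_iteration, 100)
    try:
        energies[idx, :, :]
        print(f"{max_iteration=}\t Success")
    except RuntimeError:
        print(f"{max_iteration=}\t Failed")
        break
```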
That is really weird...
> It seems there is a size limit to indexing into the NetCDF file?
Maybe this is a silly question, but is there space on the disk? Maybe add an `os.system("df -h .")` call (assuming `.` is on the same file system where the NetCDF file is written) just to make sure there is plenty of disk space.
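A Python-only variant of that check, if it's easier to drop into the script (the directory is whichever one holds the NetCDF files):

```python
# Sketch: report free space on the filesystem holding the NetCDF files,
# without shelling out to df. The directory path is a placeholder.
import shutil

usage = shutil.disk_usage("lig0to2")
print(f"free: {usage.free / 1e9:.1f} GB of {usage.total / 1e9:.1f} GB")
```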
Doing some experiments to see if this could be disk space related... I'll update this comment.
Good thought @mikemhenry, but I don't think it is disk space-related. I'm now running in a container that reports 703 GB available to `/`. I see the same behavior:
max_iteration=100 Success
max_iteration=200 Success
max_iteration=300 Success
max_iteration=400 Success
max_iteration=500 Success
max_iteration=600 Failed
If I modify the lines slightly:
for max_iteration in range(100, 5000, 10):
    _iteration = np.arange(0, max_iteration, 1)
    try:
        self._storage_analysis.variables['energies'][_iteration, :, :]
        print(f"{max_iteration=}\t Success")
    except RuntimeError:
        print(f"{max_iteration=}\t Failed")
        break
Now it stops working around ~460.
max_iteration=100 Success
max_iteration=110 Success
max_iteration=120 Success
max_iteration=130 Success
max_iteration=140 Success
max_iteration=150 Success
max_iteration=160 Success
max_iteration=170 Success
max_iteration=180 Success
max_iteration=190 Success
max_iteration=200 Success
max_iteration=210 Success
max_iteration=220 Success
max_iteration=230 Success
max_iteration=240 Success
max_iteration=250 Success
max_iteration=260 Success
max_iteration=270 Success
max_iteration=280 Success
max_iteration=290 Success
max_iteration=300 Success
max_iteration=310 Success
max_iteration=320 Success
max_iteration=330 Success
max_iteration=340 Success
max_iteration=350 Success
max_iteration=360 Success
max_iteration=370 Success
max_iteration=380 Success
max_iteration=390 Success
max_iteration=400 Success
max_iteration=410 Success
max_iteration=420 Success
max_iteration=430 Success
max_iteration=440 Success
max_iteration=450 Success
max_iteration=460 Failed
I'm not even sure what to think about this -- at some point it just stops being able to index into the file? If I remove the `break` and let it keep going, it never works after that. Do you think I need to compile the NetCDF libraries manually...?
I can get it to fail on multiple edges all at 460 (if I change the for loop to increment by 1, it stops at 456). Could it be that something is happening at this point for a bunch of edges? Some memory threshold? Peak memory usage is, I think, just around 256 MB:
/usr/bin/time -v /root/conda/envs/perses/bin/python test.py
[...]
Command exited with non-zero status 1
Command being timed: "/root/conda/envs/perses/bin/python test.py"
User time (seconds): 6.72
System time (seconds): 0.15
Percent of CPU this job got: 578%
Elapsed (wall clock) time (h:mm:ss or m:ss): 0:01.18
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 203932
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 33261
Voluntary context switches: 82
Involuntary context switches: 407
Swaps: 0
File system inputs: 0
File system outputs: 8
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 1
It always fails on the solvent trajectory, too, but I'm not sure if that's just because the solvent trajectory is read first.
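To rule out a process-level memory cap inside the container, a quick check of the resource limits might help (just a sketch; which limits matter here is a guess):

```python
# Sketch: print the soft/hard resource limits visible to the Python process,
# to see whether the container imposes an address-space or data-segment cap.
import resource

for name in ("RLIMIT_AS", "RLIMIT_DATA", "RLIMIT_STACK", "RLIMIT_RSS"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print(f"{name}: soft={soft} hard={hard}")
```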
Confirmed that it happens in the `libhdf5` C code.
(gdb) where
#0 0x00007b17a60cdf14 in H5F_addr_decode () from /root/miniconda3/envs/perses/lib/python3.11/site-packages/netCDF4/../../.././libhdf5.so.310
#1 0x00007b17a62bcfb8 in H5VL__native_blob_specific () from /root/miniconda3/envs/perses/lib/python3.11/site-packages/netCDF4/../../.././libhdf5.so.310
#2 0x00007b17a62a6b34 in H5VL__blob_specific.isra.0 () from /root/miniconda3/envs/perses/lib/python3.11/site-packages/netCDF4/../../.././libhdf5.so.310
#3 0x00007b17a62b5b03 in H5VL_blob_specific () from /root/miniconda3/envs/perses/lib/python3.11/site-packages/netCDF4/../../.././libhdf5.so.310
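For anyone retracing this, a backtrace like the one above can be obtained by enabling core dumps, rerunning the failing script, and opening the core in gdb (paths are illustrative, and the core file name/location depends on the container's core_pattern):

```bash
# Enable core dumps, reproduce the crash, and inspect the core file.
ulimit -c unlimited
/root/conda/envs/perses/bin/python test.py      # segfaults and dumps core
gdb /root/conda/envs/perses/bin/python core     # then run: where
```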
I suspect it has something to do with threading (https://forum.hdfgroup.org/t/segmentation-fault-in-h5dopen2-call-after-several-runs/10465/8, https://github.com/Unidata/netcdf4-python/issues/261, https://forum.hdfgroup.org/t/heap-use-after-free-by-the-call-h5aread-when-running-with-multiple-threads-in-hdf5-thread-safe-version/6405), but it still happens even if I set `OMP_NUM_THREADS=1`.
Do you have a working Dockerfile I can copy to see if this can be isolated? I don't think the Python library is dynamically linked to an OS-packaged netCDF4 or HDF5, but I'd like to verify (I tried playing with `ldd` on those `.so` files but couldn't quite figure out the right incantation).
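For what it's worth, something like this should show what the extension module actually links against (the exact `.so` filename is an assumption and will vary by Python version):

```bash
# List the shared libraries the netCDF4 extension resolves to.
cd /root/miniconda3/envs/perses/lib/python3.11/site-packages/netCDF4
ldd _netCDF4.cpython-311-x86_64-linux-gnu.so | grep -Ei 'netcdf|hdf5'
```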
@slochower What do you need in the Dockerfile? We have a few images running in production that might have everything you need, where we haven't seen this issue. This Dockerfile https://github.com/openforcefield/alchemiscale/blob/v0.4.0/docker/alchemiscale-compute/Dockerfile with this environment YAML https://github.com/openforcefield/alchemiscale/blob/v0.4.0/devtools/conda-envs/alchemiscale-compute.yml might do everything you need (you can also pull the image with `docker pull ghcr.io/openforcefield/alchemiscale-compute:v0.4.0`). We have had various versions and builds of the `netCDF4` stack give us problems.
Ah! I solved it! I am not entirely clear on why this happens only inside Docker, but I can report what triggered it and how to avoid it.
For a long time, when analyzing Perses `Simulation` objects, my workflow was to load the `Simulation`, test that free energies had been computed, and then pickle the object to analyze later. That would look something like this:
try:
    _sim = Simulation(path)
except Exception as e:
    if "Non-global MBAR" in str(e):
        logger.warning(f"There is an issue with the energies "
                       f"in {path=}...")
    return
# Test that we can actually compute free energies...
_sim.historic_fes(500)
if _sim.bindingdg:
    logger.info(f"Making pickle file {pickle_file=}...")
    pickle.dump(_sim, open(pickle_file, "wb"))
I was using `500` to look for 500 historic free energies, which... well, depending on the input configuration, might be too many. Trying to index into the NetCDF file beyond the length of the array should probably have raised an `IndexError`, instead of a segmentation fault in the C library. Adjusting this value or just removing the check (I haven't seen `historic_fes` not be populated in years; this was from 2020) resolves the issue.
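If the check is worth keeping, one way to avoid over-indexing is to clamp the request to the number of iterations actually stored in the analysis file; a rough sketch, where `analysis_path` is a placeholder and `Simulation`/`historic_fes` refer to the wrapper described above:

```python
# Sketch: clamp the number of requested historic free energies to what the
# analysis NetCDF file actually contains, instead of hard-coding 500.
import netCDF4

analysis_path = "path/to/analysis.nc"  # placeholder for the per-edge file
with netCDF4.Dataset(analysis_path, "r") as ds:
    n_iterations = ds.variables["energies"].shape[0]

_sim.historic_fes(min(500, n_iterations))
```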
Hi, we're seeing HDF errors when trying to load Perses simulations with `openmmtools` 0.23.0 (from `conda`) or 0.23.1 (from a `pip install` of this repo). I think it is very similar to https://github.com/choderalab/openmmtools/issues/666, except we have tried the suggestions and they are not working for us. Specifically, we're installing a Perses environment from this file (with `mamba`), followed by a local `pip install` of Perses itself. Even if we force `libnetcdf==4.8.1`, `netcdf4==1.5.8`, and `hdf5==1.12.1` (in the environment YAML or after activating the environment), this is what we see when trying to read a trajectory:
[...]
There are a few observations I find confusing:
[...] the same `conda` environment -- and therefore the same NetCDF libraries. So somehow `netcdf`/`hdf5` are writing trajectories that they later can't read in.