czimaginginstitute / MotionCor3

Anisotropic correction of beam induced sample motion for cryo-electron microscopy and tomography
BSD 3-Clause "New" or "Revised" License

motioncor2/motioncor3 Intermittently Produces All-Black Images #25

Closed: jpellman closed this 2 weeks ago

jpellman commented 2 weeks ago

We (SEMC) have been running into an issue where motioncor3 occasionally outputs an entirely black image when run on our SLURM cluster. The issue is not triggered by any obvious precondition and occurs seemingly at random. We've determined that running the same command multiple times will sometimes produce the desired result (a motion-corrected image), while other times the output is blank. Over many invocations the issue appears to be fairly uncommon (my rough guess is that it occurs at most ~10% of the time). Overall, this points to some non-obvious, non-deterministic behavior at some level.
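
For reference, here is roughly how we rerun the identical command and flag blank outputs (a minimal Python sketch; it assumes the mrcfile package is available, and the command line and file paths are placeholders rather than our actual invocation):

import subprocess
import mrcfile
import numpy as np

# Placeholder command line; our real invocation carries more options.
CMD = ["motioncor3", "-InTiff", "frames.tif", "-OutMrc", "out.mrc", "-Gpu", "0"]

def is_blank(path, eps=1e-6):
    # Call an output blank if its statistics are non-finite or it has ~zero variance.
    with mrcfile.open(path, permissive=True) as mrc:
        data = np.asarray(mrc.data, dtype=np.float64)
    return (not np.isfinite(data).all()) or data.std() < eps

blank = 0
for _ in range(20):                  # repeat the same command many times
    subprocess.run(CMD, check=True)
    if is_blank("out.mrc"):
        blank += 1
print(f"{blank}/20 runs produced a blank image")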

Our system has the following specs:

jpellman@memc-gpu03:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Debian
Description:    Debian GNU/Linux 11 (bullseye)
Release:        11
Codename:       bullseye
jpellman@memc-gpu03:~$ nvidia-smi -q | grep "Driver Version"
Driver Version                            : 555.42.02
jpellman@memc-gpu03:~$ nvidia-smi -L
GPU 0: NVIDIA GeForce GTX 1080 (UUID: GPU-448923ee-d9e6-0ad1-67fe-d46fa998cfbb)
GPU 1: NVIDIA GeForce GTX 1080 (UUID: GPU-426a869f-9f30-035d-8636-a52896045ab6)
GPU 2: NVIDIA GeForce GTX 1080 (UUID: GPU-3a73e3e0-7022-7308-c590-090c93dfb7b8)
GPU 3: NVIDIA GeForce GTX 1080 (UUID: GPU-3ee3f4ef-fbbc-c351-a96f-c8f18549275d)

motioncor3 is running within an enroot container (similar to Apptainer/Singularity) with CentOS 7 userland components. The version of CUDA used is 12.1.

When we look at the stdout / logs for the runs that output black images, we see questionable descriptive statistics similar to the following:

Mean & Std: 110211549464887296.00      inf
Hot pixel threshold:      inf
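
For context on why those numbers look the way they do: if even a handful of pixel values come back as enormous garbage floats, the frame mean becomes huge and the standard deviation overflows to inf in float32, so any threshold derived from them becomes inf as well (I'm assuming a mean + k * std style cutoff here, which may not be exactly what MotionCor computes). A small numpy illustration:

import numpy as np

# Illustration only: a few corrupted values ruin the frame statistics.
frame = np.random.normal(100.0, 10.0, size=(4096, 4096)).astype(np.float32)
frame.ravel()[:8] = 2.0e19        # a handful of garbage reads, e.g. from bad VRAM

mean = frame.mean()               # huge but still finite
std = frame.std()                 # each (x - mean)**2 already overflows float32 -> inf
print("Mean & Std:", mean, std)
print("Hot pixel threshold:", mean + 6.0 * std)   # inf, like the log above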

We've tried motioncor2 as well, and the issue persists there too. Our current best guess is that the problem lies in GCalcMoment2D. One suspicion I have is that shared memory is being polluted with stale data from other applications and that this corrupts the results, but I'm not sure.

I've attached the stdout / log from a failed run for further reference. We'd be most grateful for any assistance you can provide on your end! We'll continue investigating on our side as well in case there's something obvious that we missed.

m24jun04c_00004hl_00002enn_st_Log.motioncor2.txt

jpellman commented 2 weeks ago

This is not a software problem, so I am going to close this out. The root cause of our issue was that one of our GPUs was silently failing. Specifically, the onboard GDDR memory on one of the 1080s is bad and was causing motioncor to read incorrect values (hence the questionable mean value). We confirmed the hardware failure by running memtestG80 earlier this morning.
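
If anyone wants a quick first-pass check before reaching for memtestG80, something along these lines (assuming cupy; far less thorough than memtestG80's test suite, but enough to catch grossly bad VRAM) writes and reads back fixed patterns on every GPU:

import cupy as cp

PATTERNS = (0x00000000, 0xFFFFFFFF, 0xAAAAAAAA, 0x55555555)
N = 256 * 1024 * 1024 // 4        # ~256 MiB of uint32 per pattern

for dev in range(cp.cuda.runtime.getDeviceCount()):
    with cp.cuda.Device(dev):
        bad = 0
        for p in PATTERNS:
            buf = cp.full(N, p, dtype=cp.uint32)   # write pattern to device memory
            bad += int((buf != p).sum())           # read back and count mismatches
            del buf
        print(f"GPU {dev}: {bad} mismatched words")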

The non-deterministic behavior that made this happen erratically (rather than consistently) came from the SLURM scheduler. SLURM alternates motioncor jobs between the different GPUs, so any given job lands on the GPU with the bad memory with probability of roughly 0.25 (one of our four GPUs). In retrospect, it should have been fairly obvious that this was where the randomness was introduced; sorry for the noise!