3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

illegal memory access error during VDAM 2D classification #859

Closed: eariascib closed this issue 10 months ago

eariascib commented 2 years ago

Good morning,

I am trying to run the new VDAM 2D classification with some negative staining data and I am getting a GPU illegal memory access error. Strikingly, this only happens with negative staining data. The program seems to run fine with cryo data.

We collected the data using a TVIPS CMOS detector on a 120 kV microscope. The images were saved in TIF format, and we converted them to MRC using EMAN2 (e2proc2d.py *.tif @.mrc). We then imported the images and performed all the preprocessing in RELION (CTFfind, Topaz, particle extraction without inverting the contrast).

Interestingly, the classical algorithm (EM) works fine; only the new one stalls at iteration 3 and gives the error.

The only difference (besides the nature of the data and camera) during processing is that for the negative staining data we do not run motioncor. Maybe the new algorithm is more sensitive to hot pixels or other outliers in the images that are normally discarded/corrected during motioncor?
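One way to test the hot-pixel hypothesis before rerunning the job is to count extreme pixel values in a few extracted particles. The snippet below is a minimal sketch, not part of RELION: `find_outlier_pixels` and the 6-sigma threshold are hypothetical choices, and it assumes the pixel values have already been read into a flat Python sequence by whatever MRC reader is at hand.

```python
from statistics import median, pstdev


def find_outlier_pixels(values, n_sigma=6.0):
    """Return indices of values far from the median (hypothetical helper).

    The spread is estimated from the median absolute deviation (MAD),
    scaled by 1.4826 so it matches the standard deviation for Gaussian
    noise. A few hot pixels barely move this estimate, unlike the
    ordinary standard deviation.
    """
    values = list(values)
    if not values:
        return []
    center = median(values)
    mad = median(abs(v - center) for v in values)
    sigma = 1.4826 * mad
    if sigma == 0:
        # More than half of the values are identical; fall back to the
        # ordinary (population) standard deviation.
        sigma = pstdev(values)
    if sigma == 0:
        # Constant image: nothing can be an outlier.
        return []
    return [i for i, v in enumerate(values) if abs(v - center) > n_sigma * sigma]


if __name__ == "__main__":
    # Synthetic example: 100 ordinary values plus one hot pixel.
    pixels = [0.0, 1.0, -1.0, 0.5, -0.5] * 20 + [1000.0]
    print(find_outlier_pixels(pixels))
```

If a noticeable fraction of particles contains such pixels, that would support the idea that outliers normally removed during motion correction are involved; if none do, the cause is more likely elsewhere.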

I would be very grateful if you could give me some assistance to solve this issue.

Environment:
OS: Linux Mint 18.1, Linux kernel 4.4.0-53-generic
MPI runtime: mpich-3.1.4
RELION version: Relion4.0
Memory: 128 GB
GPU: 2x GTX 1080Ti
CUDA: 8.0

Dataset:

Job options:

Error message:

This is the output of the out file:

Will distribute threads over devices 0 1
Thread 0 mapped to device 0
Thread 1 mapped to device 1
Thread 2 mapped to device 0
Thread 3 mapped to device 1
Thread 4 mapped to device 0
Thread 5 mapped to device 1
Thread 6 mapped to device 0
Thread 7 mapped to device 1
Thread 8 mapped to device 0
Thread 9 mapped to device 1
Thread 10 mapped to device 0
Thread 11 mapped to device 1
Running CPU instructions in double precision.

######################

And this is the err file:

ERROR: an illegal memory access was encountered in /usr/local/relion_4.0_beta/src/acc/cuda/custom_allocator.cuh at line 175 (error-code 77)
in: /usr/local/relion_4.0_beta/src/acc/cuda/cuda_settings.h, line 65
ERROR:

A GPU-function failed to execute.

If this occured at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you

-> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
   and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), 
   this may happen.

-> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
   at least compute 3.5. You may be trying to use a GPU older than
   this. If you have multiple generations, try specifying --gpu <X>
   with X=0. Then try X=1 in a new run, and so on. The numbering of
   GPUs may not be obvious from the driver or intuition. For a list
   of GPU compute generations, see 

   en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

-> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
   as to not require this, and may thus have unforeseen requirements
   when run in this mode. If you think it is nonetheless necessary,
   please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

-> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is 
   subject to many restrictions, and relion is written to work within
   common restraints. If you have exotic data or settings, unexpected
   configurations may occur. See also above point regarding 
   double precision.

If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues

biochem-fan commented 2 years ago

What happens if you ignore CTF until first peak, or completely disable CTF correction? What if you set --maxsig 100?
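The combinations suggested here can be enumerated so that none is skipped. The following is a rough sketch rather than a recommended workflow: `troubleshooting_variants` is a hypothetical helper, the base command is a placeholder to be replaced with the arguments from the failing job, and it assumes that base command does not already contain `--ctf`, `--ctf_intact_first_peak` or `--maxsig`.

```python
from itertools import product


def troubleshooting_variants(base_args):
    """Yield (label, argv) pairs for every CTF / --maxsig combination.

    base_args is the failing relion_refine command line WITHOUT any of
    the flags varied here (assumption: the caller has stripped them).
    """
    ctf_modes = {
        "no_ctf": [],
        "ctf": ["--ctf"],
        "ctf_ignore_first_peak": ["--ctf", "--ctf_intact_first_peak"],
    }
    maxsig_modes = {
        "default_maxsig": [],
        "maxsig100": ["--maxsig", "100"],
    }
    for (ctf_label, ctf_args), (sig_label, sig_args) in product(
        ctf_modes.items(), maxsig_modes.items()
    ):
        yield f"{ctf_label}+{sig_label}", list(base_args) + ctf_args + sig_args


if __name__ == "__main__":
    # Placeholder base command; the input and output paths are made up.
    base = ["relion_refine", "--i", "particles.star", "--o", "Class2D/test/run", "--grad"]
    for label, argv in troubleshooting_variants(base):
        print(label, " ".join(argv))
```

Printing the commands instead of launching them keeps the sketch side-effect free; each printed line can be pasted into a terminal or a job script and given its own output directory.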

eariascib commented 2 years ago

Thanks for the quick reply.

I get the same error with CTF correction disabled, and with it enabled both with and without ignoring the CTF until the first peak.

I've also tried adding the --maxsig 100 option, with and without CTF correction, and the error is quite similar. It has an additional line at the end of the run.out file, so I'm pasting it here in case it's useful:

Will distribute threads over devices 0 1
Thread 0 mapped to device 0
Thread 1 mapped to device 1
Thread 2 mapped to device 0
Thread 3 mapped to device 1
Thread 4 mapped to device 0
Thread 5 mapped to device 1
Thread 6 mapped to device 0
Thread 7 mapped to device 1
Thread 8 mapped to device 0
Thread 9 mapped to device 1
Thread 10 mapped to device 0
Thread 11 mapped to device 1
Running CPU instructions in double precision.

drawoliver commented 1 year ago

Still having the same problem — any thoughts / help / assistance please?

KERNEL_ERROR: an illegal memory access was encountered in /usr/local/git_source/relion-4.0-20220817/src/acc/acc_ml_optimiser_impl.h at line 428 (error-code 700)
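The two reports in this thread carry different numeric codes (77 and 700) but the same message text, so comparing the text is the safer way to tell whether two logs show the same failure. As far as I know, the differing numbers come from CUDA renumbering its runtime error codes between toolkit generations, but treat that as an assumption to verify against the headers of your own toolkit. The sketch below pulls the fields out of such a line; `parse_gpu_error` is a hypothetical helper, and the pattern is only fitted to the two lines quoted here.

```python
import re

# Pattern fitted to the two error lines quoted in this thread; other
# RELION versions may format the message differently.
ERROR_LINE = re.compile(
    r"(?:KERNEL_)?ERROR: (?P<message>.+?) in (?P<file>\S+) "
    r"at line (?P<line>\d+) \(error-code (?P<code>\d+)\)"
)


def parse_gpu_error(text):
    """Return message, source file, line and code from a log, or None."""
    match = ERROR_LINE.search(text)
    if match is None:
        return None
    return {
        "message": match.group("message"),
        "file": match.group("file"),
        "line": int(match.group("line")),
        "code": int(match.group("code")),
    }


if __name__ == "__main__":
    log = (
        "KERNEL_ERROR: an illegal memory access was encountered in "
        "/usr/local/git_source/relion-4.0-20220817/src/acc/acc_ml_optimiser_impl.h "
        "at line 428 (error-code 700)"
    )
    print(parse_gpu_error(log))
```

Note that the source file and line differ between the two reports, so the same CUDA error is being detected at different points in the code.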

eariascib commented 1 year ago

Hi,

Since we reported the issue, we had to update the compiler and CUDA to newer versions (gcc 9.4; CUDA 11.1) for unrelated reasons, and we compiled relion-4 again. That appears to have solved the issue, although we have not done extensive tests.

HTH, Ernesto.
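When comparing setups as in the comment above, it helps to record which CUDA toolkit is actually on the PATH. This is a small sketch under two assumptions: that `nvcc --version` prints a line containing `release X.Y` (the usual format, but not guaranteed), and that the `nvcc` found on the PATH belongs to the toolkit RELION was built with, which may not hold on machines with several CUDA installations. Both function names are hypothetical.

```python
import re
import subprocess


def parse_nvcc_release(text):
    """Extract (major, minor) from nvcc version output, or None."""
    match = re.search(r"release (\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def installed_cuda_release(nvcc="nvcc"):
    """Run `nvcc --version`; return (major, minor) or None if unavailable."""
    try:
        result = subprocess.run(
            [nvcc, "--version"], capture_output=True, text=True, check=True
        )
    except (OSError, subprocess.CalledProcessError):
        # nvcc is missing from the PATH or exited with an error.
        return None
    return parse_nvcc_release(result.stdout)


if __name__ == "__main__":
    print(installed_cuda_release())
```

Posting this value together with the gcc version makes it easier to see which toolchain combinations reproduce the error and which do not.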

drawoliver commented 1 year ago

Thank you for the suggestion.

I am running Ubuntu 20.04.01; gcc is 9.4.0, the latest version available via apt. CUDA is 11.6 (already downgraded from 11.7). I am compiling against a standalone build of OpenMPI, version 4.1.4 (to avoid update conflicts).

I will try some more permutations...!