eariascib closed this issue 10 months ago
What happens if you ignore CTF until first peak, or completely disable CTF correction?
What if you set --maxsig 100?
Thanks for the quick reply.
I get the same error with CTF correction disabled, and also with it enabled, both with and without ignoring the CTF until the first peak.
I've also tried adding the --maxsig 100 option, with and without CTF correction, and the error is very similar. It has an additional line at the end of the run.out file, so I'm pasting it here in case it's useful:
Will distribute threads over devices 0 1
Thread 0 mapped to device 0
Thread 1 mapped to device 1
Thread 2 mapped to device 0
Thread 3 mapped to device 1
Thread 4 mapped to device 0
Thread 5 mapped to device 1
Thread 6 mapped to device 0
Thread 7 mapped to device 1
Thread 8 mapped to device 0
Thread 9 mapped to device 1
Thread 10 mapped to device 0
Thread 11 mapped to device 1
Running CPU instructions in double precision.
Gradient optimisation iteration 3 of 200 with 200 particles (Step size 0.9) 000/??? sec ~~(,_,"> oo (1536B) (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (65536B) (66560B) (65536B) (66560B) (65536B) (66560B) (65536B) <65536B> [9988082688B] = 9988830208B KERNEL_ERROR: an illegal memory access was encountered in 
/usr/local/relion_4.0_beta/src/acc/acc_ml_optimiser_impl.h at line 428 (error-code 77)
Still having the same problem — any thoughts / help / assistance please?
KERNEL_ERROR: an illegal memory access was encountered in /usr/local/git_source/relion-4.0-20220817/src/acc/acc_ml_optimiser_impl.h at line 428 (error-code 700)
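To compare the two reports side by side, the error lines can be reduced to source file, line number and error code. The following is a minimal sketch, not part of RELION: the function name `parse_gpu_error` and the regular expression are hypothetical, written against the two messages quoted in this thread. As far as I know, 77 and 700 are the older and newer numeric values of the same CUDA runtime error (illegal address), but treat that mapping as an assumption.

```python
import posixpath
import re

# Matches lines such as:
#   KERNEL_ERROR: an illegal memory access was encountered in <path> at line 428 (error-code 77)
# \s+ after "in" tolerates the path being wrapped onto the next line.
ERROR_RE = re.compile(
    r"ERROR: (?P<message>.+?) in\s+(?P<path>\S+) at line (?P<line>\d+) "
    r"\(error-code (?P<code>\d+)\)"
)


def parse_gpu_error(text):
    """Return a dict describing the first GPU error found in text, or None."""
    match = ERROR_RE.search(text)
    if match is None:
        return None
    return {
        "message": match.group("message"),
        # Keep only the file name so different install prefixes compare equal.
        "file": posixpath.basename(match.group("path")),
        "line": int(match.group("line")),
        # Assumption: 77 (older CUDA) and 700 (newer CUDA) denote the same error.
        "code": int(match.group("code")),
    }


first = parse_gpu_error(
    "KERNEL_ERROR: an illegal memory access was encountered in \n"
    "/usr/local/relion_4.0_beta/src/acc/acc_ml_optimiser_impl.h at line 428 (error-code 77)"
)
second = parse_gpu_error(
    "KERNEL_ERROR: an illegal memory access was encountered in "
    "/usr/local/git_source/relion-4.0-20220817/src/acc/acc_ml_optimiser_impl.h "
    "at line 428 (error-code 700)"
)
same_location = (first["file"], first["line"]) == (second["file"], second["line"])
```

Here `same_location` is true: both installations fail at the same source line, and only the numeric error code differs.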
Hi,
Since we reported the issue, and for unrelated reasons, we had to update the compiler and CUDA to newer versions (gcc 9.4; CUDA 11.1) and compile relion-4 again. That appears to have solved the issue, although we have not done extensive tests.
HTH, Ernesto.
Thank you for the suggestion.
I am running on Ubuntu 20.04.01. gcc is 9.4.0, the latest version available via apt. CUDA is 11.6 (already downgraded from 11.7). I am compiling against a standalone build of OpenMPI, version 4.1.4 (to avoid update conflicts).
I will try some more permutations...!
Good morning,
I am trying to run the new VDAM 2D classification with some negative staining data and I am getting a GPU illegal memory access error. Strikingly, this only happens with negative staining data. The program seems to run fine with cryo data.
We collected the data using a TVIPS CMOS detector on a 120 kV microscope. The images are saved in TIF format, but we converted them to MRC using EMAN2 (e2proc2d.py *.tif @.mrc). After that we imported the images and performed all the preprocessing in Relion (CTFfind, Topaz, particle extraction without inverting the contrast).
Interestingly, the classical algorithm (EM) works fine, and it's only the new one that gets stalled at iter 3 and gives the error.
The only difference (besides the nature of the data and camera) during processing is that for the negative staining data we do not run motioncor. Maybe the new algorithm is more sensitive to hot pixels or other outliers in the images that are normally discarded/corrected during motioncor?
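One way to test the hot-pixel hypothesis is to count extreme pixel values in a micrograph or particle image before classification. The sketch below is only an illustration under stated assumptions: `find_outlier_pixels` is a hypothetical helper, the pixel values are assumed to have been loaded already into a flat list of numbers (reading the MRC file is left out), and the 6-sigma threshold is an arbitrary choice. It uses the median and the median absolute deviation so that the outliers themselves do not inflate the spread estimate.

```python
import statistics


def find_outlier_pixels(pixels, n_sigma=6.0):
    """Return indices of pixels further than n_sigma robust sigmas from the median."""
    med = statistics.median(pixels)
    # Median absolute deviation: a spread estimate that ignores a few extreme values.
    mad = statistics.median(abs(p - med) for p in pixels)
    if mad == 0:
        # Degenerate case (at least half the pixels are identical): no usable scale.
        return []
    # 1.4826 * MAD approximates the standard deviation for Gaussian noise.
    robust_sigma = 1.4826 * mad
    return [i for i, p in enumerate(pixels) if abs(p - med) > n_sigma * robust_sigma]


# Toy data: nine ordinary values and one hot pixel at index 9.
toy_image = [10, 11, 9, 10, 12, 10, 11, 9, 10, 1000]
hot = find_outlier_pixels(toy_image)
```

If images from the negative-stain set show such outliers and the cryo images do not, that would support the idea that the uncorrected detector artefacts matter here; it would not by itself explain the GPU error.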
I would be very grateful if you could give me some assistance to solve this issue.
Environment:
OS: Linux Mint 18.1, Linux kernel 4.4.0-53-generic
MPI runtime: mpich-3.1.4
RELION version: Relion4.0
Memory: 128 GB
GPU: 2x GTX 1080Ti, CUDA 8.0
Dataset:
Job options:
Full command (see note.txt in the job directory): we've tried several options, but they all get stuck at iter 3.
Error message:
This is the output of the out file:
Will distribute threads over devices 0 1
Thread 0 mapped to device 0
Thread 1 mapped to device 1
Thread 2 mapped to device 0
Thread 3 mapped to device 1
Thread 4 mapped to device 0
Thread 5 mapped to device 1
Thread 6 mapped to device 0
Thread 7 mapped to device 1
Thread 8 mapped to device 0
Thread 9 mapped to device 1
Thread 10 mapped to device 0
Thread 11 mapped to device 1
Running CPU instructions in double precision.
On host osaka: free scratch space = 1862.08 Gb.
Copying particles to scratch directory: /scratch2/ernesto/relion/relionvolatile/
2/ 2 sec ............................................................~~(,_,">
For opticsgroup 1, there are 18428 particles on the scratch disk.
Estimating initial noise spectra
0/ 0 sec ............................................................~~(,_,">
Estimating accuracies in the orientational assignment ...
1/ 1 sec ............................................................~~(,_,">
Auto-refine: Estimated accuracy angles= 20.1 degrees; offsets= 25.48 Angstroms
CurrentResolution= 39.8222 Angstroms, which requires orientationSampling of at least 12.8571 degrees for a particle of diameter 350 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 67200
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 5.6 NrTranslations= 21
Oversampling= 1 NrHiddenVariableSamplingPoints= 2150400
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 2.8 NrTranslations= 84
Gradient optimisation iteration 1 of 200 with 200 particles (Step size 0.9)
1/ 1 sec ............................................................~~(,_,">
Maximization ...
0/ 0 sec ............................................................~~(,_,">
CurrentResolution= 71.68 Angstroms, which requires orientationSampling of at least 22.5 degrees for a particle of diameter 350 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 67200
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 5.6 NrTranslations= 21
Oversampling= 1 NrHiddenVariableSamplingPoints= 2150400
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 2.8 NrTranslations= 84
Gradient optimisation iteration 2 of 200 with 200 particles (Step size 0.9)
0/ 0 sec ............................................................~~(,_,">
Maximization ...
0/ 0 sec ............................................................~~(,_,">
CurrentResolution= 71.68 Angstroms, which requires orientationSampling of at least 22.5 degrees for a particle of diameter 350 Angstroms
Oversampling= 0 NrHiddenVariableSamplingPoints= 67200
OrientationalSampling= 11.25 NrOrientations= 32
TranslationalSampling= 5.6 NrTranslations= 21
Oversampling= 1 NrHiddenVariableSamplingPoints= 2150400
OrientationalSampling= 5.625 NrOrientations= 256
TranslationalSampling= 2.8 NrTranslations= 84
Gradient optimisation iteration 3 of 200 with 200 particles (Step size 0.9) 000/??? sec ~~(,_,"> oo (1536B) (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) (512B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) [512B] (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (512B) (1536B) (65536B) (66560B) (65536B) (66560B) (65536B) <65536B> [9915867136B] = 9916478464B
######################
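The long run of bracketed sizes printed before the failure looks like a dump of GPU memory blocks, ending in a total after the `=` sign. A small script can tally it, for example to see how much of the reported total sits in the single largest block. This is a sketch under assumptions: `summarise_dump` is a hypothetical helper, the meaning of the three bracket styles `( )`, `[ ]` and `< >` is not stated anywhere in this thread, so the code only groups sizes by bracket style, and it assumes the trailing figure is meant to be the sum of the listed blocks.

```python
import re
from collections import defaultdict

# One block looks like (1536B), [512B] or <65536B>.
BLOCK_RE = re.compile(r"([(\[<])(\d+)B[)\]>]")
# The reported total looks like "= 9916478464B".
TOTAL_RE = re.compile(r"=\s*(\d+)B")


def summarise_dump(text):
    """Return (bytes per bracket style, largest block, reported total or None)."""
    per_style = defaultdict(int)
    largest = 0
    for opener, size in BLOCK_RE.findall(text):
        per_style[opener] += int(size)
        largest = max(largest, int(size))
    total_match = TOTAL_RE.search(text)
    reported = int(total_match.group(1)) if total_match else None
    return dict(per_style), largest, reported


# Synthetic example in the same notation (not taken from the log above).
sample = "(1536B) (512B) [512B] <65536B> [1000B] = 69096B"
per_style, largest, reported = summarise_dump(sample)
matches_total = sum(per_style.values()) == reported
```

Pasting the real line from run.out into `summarise_dump` gives the same breakdown for the failing job. Progress-bar characters and text such as "(Step size 0.9)" are ignored because they do not match the size pattern.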
And this is the err file:
ERROR: an illegal memory access was encountered in /usr/local/relion_4.0_beta/src/acc/cuda/custom_allocator.cuh at line 175 (error-code 77)
in: /usr/local/relion_4.0_beta/src/acc/cuda/cuda_settings.h, line 65
ERROR:
A GPU-function failed to execute.
If this occured at the start of a run, you might have GPUs which are incompatible with either the data or your installation of relion. If you
If this occurred at the middle or end of a run, it might be that
If none of the above applies, please report the error to the relion developers at github.com/3dem/relion/issues