daniel-s-d-larsson closed this issue 4 years ago
All of the reported error messages point to an issue in the runtime environment. Have you recently done a driver/hardware upgrade?
If all your refinements are failing, then you have a very reproducible issue, which is good. You should go back to a RELION version that you know was working fine; you can git checkout a release commit. If you still get this error, then you'll know something has changed in your environment that is causing it.
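For example, rolling back and rebuilding could look roughly like this (3.0.8 is just a placeholder tag and /path/to/relion a placeholder path; run git tag in your RELION checkout to see which releases are available, and rebuild in a clean build directory as usual):

cd /path/to/relion        # your RELION source checkout
git fetch --tags
git checkout 3.0.8        # replace with a tag/commit you know was working
rm -rf build && mkdir build && cd build
cmake ..
make -j 8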
I recently patched my system, since I had a long backlog of updates. That was probably a bad idea and is likely the culprit behind these problems.
This may be GPU related. I found this issue https://github.com/3dem/relion/issues/436, which has quite similar error output; it turned out to be caused by RAM problems on the GPU. Is there a way to test the GPU memory, in case this is hardware related? I'm currently running a job without GPUs and it has reached iteration 4 without any issues.
Checking /var/log/apt/history.log, it seems that the system upgraded the Nvidia driver package nvidia-384-dev:amd64 from version 390.116-0ubuntu0.18.04.1 to version 390.116-0ubuntu0.18.04.3. From the name, it doesn't seem to be a major update, although I'm not very familiar with these things. Should I try to revert to version 390.116-0ubuntu0.18.04.1? I'm not really sure how to proceed.
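If I do try the downgrade, I assume it would be something along these lines with apt (version strings copied from history.log; I'm not sure whether --allow-downgrades is required, or whether the old build is still available in the repositories):

apt-cache policy nvidia-384-dev                                               # list available versions
sudo apt install --allow-downgrades nvidia-384-dev=390.116-0ubuntu0.18.04.1  # pin back to the earlier build
sudo reboot                                                                   # reload the kernel module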
OK, I have now run https://github.com/ihaque/memtestG80 for 100 iterations on each of the two GTX 1080 Ti cards without any errors, so the hardware seems to be fine.
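Roughly how I invoked it (the positional arguments here are MiB of GPU memory to test and number of test iterations; see the memtestG80 README for the exact argument syntax and GPU-selection flag):

CUDA_VISIBLE_DEVICES=0 ./memtestG80 10240 100   # first GTX 1080 Ti, ~10 GiB, 100 iterations
CUDA_VISIBLE_DEVICES=1 ./memtestG80 10240 100   # second card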
So, I tried reverting the Ubuntu package of the Nvidia drivers to the previous one, but to no avail.
I therefore decided to run the hardware stress test for 1000 iterations, and after about 120 iterations one of the cards started to throw lots of errors and eventually locked up completely. So this was indeed a hardware problem.
Sounds like the issue is triggered by a temperature-induced hardware problem. Glad you found it.
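If anyone hits something similar: a quick way to confirm a temperature-related failure is to log GPU temperatures while the stress test (or a RELION job) is running, e.g. with nvidia-smi:

nvidia-smi --query-gpu=index,temperature.gpu,utilization.gpu,memory.used --format=csv -l 5 > gpu_temp_log.csv   # sample both cards every 5 s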
I am trying to run auto-refine jobs, but they either crash or stall on the first expectation step (100% CPU utilization, 0% GPU utilization, nothing written to run.out). When they crash, the error is either a segmentation fault or a custom CUDA allocator error.
Things I tested:
Could this be a hardware problem?
Environment:
Dataset:
Job options:
Example 1 Run command:
relion_refine_mpi --o Refine3D/job124/run --auto_refine --split_random_halves --i Select/job120/particles.star --ref Refine3D/job112/run_class001.mrc --ini_high 30 --dont_combine_weights_via_disc --no_parallel_disc_io --scratch_dir /scratch/relion1 --pool 100 --pad 1 --ctf --ctf_corrected_ref --particle_diameter 310 --flatten_solvent --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 16 --gpu "" --maxsig 1000 --pipeline_control Refine3D/job124/
Error message (run.out):
Error message (run.err):
Example 2 Run command:
relion_refine_mpi --o Refine3D/job125/run --auto_refine --split_random_halves --i CtfRefine/job116/particles_ctf_refine.star --ref Refine3D/job112/run_class001.mrc --ini_high 30 --dont_combine_weights_via_disc --no_parallel_disc_io --scratch_dir /scratch/relion1 --pool 100 --pad 1 --ctf --ctf_corrected_ref --particle_diameter 310 --flatten_solvent --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 16 --gpu "" --maxsig 1000 --pipeline_control Refine3D/job125/
Error message:
Example 3 Run command:
relion_refine_mpi --o Refine3D/job127/run --auto_refine --split_random_halves --i CtfRefine/job116/particles_ctf_refine.star --ref Refine3D/job112/run_class001.mrc --ini_high 30 --dont_combine_weights_via_disc --no_parallel_disc_io --scratch_dir /scratch/relion1 --pool 100 --pad 1 --ctf --ctf_corrected_ref --particle_diameter 310 --flatten_solvent --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 16 --gpu "" --maxsig 1000 --pipeline_control Refine3D/job127/
Error message: