3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Crash with relion and GPU #1084

Open relion67 opened 7 months ago

relion67 commented 7 months ago

Hello, I'm writing about a problem we're having with RELION. We're trying to run jobs with RELION version 4.0-beta-1-commit-1fb5b8 on a CentOS 7.6 system. The machine has 4 graphics cards (4 GPUs), and very regularly, when we tell the program to use all 4 GPUs, it crashes. If we use only 2 GPUs, the job takes a seemingly endless amount of time to run. Recently, we had another problem of this type, using these settings for 3D refinement in RELION 4.0:

GPU 0,1 MPI 3 THREADS 6

Error message:

000/??? sec ~~(,_,"> [oo]
ERROR: CudaCustomAllocator out of memory
[requestedSpace: 340660736 B]
[largestContinuousFreeSpace: 80307200 B]
[totalFreeSpace: 80307200 B]
(113152B) (114688B) (113152B) (114688B) (113152B) (114688B) (97472000B) (194943488B) (194943488B) (194943488B) (194943488B) (389886464B) (44544B) (2048B) (22528B) (5927424B) (173181440B) (346362880B) (346362880B) (346362880B) (346362880B) (692725248B) (44544B) (2048B) (22528B) (10531328B) (170330624B) (340660736B) (340660736B) (340660736B) [80307200B] = 4808391168B

ERROR: CudaCustomAllocator out of memory
[requestedSpace: 348200448 B]
[largestContinuousFreeSpace: 129793536 B]
(113152B) (114688B) (113152B) (114688B) (113152B) (114688B) (174233088B) (348465664B) (348465664B) (348465664B) (348465664B) (696930816B) (44544B) (2048B) (22528B) (10595328B) (170590720B) (341181440B) (341181440B) (341181440B) (341181440B) (682362880B) (44544B) (2048B) (22528B) (10374144B) (174100480B) [129793536B] = 4808391168B

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= PID 44322 RUNNING AT serveur-linuxvixion
= EXIT CODE: 139
= CLEANING UP REMAINING PROCESSES
= YOU CAN IGNORE THE BELOW CLEANUP MESSAGES

YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

RELION version: 4.0-beta-1-commit-1fb5b8
Precision: BASE=double
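The first allocator dump in the log can be cross-checked by summing its block list. The sketch below is only an illustration and is not part of RELION. It assumes that entries in parentheses, such as `(113152B)`, are blocks in use and that entries in square brackets, such as `[80307200B]`, are free blocks; this reading is consistent with the total printed after the `=` sign. The function name `summarize` is hypothetical.

```python
import re

# First allocator dump from the log, copied verbatim.
DUMP = (
    "ERROR: CudaCustomAllocator out of memory "
    "[requestedSpace: 340660736 B] "
    "[largestContinuousFreeSpace: 80307200 B] "
    "[totalFreeSpace: 80307200 B] "
    "(113152B) (114688B) (113152B) (114688B) (113152B) (114688B) "
    "(97472000B) (194943488B) (194943488B) (194943488B) (194943488B) "
    "(389886464B) (44544B) (2048B) (22528B) (5927424B) (173181440B) "
    "(346362880B) (346362880B) (346362880B) (346362880B) (692725248B) "
    "(44544B) (2048B) (22528B) (10531328B) (170330624B) (340660736B) "
    "(340660736B) (340660736B) [80307200B] = 4808391168B"
)


def summarize(dump):
    """Sum the block list of one CudaCustomAllocator dump (hypothetical helper)."""
    # Header fields look like "[requestedSpace: 340660736 B]".
    header = {k: int(v) for k, v in re.findall(r"\[(\w+): (\d+) B\]", dump)}
    # Assumption: "(nB)" is a block in use, "[nB]" is a free block.
    used = [int(n) for n in re.findall(r"\((\d+)B\)", dump)]
    free = [int(n) for n in re.findall(r"\[(\d+)B\]", dump)]
    # Pool size as printed by the allocator after the "=" sign.
    reported = re.search(r"= (\d+)B", dump)
    return {
        "requested": header.get("requestedSpace"),
        "used": sum(used),
        "free": sum(free),
        "largest_free": max(free, default=0),
        "pool": sum(used) + sum(free),
        "reported_pool": int(reported.group(1)) if reported else None,
    }


summary = summarize(DUMP)
for name, value in summary.items():
    print(f"{name:>14}: {value} B")
```

Under that assumption, the blocks add up to exactly the reported 4808391168 B (about 4.48 GiB), of which 4728083968 B are in use. The failing request of 340660736 B (about 325 MiB) is larger than the single remaining free block of 80307200 B (about 77 MiB), and `largestContinuousFreeSpace` equals `totalFreeSpace`, so in this dump the pool looks exhausted rather than fragmented.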

We'd like to upgrade to a newer version of RELION, but do you know whether the version of CentOS currently installed on our machine imposes any constraints? Thank you in advance for your help. Have a nice day!

biochem-fan commented 7 months ago

First of all, please respect our issue template. Without details of your dataset and hardware, we cannot provide a good answer.

This is a very common question. Please search "CudaCustomAllocator" in the CCPEM mailing list https://www.jiscmail.ac.uk/cgi-bin/webadmin?A0=CCPEM.

> We'd like to upgrade to a higher version of relion

You should definitely do so. Why are you still using the beta version of 4.0?

relion67 commented 7 months ago

Hi, sorry for the way I presented the issue. We'll check what you suggested and let you know afterwards. Best regards