3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

GPU job from RELION not accepted on older card (sm 3.5) #48

Closed bforsbe closed 7 years ago

bforsbe commented 8 years ago

Originally reported by: AndreHeuer (Bitbucket: Xenoprime, GitHub: Xenoprime)


We have an older graphics card in a workstation that should still be able to do a good job for non-Titan data sets.

Card:

Despite no problems during the build, we were unable to run any GPU jobs from RELION.

Problem:

Note:

We tried to narrow down the problem, with no positive result:

Question:

Suggestion:

More detailed description of where the GPU job hangs / system:

Top:

  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
13901 heuer  20   0  327m  22m 7856 R 99.9  0.0 0:41.34 relion_refine_mpi --o Class2D/j..
13903 heuer  20   0 72.3g  18m 9224 R 99.9  0.0 0:41.34 relion_refine_mpi --o Class2D/j..
13902 heuer  20   0 72.3g  20m 9312 R 99.5  0.0 0:41.31 relion_refine_mpi --o Class2D/j..

Simple GPU info query:

You have 2 nVidia GPGPU.
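As a cross-check on what the CUDA toolkit itself reports for these cards, here is a minimal sketch using the deviceQuery sample that ships with the toolkit (the samples path assumes a default install and is not taken from this report):

#!bash 
# List each card's compute capability with the stock CUDA deviceQuery sample;
# an sm 3.5 card should report "CUDA Capability Major/Minor version number: 3.5".
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery | grep -i "capability"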


bforsbe commented 8 years ago

Original comment by AndreHeuer (Bitbucket: Xenoprime, GitHub: Xenoprime):


Resolved by the recent build (f1391d3, v2.0.b9).

bforsbe commented 8 years ago

Original comment by AndreHeuer (Bitbucket: Xenoprime, GitHub: Xenoprime):


I just rebuilt and retried. It works fine with the most recent build (f1391d3, v2.0.b9).

Thank you Bjoern & Sjors for the quick reply; indeed, this was fixed by the most recent build from ~4 h ago.

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Also, when using relion_refine_mpi, you should specify the number of working "slave" ranks, and add one rank to act as "master", as always. Since I get the impression you want to use both GPUs, you should use

#!bash 
mpirun -n 3 relion_refine_mpi --gpu

and avoid specifying device-selection syntax like --gpu 0, which will tell the first rank to use only the GPU with index 0.
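If explicit control is wanted instead, relion_refine accepts a colon-separated --gpu list that maps consecutive slave ranks to device indices (as I read the documented syntax); a hedged sketch for two slaves on two cards:

#!bash 
# Pin slave rank 1 to GPU 0 and slave rank 2 to GPU 1 (colon-separated fields
# map to consecutive slave ranks); omit the argument to let RELION distribute.
mpirun -n 3 relion_refine_mpi --gpu 0:1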

You probably just tried a single slave as part of your trials to find the cause of this unwanted behavior, but I still want to emphasize that running mpirun -n 2 is effectively the same as not using MPI at all, and should be avoided. While you may hide some latency, you incur communication and memory penalties.
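To illustrate that note, a minimal sketch of the two invocations being compared (all options other than --gpu omitted):

#!bash 
# One master + one slave: only a single rank does compute work, so this adds
# MPI communication/memory overhead without any extra throughput.
mpirun -n 2 relion_refine_mpi --gpu
# The non-MPI binary does the same amount of work without that overhead.
relion_refine --gpu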

This is not related to the issue you are seeing, but is worth noting.

bforsbe commented 8 years ago

Original comment by Sjors Scheres (Bitbucket: scheres, GitHub: scheres):


Indeed, it sounds like the issue that was resolved yesterday...

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Which beta version was this? Was it < v2.0.b8?