3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

GPU job from RELION not accepted on older card (sm 3.5) #48

Closed bforsbe closed 7 years ago

bforsbe commented 8 years ago

Originally reported by: AndreHeuer (Bitbucket: Xenoprime, GitHub: Xenoprime)


We have an older graphics card in a workstation that should still be able to do a good job for non-Titan data sets.

Card:

Despite no problems during the build, we were unable to run any GPU jobs from RELION.

Problem:

Note:

We tried to narrow down the problem, with no positive result:

Question:

Suggestion:

More detailed description of where the GPU job hangs / system:

Top:

  PID USER   PR  NI  VIRT  RES  SHR S %CPU %MEM   TIME+ COMMAND
13901 heuer  20   0  327m  22m 7856 R 99.9  0.0 0:41.34 relion_refine_mpi --o Class2D/j..
13903 heuer  20   0 72.3g  18m 9224 R 99.9  0.0 0:41.34 relion_refine_mpi --o Class2D/j..
13902 heuer  20   0 72.3g  20m 9312 R 99.5  0.0 0:41.31 relion_refine_mpi --o Class2D/j..

Simple GPU info query:

You have 2 nVidia GPGPU.
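As a cross-check on what the CUDA toolkit itself reports for these cards, here is a minimal sketch using the deviceQuery sample that ships with the toolkit (the samples path assumes a default install and is not taken from this report):

#!bash 
# List each card's compute capability with the stock CUDA deviceQuery sample;
# an sm 3.5 card should report "CUDA Capability Major/Minor version number: 3.5".
cd /usr/local/cuda/samples/1_Utilities/deviceQuery
make
./deviceQuery | grep -i "capability"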


bforsbe commented 8 years ago

Original comment by AndreHeuer (Bitbucket: Xenoprime, GitHub: Xenoprime):


Resolved by the recent build (f1391d3, v2.0.b9).

bforsbe commented 8 years ago

Original comment by AndreHeuer (Bitbucket: Xenoprime, GitHub: Xenoprime):


I just rebuilt and retried. It works fine with the most recent build (f1391d3, v2.0.b9).

Thank you Bjoern & Sjors for the quick reply; indeed, this was fixed by the most recent build from ~4 h ago.

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Also, when using relion_refine_mpi, you should specify the number of working "slave" ranks, and add one rank to act as "master", as always. Since I get the impression you want to use both GPUs, you should use

#!bash 
mpirun -n 3 relion_refine_mpi --gpu

and avoid specifying device-selection syntax like --gpu 0, which will tell the first rank to use only the GPU with index 0.
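If explicit control is wanted instead, relion_refine accepts a colon-separated --gpu list that maps consecutive slave ranks to device indices (as I read the documented syntax); a hedged sketch for two slaves on two cards:

#!bash 
# Pin slave rank 1 to GPU 0 and slave rank 2 to GPU 1 (colon-separated fields
# map to consecutive slave ranks); omit the argument to let RELION distribute.
mpirun -n 3 relion_refine_mpi --gpu 0:1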

You probably just tried a single slave as part of your trials to find the cause of this unwanted behavior, but I still want to emphasize that running mpirun -n 2 is effectively the same as not using MPI at all, and should be avoided. While you may hide some latency, you incur communication and memory penalties.
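To illustrate that note, a minimal sketch of the two invocations being compared (all options other than --gpu omitted):

#!bash 
# One master + one slave: only a single rank does compute work, so this adds
# MPI communication/memory overhead without any extra throughput.
mpirun -n 2 relion_refine_mpi --gpu
# The non-MPI binary does the same amount of work without that overhead.
relion_refine --gpu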

This is not related to the issue you are seeing, but is worth noting.

bforsbe commented 8 years ago

Original comment by Sjors Scheres (Bitbucket: scheres, GitHub: scheres):


Indeed, it sounds like the issue that was resolved yesterday...

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):


Which beta version was this? Was it < v2.0.b8?