3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Fail with GPU acceleration #786

Closed: Neutrino0532 closed this issue 3 years ago

Neutrino0532 commented 3 years ago

When I run a GPU-accelerated step, the job crashes before the computation starts, whether I use one GPU or several. Without GPU acceleration the job runs normally, only slowly.

Environment:

Dataset:

Job options:

Error message:

________________________________________________________________________________________________________________________________________
ERROR: the provided PTX was compiled with an unsupported toolchain. in /home/user/software/relion/src/projector.cpp at line 204 (error-code 222)
in: /home/user/software/relion/src/acc/cuda/cuda_settings.h, line 67
ERROR: 

A GPU-function failed to execute.

 If this occured at the start of a run, you might have GPUs which
are incompatible with either the data or your installation of relion.
If you 

    -> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
       and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), 
       this may happen.

    -> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
       at least compute 3.5. You may be trying to use a GPU older than
       this. If you have multiple generations, try specifying --gpu <X>
       with X=0. Then try X=1 in a new run, and so on. The numbering of
       GPUs may not be obvious from the driver or intuition. For a list
       of GPU compute generations, see 

       en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

    -> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
       as to not require this, and may thus have unforeseen requirements
       when run in this mode. If you think it is nonetheless necessary,
       please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

    -> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is 
       subject to many restrictions, and relion is written to work within
       common restraints. If you have exotic data or settings, unexpected
       configurations may occur. See also above point regarding 
       double precision.
If none of the above applies, please report the error to the relion
developers at    github.com/3dem/relion/issues

=== Backtrace  ===
/home/user/software/relion/build/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7d) [0x55ce2e9607ad]
/home/user/software/relion/build/bin/relion_refine_mpi(+0xfb576) [0x55ce2e9b1576]
/home/user/software/relion/build/bin/relion_refine_mpi(_ZN9Projector26computeFourierTransformMapER13MultidimArrayIdES2_iibbiPKS1_b+0x1c0a) [0x55ce2e9b6e0a]
/home/user/software/relion/build/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x8f3) [0x55ce2ead1763]
/home/user/software/relion/build/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x60) [0x55ce2eae8510]
/home/user/software/relion/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3f9) [0x55ce2e97e0d9]
/home/user/software/relion/build/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x35d) [0x55ce2e98e16d]
/home/user/software/relion/build/bin/relion_refine_mpi(main+0x79) [0x55ce2e94f179]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0x7f790b5ce0b3]
/home/user/software/relion/build/bin/relion_refine_mpi(_start+0x2e) [0x55ce2e9525ae]
==================
ERROR: 

A GPU-function failed to execute.

 If this occured at the start of a run, you might have GPUs which
are incompatible with either the data or your installation of relion.
If you 

    -> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
       and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), 
       this may happen.

    -> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
       at least compute 3.5. You may be trying to use a GPU older than
       this. If you have multiple generations, try specifying --gpu <X>
       with X=0. Then try X=1 in a new run, and so on. The numbering of
       GPUs may not be obvious from the driver or intuition. For a list
       of GPU compute generations, see 

       en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

    -> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
       as to not require this, and may thus have unforeseen requirements
       when run in this mode. If you think it is nonetheless necessary,
       please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

    -> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is 
       subject to many restrictions, and relion is written to work within
       common restraints. If you have exotic data or settings, unexpected
       configurations may occur. See also above point regarding 
       double precision.
If none of the above applies, please report the error to the relion
developers at    github.com/3dem/relion/issues

Abort(1) on node 3 (rank 3 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1) - process 3
________________________________________________________________________________________________________________________________________

biochem-fan commented 3 years ago

> ERROR: the provided PTX was compiled with an unsupported toolchain

This means that your binary is incompatible with your card.

Did you pass CUDA_ARCH to cmake? Delete all files in your build folder and try again with CUDA_ARCH. Also make sure cmake picks up the right version of the CUDA SDK; you might be linking against an older version of the CUDA runtime.
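As a rough illustration of the clean rebuild suggested here, the sketch below prints the commands one might run. The source path is the one shown in the error log; the compute capability "8.6" is a hypothetical placeholder (the thread does not say which card is in use), and the helper names `cuda_arch_value` and `rebuild_commands` are made up for this example. It assumes the usual convention that CUDA_ARCH is the compute capability with the dot removed, as in the `-DCUDA_ARCH=50` example from RELION's own error message.

```python
import shlex
from typing import List


def cuda_arch_value(compute_capability: str) -> str:
    """Turn a compute capability such as '8.6' into the CUDA_ARCH form '86'."""
    major, minor = compute_capability.strip().split(".")
    return f"{int(major)}{int(minor)}"


def rebuild_commands(source_dir: str, compute_capability: str) -> List[str]:
    """Return shell commands for a clean out-of-tree rebuild (nothing is executed)."""
    build_dir = shlex.quote(f"{source_dir}/build")
    arch = cuda_arch_value(compute_capability)
    return [
        f"rm -rf {build_dir}",    # drop stale CMake cache and objects
        f"mkdir -p {build_dir}",  # recreate an empty build folder
        f"cd {build_dir} && cmake -DCUDA_ARCH={arch} ..",
        f"cd {build_dir} && make",
    ]


if __name__ == "__main__":
    # '8.6' is an assumed value; substitute the capability of your own GPU.
    for command in rebuild_commands("/home/user/software/relion", "8.6"):
        print(command)
```

The commands are only printed, so they can be reviewed before anything is deleted.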

> Driver: 460.84 CUDA: 11.2.67

This combination should be fine, but please make sure it is what is actually being used, for example by checking with nvidia-smi.
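One possible way to script that check is sketched below. It calls `nvidia-smi --query-gpu=name,driver_version --format=csv,noheader` and parses the result. The function names and the GPU name in the usage note are hypothetical. This only reports the driver that is loaded; which CUDA toolkit cmake picked up has to be checked separately (for instance with `nvcc --version` or in the cmake output).

```python
import subprocess
from typing import List, Tuple


def parse_gpu_query(csv_text: str) -> List[Tuple[str, str]]:
    """Parse 'name, driver_version' lines into (name, driver_version) pairs."""
    rows = []
    for line in csv_text.splitlines():
        if not line.strip():
            continue
        # Split on the last comma so a comma inside the GPU name is harmless.
        name, driver = [field.strip() for field in line.rsplit(",", 1)]
        rows.append((name, driver))
    return rows


def version_tuple(version: str) -> Tuple[int, ...]:
    """Convert '460.84' into (460, 84) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def query_gpus() -> List[Tuple[str, str]]:
    """Ask nvidia-smi for each GPU's name and the loaded driver version."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,driver_version",
         "--format=csv,noheader"],
        check=True, capture_output=True, text=True,
    )
    return parse_gpu_query(result.stdout)


if __name__ == "__main__":
    for name, driver in query_gpus():
        print(f"{name}: driver {driver}")
```

For example, `parse_gpu_query("Some GPU, 460.84\n")` yields `[("Some GPU", "460.84")]`, and `version_tuple` lets the reported driver be compared against whatever minimum the installed CUDA toolkit's release notes list.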

Neutrino0532 commented 3 years ago

> Delete all files in your build folder and try again with CUDA_ARCH

Setting -DCUDA_ARCH works! Thanks for your timely support!