3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

failed to create cufft plan #1080

Open walidabualafia opened 9 months ago

walidabualafia commented 9 months ago


I have been using relion/5.0-beta for a while now. One of my users ran a very long job that exited with a "failed to create cufft plan" error. I am not sure what is causing this. Most functionality behaves correctly, and this error only came up while this user was running RELION.

Environment:

Dataset:

Job options:

Error message:

--------------------------------------------------------------------------
WARNING: There was an error initializing an OpenFabrics device.

  Local host:   nodegpu214
  Local device: mlx5_0
--------------------------------------------------------------------------
[nodegpu214:77459] 4 more processes have sent help message help-mpi-btl-openib.txt / error in device init
[nodegpu214:77459] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
in: /path/to/relion/vendor/relion/src/projector.cpp, line 362
ERROR:
failed to create cufft plan
in: /path/to/relion/vendor/relion/src/projector.cpp, line 362
ERROR:
failed to create cufft plan
in: /path/to/relion/vendor/relion/src/projector.cpp, line 362
ERROR:
failed to create cufft plan
in: /path/to/relion/vendor/relion/src/projector.cpp, line 362
ERROR:
failed to create cufft plan
=== Backtrace  ===
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4ba969]
/path/to/relion/install/5.0/bin/relion_refine_mpi() [0x44bdd5]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbifPK13MultidimArrayIfE+0x8c3) [0x667033]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x684c2c]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e4) [0x4dbd64]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xd8) [0x4ed0f8]
/path/to/relion/install/5.0/bin/relion_refine_mpi(main+0x56) [0x4a9616]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x155537b45493]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_start+0x2e) [0x4acb7e]
==================
ERROR:
failed to create cufft plan
=== Backtrace  ===
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4ba969]
/path/to/relion/install/5.0/bin/relion_refine_mpi() [0x44bdd5]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbifPK13MultidimArrayIfE+0x8c3) [0x667033]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x684c2c]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e4) [0x4dbd64]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xd8) [0x4ed0f8]
/path/to/relion/install/5.0/bin/relion_refine_mpi(main+0x56) [0x4a9616]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x155537b45493]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_start+0x2e) [0x4acb7e]
==================
ERROR:
failed to create cufft plan
=== Backtrace  ===
=== Backtrace  ===
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4ba969]
/path/to/relion/install/5.0/bin/relion_refine_mpi() [0x44bdd5]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbifPK13MultidimArrayIfE+0x8c3) [0x667033]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x684c2c]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e4) [0x4dbd64]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xd8) [0x4ed0f8]
/path/to/relion/install/5.0/bin/relion_refine_mpi(main+0x56) [0x4a9616]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x155537b45493]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_start+0x2e) [0x4acb7e]
==================
ERROR:
failed to create cufft plan
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 4 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4ba969]
/path/to/relion/install/5.0/bin/relion_refine_mpi() [0x44bdd5]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbifPK13MultidimArrayIfE+0x8c3) [0x667033]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5c) [0x684c2c]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x3e4) [0x4dbd64]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0xd8) [0x4ed0f8]
/path/to/relion/install/5.0/bin/relion_refine_mpi(main+0x56) [0x4a9616]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x155537b45493]
/path/to/relion/install/5.0/bin/relion_refine_mpi(_start+0x2e) [0x4acb7e]
==================
ERROR:
failed to create cufft plan
[nodegpu214:77459] 3 more processes have sent help message help-mpi-api.txt / mpi-abort

biochem-fan commented 9 months ago

> I have a user who ran a very long job, which exits with a failed to create cufft plan. I am not sure what is causing this issue. Most functionality and behavior is correct, and this error just came up while the user was running relion.

Does this always happen for this particular user? What happens if the user continues the failed job?

> Box size: 720 px
> Pixel size: 0.6485 Å/px

Unless the resolution is near 1.3 Å (the Nyquist limit at 0.6485 Å/px is twice the pixel size, about 1.3 Å), down-sample the particles. Processing at the full pixel size wastes storage and processing power.
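To make the arithmetic behind this advice concrete, here is a minimal Python sketch. It assumes only the standard relation that the Nyquist limit is twice the pixel size; the function names and the target pixel size of 1.0 Å/px are hypothetical and chosen for illustration, not taken from this thread or from RELION.

```python
def nyquist_resolution(pixel_size_A: float) -> float:
    """Finest resolution (in Å) that a given pixel size can represent."""
    return 2.0 * pixel_size_A


def rescaled_box(box_px: int, pixel_size_A: float, target_pixel_size_A: float) -> int:
    """Box size covering the same physical extent at a coarser pixel size,
    rounded to the nearest even number of pixels."""
    new_box = box_px * pixel_size_A / target_pixel_size_A
    return 2 * round(new_box / 2)


box, apix = 720, 0.6485      # values quoted in this thread
target_apix = 1.0            # hypothetical target, an assumption for illustration

new_box = rescaled_box(box, apix, target_apix)
print(f"Nyquist at {apix} Å/px: {nyquist_resolution(apix):.3f} Å")
print(f"Box at ~{target_apix} Å/px: {new_box} px "
      f"(Nyquist {nyquist_resolution(target_apix):.1f} Å)")
```

Because the box is rounded to an even size, the effective pixel size after rescaling differs slightly from the requested target, so it should be recomputed from the original physical extent and the new box size.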

walidabualafia commented 9 months ago

I have not had any other users report this issue. I also asked around, and no users have seen it either.

This user encountered the error on 7 different jobs, which do not all contain the same particles. Each time she hit the error, her batch job was preempted and exited. I am not sure whether she is able to continue the job. She did not encounter the error when she ran her job with version 4.0.1-commit-7809a7.

biochem-fan commented 9 months ago

Considering that the A100 has a large amount of VRAM, it is not very likely that the program ran out of memory. Nonetheless, it is worth trying down-sampled particles. I am sure the user does not need 0.6485 Å/px. With a more reasonable pixel size, the box size would be smaller, which uses less memory and leads to faster processing.
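As a rough illustration of how strongly memory scales with box size, the sketch below estimates the size of one single-precision real-to-complex 3D Fourier volume. The padding factor of 2 and the 8 bytes per complex value are assumptions, and the function name is hypothetical. This is not a model of what RELION actually allocates at any given iteration, and it ignores the cuFFT plan workspace.

```python
def r2c_volume_bytes(box_px: int, padding: int = 2, bytes_per_complex: int = 8) -> int:
    """Bytes for one half-complex 3D volume of a padded cubic box.

    A real-to-complex transform of an n^3 array stores n * n * (n // 2 + 1)
    complex values. The padding factor here is an assumption.
    """
    n = box_px * padding
    return n * n * (n // 2 + 1) * bytes_per_complex


full = r2c_volume_bytes(720)   # box size quoted in this thread
half = r2c_volume_bytes(360)   # hypothetical 2x down-sampled box
print(f"720 px box: {full / 2**30:.2f} GiB per volume")
print(f"360 px box: {half / 2**30:.2f} GiB per volume")
print(f"ratio: {full / half:.2f}x")
```

Halving the box size cuts this estimate by roughly a factor of eight, which is why down-sampling is a cheap first experiment even when running out of memory seems unlikely.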