Open donghuachensu opened 2 months ago
Hi,
I tested relion4 but not relion5; we think the relion5 blush regularization shares similarities with spIsoNet denoising.
This problem is probably related to failing to open a port, which is 45495 in your case. spIsoNet will automatically detect a port that is not being used for communication. If the anisotropy correction for half maps can be executed correctly, this RELION-embedded spIsoNet should also work.
What I have in mind is to check what differs between the environments when you are running "spisonet.py reconstruct" and the RELION wrapper, such as whether the correct conda environment is used, or whether there are firewall problems.
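For what it's worth, this kind of free-port auto-detection is usually done by binding to port 0 and letting the kernel pick; the sketch below is illustrative and not spIsoNet's actual code. It binds explicitly to the IPv4 loopback, since the errno 97 ("Address family not supported by protocol") warnings in the log come from an IPv6 ([::]) bind attempt.

```python
import socket

def find_free_port() -> int:
    """Illustrative sketch of free-port auto-detection (not spIsoNet's code).

    Binding to port 0 asks the kernel for any unused TCP port; binding to
    the IPv4 loopback avoids the IPv6 ([::]) address family that the
    errno 97 warnings in the log complain about.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))    # port 0 -> kernel picks a free port
        return s.getsockname()[1]   # the port actually assigned
```

The usual caveat applies: the port is released when the socket closes, so another process could in principle grab it before the training workers bind to it.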
Hi,
Thanks for the reply. I also tested spIsoNet in Relion4 on my workstation (the previous run was on a cluster). Here is the error; please take a look.
The following warnings were encountered upon command-line parsing:
WARNING: Option --keep_lowres is not a valid RELION argument
04-16 14:59:29, INFO voxel_size 1.399999976158142
04-16 14:59:29, INFO voxel_size 1.399999976158142
04-16 15:04:45, INFO Limit resolution to 10.0 for spIsoNet 3D FSC calculation. You can also tune this paramerter with --limit_res .
04-16 15:04:45, INFO calculating fast 3DFSC, this will take few minutes
04-16 15:06:31, INFO voxel_size 1.399999976158142
04-16 15:09:55, INFO The Refine3D/job025 folder already exists, outputs will write into this folder
04-16 15:09:57, INFO voxel_size 1.399999976158142
04-16 15:10:00, INFO spIsoNet correction until resolution 10.0A!
Information beyond 10.0A remains unchanged
04-16 15:11:06, INFO Start preparing subvolumes!
04-16 15:11:24, INFO Done preparing subvolumes!
04-16 15:11:24, INFO Start training!
04-16 15:11:24, INFO Port number: 44689
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600701 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600701 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efd44c15d87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7efd45daed26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7efd45db227d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7efd45db2e79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600714 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600714 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55840bcd87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f5585255d26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f558525927d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f5585259e79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600837 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600837 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8aca5aed87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f8acb747d26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f8acb74b27d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f8acb74be79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
Hi, are the above two errors the same? I got one from a cluster and another one from the workstation. Any suggestions? Thanks!
Hi,
I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1...." problem means the RELION reconstruction was not performed properly. This happens because spIsoNet does not work with RELION5.
I still do not understand how the NCCL-related error happens.
Again, I want to confirm whether the anisotropy correction ("spisonet.py reconstruct") gives you these errors within the same environment.
Please also check whether https://github.com/IsoNet-cryoET/spIsoNet/issues/2 is related.
I can confirm that the anisotropy correction worked without any error on my 2-GPU workstation, which has the same type of GPU as the 4-GPU workstation where I got the second error above.
I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1...." problem means the RELION reconstruction was not performed properly. This happens because spIsoNet does not work with RELION5.
Is relion 5 compatibility on the roadmap? Or, for now, would you recommend setting up a separate installation of relion 4 for misalignment correction?
I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1...." problem means the RELION reconstruction was not performed properly. This happens because spIsoNet does not work with RELION5.
Hi, are you sure that it does not work with RELION-5? Is this a matter of spIsoNet not running at all in RELION-5, or of not giving the intended output? I have been running it through RELION-5 for initial testing before seeing this comment, and it doesn't appear to have any issues, but I can't speak to whether it is producing "correct" results.
In our hands with relion 5 it seems to run, but the unfil.mrc and corrected.mrc maps are blank, leading to a crash after one iteration. This does not happen without --external_reconstruct. I haven't tried relion 4 yet.
EDIT:
What does seem to work in relion 5 is the following: run a few iterations without --external_reconstruct, kill the refinement, then continue the refinement from the last _optimiser.star, adding in the --external_reconstruct flag. Just tried this and it seems to work, and it generates normal-looking external reconstruction volumes (I can't verify yet whether it is helping!). It also only seems to work if run in the spisonet conda env.
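The workaround above can be sketched as shell commands. The job directory (Refine3D/job012), iteration number (it005), output name, and MPI rank count are hypothetical placeholders; --continue and --external_reconstruct are standard relion_refine options.

```shell
# Sketch of the relion-5 workaround described above (paths are placeholders).
# 1) Run a few iterations WITHOUT --external_reconstruct, then kill the job.
# 2) Continue from the last optimiser file with the flag added, inside the
#    spisonet conda env (the workaround only seemed to work in that env):
conda activate spisonet
mpirun -n 3 relion_refine_mpi \
    --continue Refine3D/job012/run_it005_optimiser.star \
    --o Refine3D/job012/run_ct5 \
    --external_reconstruct
```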
EDIT2:
Scratch that, I don't think it is actually doing anything. Here is the log:
+ Making system call for external reconstruction: python /home/user/software/spIsoNet/build/lib/spIsoNet/bin/relion_wrapper.py Refine3D/job012/run_it008_half1_class001_external_reconstruct.star
iter = 008
set CUDA_VISIBLE_DEVICES=None
set CONDA_ENV=spisonet
set ISONET_WHITENING=True
set ISONET_WHITENING_LOW=10
set ISONET_RETRAIN_EACH_ITER=True
set ISONET_BETA=0.5
set ISONET_ALPHA=1
set ISONET_START_HEALPIX=3
set ISONET_ACC_BATCHES=2
set ISONET_EPOCHS=5
set ISONET_KEEP_LOWRES=False
set ISONET_LOWPASS=True
set ISONET_ANGULAR_WHITEN=False
set ISONET_3DFSD=False
set ISONET_FSC_05=False
set ISONET_FSC_WEIGHTING=True
set ISONET_START_RESOLUTION=15.0
set ISONET_KEEP_LOWRES= False
healpix = 2
symmetry = C1
mask_file = mask.mrc
pixel size = 1.125
resolution at 0.5 and 0.143 are 7.384615 and 5.538462
real limit resolution to 5.538462
+ External reconstruction finished successfully, reading result back in ...
It seemingly runs and reconstructs, but never trains a model...
EDIT 3:
Nope, it is working; it just hadn't reached fine enough angular sampling yet. Working now. One thing I notice, though: it defaults to using all GPUs. It would be better if it could default to the GPUs that have been assigned to the job in Relion (not sure if that is possible?).
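Until something like that exists, one possible workaround (a sketch; the device indices are examples, and this assumes spIsoNet inherits the standard CUDA environment variable) is to restrict which GPUs are visible before the job starts:

```shell
# Make only GPUs 0 and 1 visible to CUDA programs launched from this shell;
# child processes then cannot touch the other devices. Indices are examples;
# match them to the GPUs assigned to your RELION job.
export CUDA_VISIBLE_DEVICES=0,1
# ...then launch the RELION refinement / external reconstruction as usual.
```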
Hi All,
I used the recommended Option 1 (as follows) from the tutorial to install spIsoNet. Why can I not see the bin directory (which should contain the program spisonet.py) created directly under the spIsoNet directory after the installation?
git clone https://github.com/IsoNet-cryoET/spIsoNet.git
conda env create -f setup.yml
conda activate spisonet
Or should I just copy all the files in ~/spIsoNet/spIsoNet/bin/*.py to ~/spIsoNet/bin? Any suggestions? Thanks!
Hi,
The correct path is actually ~/spIsoNet/spIsoNet/bin/spisonet.py. All the code resides within the ~/spIsoNet/spIsoNet directory, so there is no need to move any files.
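In other words (a sketch assuming the default clone location ~/spIsoNet and the env name from the tutorial commands above):

```shell
# Run the script in place from the clone; nothing needs to be copied.
conda activate spisonet
python ~/spIsoNet/spIsoNet/bin/spisonet.py --help
```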
Thank you for your clarification!
I found that the first error above (RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable) was related to the GPU node's Compute Mode setting; it needs the Default (shared) mode (e.g., on our cluster: #SBATCH --gpu_cmode=shared). The second error above (Some NCCL operations have failed or timed out) was fixed by setting export NCCL_P2P_DISABLE=1.
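For reference, the two fixes can be sketched as follows. The Slurm directive and the environment variable come straight from this report; the surrounding job-script context is illustrative.

```shell
#!/bin/bash
# Fix 1: "CUDA-capable device(s) is/are busy or unavailable" -- the GPUs were
# not in the Default (shared) compute mode; on this reporter's Slurm cluster
# the mode is requested per job:
#SBATCH --gpu_cmode=shared

# Fix 2: NCCL allgather timeout on the multi-GPU workstation -- disable
# GPU peer-to-peer transfers before launching spIsoNet/RELION:
export NCCL_P2P_DISABLE=1
```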
Thank you for troubleshooting and reporting back.
I wonder whether one more step (pip install .) should be added as the last step of Option 3 of the Installation in spIsoNet_v1.0_Tutorial.pdf? Please confirm.
Hi Yun-Tao,
I got the following errors when using spIsoNet for external reconstruction in Relion4 or Relion5-beta. Any suggestions? Thanks!
The following warnings were encountered upon command-line parsing:
WARNING: Option --keep_lowres is not a valid RELION argument
04-15 19:26:44, INFO voxel_size 1.399999976158142
04-15 19:26:45, INFO voxel_size 1.399999976158142
04-15 19:39:35, INFO Limit resolution to 10.0 for spIsoNet 3D FSC calculation. You can also tune this paramerter with --limit_res .
04-15 19:39:35, INFO calculating fast 3DFSC, this will take few minutes
04-15 19:43:02, INFO voxel_size 1.399999976158142
04-15 19:51:23, INFO The Refine3D/job025 folder already exists, outputs will write into this folder
04-15 19:51:27, INFO voxel_size 1.399999976158142
04-15 19:51:32, INFO spIsoNet correction until resolution 10.0A! Information beyond 10.0A remains unchanged
04-15 19:54:34, INFO Start preparing subvolumes!
04-15 19:54:59, INFO Done preparing subvolumes!
04-15 19:54:59, INFO Start training!
04-15 19:55:02, INFO Port number: 45495
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
(the client-socket warning above is repeated eight times)
Traceback (most recent call last):
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/bin/spisonet.py", line 8, in
sys.exit(main())
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 50, in ddp_train
    model = model.cuda()
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 911, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 911, in
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "/home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py", line 517, in
with mrcfile.open(mrc1_cor) as d1:
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
return NewMrc(name, mode=mode, permissive=permissive,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 109, in init
self._open_file(name)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 126, in _open_file
self._iostream = open(name, self._mode + 'b')
FileNotFoundError: [Errno 2] No such file or directory: 'Refine3D/job025/corrected_run_it001_half1_class001_unfil.mrc'
in: /home/groups/kornberg/donghuac/relion/src/backprojector.cpp, line 1323
ERROR:
ERROR: there was something wrong with system call: python /home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star
=== Backtrace ===
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4e6a59]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi() [0x46434e]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x1b02) [0x5222b2]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x3e9) [0x523279]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(main+0x55) [0x4d47b5]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7ff6f5fbb555]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi() [0x4d805e]
ERROR: ERROR: there was something wrong with system call: python /home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.