Open donghuachensu opened 2 months ago
Hi,
I tested relion4 but not relion5; we think the relion5 blush regularization shares similarities with spIsoNet denoising.
This problem is probably related to failing to open a port, which is 45495 in your case. spIsoNet will automatically detect a port that is not being used for communication. If the anisotropy correction for half maps can be executed correctly, this RELION-embedded spIsoNet should also work.
What I have in mind is to check what differs between the environments when you are running "spisonet.py reconstruct" and the RELION wrapper, such as whether the correct conda environment is used, or whether there are firewall problems.
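For what it's worth, this kind of free-port auto-detection is usually done by binding to port 0 and letting the kernel pick; the sketch below is illustrative and not spIsoNet's actual code. It binds explicitly to the IPv4 loopback, since the errno 97 ("Address family not supported by protocol") warnings in the log come from an IPv6 ([::]) bind attempt.

```python
import socket

def find_free_port() -> int:
    """Illustrative sketch of free-port auto-detection (not spIsoNet's code).

    Binding to port 0 asks the kernel for any unused TCP port; binding to
    the IPv4 loopback avoids the IPv6 ([::]) address family that the
    errno 97 warnings in the log complain about.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))    # port 0 -> kernel picks a free port
        return s.getsockname()[1]   # the port actually assigned
```

The usual caveat applies: the port is released when the socket closes, so another process could in principle grab it before the training workers bind to it.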
Hi,
Thanks for the reply. I also tested spIsoNet in Relion4 on my workstation (the previous run was on a cluster). Here is the error; please take a look.
The following warnings were encountered upon command-line parsing:
WARNING: Option --keep_lowres is not a valid RELION argument
04-16 14:59:29, INFO voxel_size 1.399999976158142
04-16 14:59:29, INFO voxel_size 1.399999976158142
04-16 15:04:45, INFO Limit resolution to 10.0 for spIsoNet 3D FSC calculation. You can also tune this paramerter with --limit_res .
04-16 15:04:45, INFO calculating fast 3DFSC, this will take few minutes
04-16 15:06:31, INFO voxel_size 1.399999976158142
04-16 15:09:55, INFO The Refine3D/job025 folder already exists, outputs will write into this folder
04-16 15:09:57, INFO voxel_size 1.399999976158142
04-16 15:10:00, INFO spIsoNet correction until resolution 10.0A!
Information beyond 10.0A remains unchanged
04-16 15:11:06, INFO Start preparing subvolumes!
04-16 15:11:24, INFO Done preparing subvolumes!
04-16 15:11:24, INFO Start training!
04-16 15:11:24, INFO Port number: 44689
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600701 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600701 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7efd44c15d87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7efd45daed26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7efd45db227d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7efd45db2e79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank0]:[E ProcessGroupNCCL.cpp:523] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600714 milliseconds before timing out.
[rank0]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E ProcessGroupNCCL.cpp:1182] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600714 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f55840bcd87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f5585255d26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f558525927d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f5585259e79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600837 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=4, Timeout(ms)=600000) ran for 600837 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1711403388920/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f8aca5aed87 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f8acb747d26 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f8acb74b27d in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f8acb74be79 in /data/donghua/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4:
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
Hi, are the above two errors the same? I got one from a cluster and another one from the workstation. Any suggestions? Thanks!
Hi,
I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1...." problem means the RELION reconstruction was not performed properly. This happens because spIsoNet does not work with RELION5.
I still do not understand how the NCCL-related error happens.
Again, I want to confirm whether the anisotropy correction ("spisonet.py reconstruct") gives you these errors within the same environment.
Please also check whether https://github.com/IsoNet-cryoET/spIsoNet/issues/2 is related.
I can confirm that the anisotropy correction worked without any error on my 2-GPU workstation, which has the same type of GPU as the 4-GPU workstation where I got the second error above.
I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1...." problem means the RELION reconstruction was not performed properly. This happens because spIsoNet does not work with RELION5.
Is relion 5 compatibility on the roadmap? Or, for now, would you recommend setting up a separate installation of relion 4 for misalignment correction?
I now know that the "MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1...." problem means the RELION reconstruction was not performed properly. This happens because spIsoNet does not work with RELION5.
Hi, are you sure that it does not work with RELION-5? Is this a matter of spIsoNet not running at all in RELION-5, or of not giving the intended output? I have been running it through RELION-5 for initial testing before seeing this comment, and it doesn't appear to have any issues, but I can't speak to whether it is producing "correct" results.
In our hands with relion 5 it seems to run, but the unfil.mrc and corrected.mrc maps are blank, leading to a crash after one iteration. This does not happen without --external_reconstruct. I haven't tried relion 4 yet.
EDIT:
What does seem to work in relion 5 is the following: run a few iterations without --external_reconstruct, kill the refinement, then continue the refinement from the last _optimiser.star, adding in the --external_reconstruct flag. Just tried this and it seems to work, and it generates normal-looking external reconstruction volumes (I can't verify yet whether it is helping!). It also only seems to work if run in the spisonet conda env.
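The workaround above can be sketched as shell commands. The job directory (Refine3D/job012), iteration number (it005), output name, and MPI rank count are hypothetical placeholders; --continue and --external_reconstruct are standard relion_refine options.

```shell
# Sketch of the relion-5 workaround described above (paths are placeholders).
# 1) Run a few iterations WITHOUT --external_reconstruct, then kill the job.
# 2) Continue from the last optimiser file with the flag added, inside the
#    spisonet conda env (the workaround only seemed to work in that env):
conda activate spisonet
mpirun -n 3 relion_refine_mpi \
    --continue Refine3D/job012/run_it005_optimiser.star \
    --o Refine3D/job012/run_ct5 \
    --external_reconstruct
```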
EDIT2:
Scratch that, I don't think it is actually doing anything. Here is the log:
+ Making system call for external reconstruction: python /home/user/software/spIsoNet/build/lib/spIsoNet/bin/relion_wrapper.py Refine3D/job012/run_it008_half1_class001_external_reconstruct.star
iter = 008
set CUDA_VISIBLE_DEVICES=None
set CONDA_ENV=spisonet
set ISONET_WHITENING=True
set ISONET_WHITENING_LOW=10
set ISONET_RETRAIN_EACH_ITER=True
set ISONET_BETA=0.5
set ISONET_ALPHA=1
set ISONET_START_HEALPIX=3
set ISONET_ACC_BATCHES=2
set ISONET_EPOCHS=5
set ISONET_KEEP_LOWRES=False
set ISONET_LOWPASS=True
set ISONET_ANGULAR_WHITEN=False
set ISONET_3DFSD=False
set ISONET_FSC_05=False
set ISONET_FSC_WEIGHTING=True
set ISONET_START_RESOLUTION=15.0
set ISONET_KEEP_LOWRES= False
healpix = 2
symmetry = C1
mask_file = mask.mrc
pixel size = 1.125
resolution at 0.5 and 0.143 are 7.384615 and 5.538462
real limit resolution to 5.538462
+ External reconstruction finished successfully, reading result back in ...
It seemingly runs and reconstructs, but never trains a model...
EDIT 3:
Nope, it is working; it just hadn't reached fine enough angular sampling yet. Working now. One thing I notice, though: it defaults to using all GPUs. It would be better if it could default to the GPUs that have been assigned to the job in Relion (not sure if that is possible?).
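Until something like that exists, one possible workaround (a sketch; the device indices are examples, and this assumes spIsoNet inherits the standard CUDA environment variable) is to restrict which GPUs are visible before the job starts:

```shell
# Make only GPUs 0 and 1 visible to CUDA programs launched from this shell;
# child processes then cannot touch the other devices. Indices are examples;
# match them to the GPUs assigned to your RELION job.
export CUDA_VISIBLE_DEVICES=0,1
# ...then launch the RELION refinement / external reconstruction as usual.
```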
Hi All,
I used the recommended Option 1 (as follows) from the tutorial to install spIsoNet. Why can I not see the bin directory (which should contain the program spisonet.py) created directly under the spIsoNet directory after the installation?
git clone https://github.com/IsoNet-cryoET/spIsoNet.git
conda env create -f setup.yml
conda activate spisonet
Or should I just copy all the files in ~/spIsoNet/spIsoNet/bin/*.py to ~/spIsoNet/bin? Any suggestions? Thanks!
Hi,
The correct path is actually ~/spIsoNet/spIsoNet/bin/spisonet.py. All the code resides within the ~/spIsoNet/spIsoNet directory, so there is no need to move any files.
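In other words (a sketch assuming the default clone location ~/spIsoNet and the env name from the tutorial commands above):

```shell
# Run the script in place from the clone; nothing needs to be copied.
conda activate spisonet
python ~/spIsoNet/spIsoNet/bin/spisonet.py --help
```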
Thank you for your clarification!
I found that the first error above (RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable) was related to the GPU node's Compute Mode setting; it needs the Default (shared) mode (e.g., on our cluster: #SBATCH --gpu_cmode=shared). The second error above (Some NCCL operations have failed or timed out) was fixed by setting export NCCL_P2P_DISABLE=1.
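For reference, the two fixes can be sketched as follows. The Slurm directive and the environment variable come straight from this report; the surrounding job-script context is illustrative.

```shell
#!/bin/bash
# Fix 1: "CUDA-capable device(s) is/are busy or unavailable" -- the GPUs were
# not in the Default (shared) compute mode; on this reporter's Slurm cluster
# the mode is requested per job:
#SBATCH --gpu_cmode=shared

# Fix 2: NCCL allgather timeout on the multi-GPU workstation -- disable
# GPU peer-to-peer transfers before launching spIsoNet/RELION:
export NCCL_P2P_DISABLE=1
```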
Thank you for troubleshooting and reporting back.
I wonder whether one more step (pip install .) should be added as the last step of Option 3 of the Installation in spIsoNet_v1.0_Tutorial.pdf? Please confirm.
Hi Yun-Tao,
I got the following errors when using spIsoNet for external reconstruction in Relion4 or Relion5-beta. Any suggestions? Thanks!
The following warnings were encountered upon command-line parsing:
WARNING: Option --keep_lowres is not a valid RELION argument
04-15 19:26:44, INFO voxel_size 1.399999976158142
04-15 19:26:45, INFO voxel_size 1.399999976158142
04-15 19:39:35, INFO Limit resolution to 10.0 for spIsoNet 3D FSC calculation. You can also tune this paramerter with --limit_res .
04-15 19:39:35, INFO calculating fast 3DFSC, this will take few minutes
04-15 19:43:02, INFO voxel_size 1.399999976158142
04-15 19:51:23, INFO The Refine3D/job025 folder already exists, outputs will write into this folder
04-15 19:51:27, INFO voxel_size 1.399999976158142
04-15 19:51:32, INFO spIsoNet correction until resolution 10.0A! Information beyond 10.0A remains unchanged
04-15 19:54:34, INFO Start preparing subvolumes!
04-15 19:54:59, INFO Done preparing subvolumes!
04-15 19:54:59, INFO Start training!
04-15 19:55:02, INFO Port number: 45495
[W socket.cpp:464] [c10d] The server socket cannot be initialized on [::]:45495 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:697] [c10d] The client socket cannot be initialized to connect to [localhost]:45495 (errno: 97 - Address family not supported by protocol).
(the client-socket warning above is repeated eight times)
Traceback (most recent call last):
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/bin/spisonet.py", line 8, in
sys.exit(main())
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 549, in main
fire.Fire(ISONET)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/spisonet.py", line 182, in reconstruct
map_refine_n2n(halfmap1,halfmap2, mask_vol, fsc3d, alpha = alpha,beta=beta, voxel_size=voxel_size, output_dir=output_dir,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/bin/map_refine.py", line 145, in map_refine_n2n
network.train([data_dir_1,data_dir_2], output_dir, alpha=alpha,beta=beta, output_base=output_base0, batch_size=batch_size, epochs = epochs, steps_per_epoch = 1000,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 265, in train
mp.spawn(ddp_train, args=(self.world_size, self.port_number, self.model,alpha,beta,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 241, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 197, in start_processes
while not context.join():
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 158, in join
raise ProcessRaisedException(msg, error_index, failed_process.pid)
torch.multiprocessing.spawn.ProcessRaisedException:
-- Process 1 terminated with the following error:
Traceback (most recent call last):
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/multiprocessing/spawn.py", line 68, in _wrap
    fn(i, *args)
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/spIsoNet/models/network_n2n.py", line 50, in ddp_train
    model = model.cuda()
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 911, in cuda
    return self._apply(lambda t: t.cuda(device))
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 802, in _apply
    module._apply(fn)
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 825, in _apply
    param_applied = fn(param)
  File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/torch/nn/modules/module.py", line 911, in
    return self._apply(lambda t: t.cuda(device))
RuntimeError: CUDA error: CUDA-capable device(s) is/are busy or unavailable
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
Traceback (most recent call last):
  File "/home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py", line 517, in
with mrcfile.open(mrc1_cor) as d1:
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/load_functions.py", line 139, in open
return NewMrc(name, mode=mode, permissive=permissive,
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 109, in init
self._open_file(name)
File "/home/groups/kornberg/donghuac/anaconda3/envs/spisonet/lib/python3.10/site-packages/mrcfile/mrcfile.py", line 126, in _open_file
self._iostream = open(name, self._mode + 'b')
FileNotFoundError: [Errno 2] No such file or directory: 'Refine3D/job025/corrected_run_it001_half1_class001_unfil.mrc'
in: /home/groups/kornberg/donghuac/relion/src/backprojector.cpp, line 1323
ERROR:
ERROR: there was something wrong with system call: python /home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star
=== Backtrace ===
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4e6a59]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi() [0x46434e]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x1b02) [0x5222b2]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x3e9) [0x523279]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi(main+0x55) [0x4d47b5]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7ff6f5fbb555]
/home/groups/kornberg/donghuac/relion/bin/relion_refine_mpi() [0x4d805e]
ERROR: ERROR: there was something wrong with system call: python /home/groups/kornberg/donghuac/spIsoNet/spIsoNet/bin/relion_wrapper.py Refine3D/job025/run_it001_half1_class001_external_reconstruct.star
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes. You may or may not see output from other processes, depending on exactly when Open MPI kills them.