Open wlugmayr opened 4 months ago
Hi
I wonder whether this error will appear when you add --solvent_correct_fsc into the command
Hi,
Yes, now it gets as far as iteration 5:
srun --mpi=pmi2 `which relion_refine_mpi` \
--o Refine3D/job001/run --auto_refine --split_random_halves --i job025_tutorial.star --ref HA_reference.mrc --firstiter_cc --ini_high 10 --dont_combine_weights_via_disc --pool 3 --pad 2 --ctf --particle_diameter 170 --flatten_solvent --zero_mask --solvent_mask mask.mrc --oversampling 1 --healpix_order 2 --auto_local_healpix_order 3 --offset_range 5 --offset_step 2 --sym C3 --low_resol_join_halves 40 --norm --scale --j 1 --gpu "" --external_reconstruct --keep_lowres --solvent_correct_fsc --pipeline_control Refine3D/job001/
File "/gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64]] is at version 4; expected version 3 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
FileNotFoundError: [Errno 2] No such file or directory: 'Refine3D/job001/corrected_run_it005_half1_class001_unfil.mrc'
in: /gpfs/cssb/software/tmp/install/relion-4.0.1/src/backprojector.cpp, line 1294
I installed torch like: pip install torch --index-url https://download.pytorch.org/whl/cu118
Hi,
I have also run into this problem. It occurs because the data have to pass through the same network more than once. I do not know the exact solution yet. What I currently observe is the following (possibly not correct):
Yes, with multiple GPUs it is working now. At the beginning I did not specify CUDA_VISIBLE_DEVICES and got an error, so I set it to CUDA_VISIBLE_DEVICES=0. Now that I set it, e.g. on a 4-GPU node, to CUDA_VISIBLE_DEVICES=0 1 2 3, the program runs without error to the end in Relion4 and Relion5 (for both I used the full path to python to avoid clashes with the Relion5 conda dependencies). In environment-modules style:
setenv RELION_EXTERNAL_RECONSTRUCT_EXECUTABLE {/gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/bin/python /gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/lib/python3.10/site-packages/spIsoNet/bin/relion_wrapper.py}
setenv CONDA_ENV spisonet-1.0.0
setenv CUDA_VISIBLE_DEVICES {0 1 2 3}
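As a pre-flight check, one could count how many devices CUDA_VISIBLE_DEVICES names before submitting the job. The following is only a sketch: the helper names and the `minimum=2` threshold are hypothetical, and it assumes the ids are separated by spaces and/or commas. In this thread the run failed with one device and completed with four; whether more than one GPU is strictly required has not been confirmed.

```python
import re


def visible_device_ids(value):
    """Split a CUDA_VISIBLE_DEVICES-style string into a list of device ids.

    Assumption: ids are separated by spaces and/or commas.
    """
    return [token for token in re.split(r"[,\s]+", value.strip()) if token]


def enough_gpus(value, minimum=2):
    """Return True if the string names at least `minimum` devices.

    The default of 2 is a hypothetical threshold, not a documented requirement.
    """
    return len(visible_device_ids(value)) >= minimum
```

For example, `enough_gpus("0")` is False while `enough_gpus("0 1 2 3")` is True.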
Why do you write in your documentation that spIsoNet does not work with Relion5? Is the output mrc wrong?
If you can run it through Relion5, that is great. The documentation says that spIsoNet does not work with Relion5 because of clashes with the conda environment or Blush. It would be great if you could share the details of which environment needs to be set for Relion5. Does one need to deactivate the Relion5 conda environment and use spIsoNet's instead?
Well the solution is quite simple:
The trick is to provide the full path to the python executable to spIsoNet. Here are some tests:
$ which python
/gpfs/cssb/software/rhel9/anaconda3/envs/relionconda-5.0.1/bin/python
$ /gpfs/cssb/software/rhel9/anaconda3/envs/relionconda-5.0.1/bin/python -m pip list | grep blush
relion-blush 0.0.1
$ /gpfs/cssb/software/rhel9/anaconda3/envs/relionconda-5.0.1/bin/python -m pip list | grep spisonet
$ /gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/bin/python -m pip list | grep blush
$ /gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/bin/python -m pip list | grep spisonet
spIsoNet 1.0
The dedicated python executable knows its packages so there should be no clashes between different conda environments. For the spIsoNet wrapper you do not have to activate the spIsoNet conda.
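The same check can be scripted. This is a minimal sketch, not part of spIsoNet or RELION: the helper name `interpreter_has_module` is made up for illustration, and it asks a given interpreter whether it can locate a module by its import name (which can differ from the pip distribution name).

```python
import subprocess
import sys


def interpreter_has_module(python_path, module_name):
    """Return True if the interpreter at `python_path` can locate `module_name`.

    The probe runs in a child process, so the answer reflects that
    interpreter's own site-packages, not those of the calling process.
    """
    probe = (
        "import importlib.util, sys; "
        "sys.exit(0 if importlib.util.find_spec(sys.argv[1]) else 1)"
    )
    result = subprocess.run(
        [python_path, "-c", probe, module_name],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0
```

With the paths from this thread one would call it as `interpreter_has_module("/gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/bin/python", "spIsoNet")`. The import name `spIsoNet` is inferred from the `site-packages/spIsoNet` directory shown above.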
So instead of setting (which will end up using the Relion5 python):
export RELION_EXTERNAL_RECONSTRUCT_EXECUTABLE='python /fullpath_to_spisonet_wrapper/relion_wrapper.py'
you set:
export RELION_EXTERNAL_RECONSTRUCT_EXECUTABLE='/fullpath_to_spisonet_python/python /fullpath_to_spisonet_wrapper/relion_wrapper.py'
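If the variable is generated by a script, a small guard can refuse a bare `python`. This is a sketch under the assumption that the value is simply the interpreter path and the wrapper path joined by a space, as in the lines above. The function name is hypothetical.

```python
import os


def external_reconstruct_command(python_path, wrapper_path):
    """Build a value for RELION_EXTERNAL_RECONSTRUCT_EXECUTABLE.

    Both paths must be absolute, so that the interpreter is never
    resolved through PATH (which could pick up the Relion5 python).
    """
    for label, path in (("python", python_path), ("wrapper", wrapper_path)):
        if not os.path.isabs(path):
            raise ValueError(f"{label} path must be absolute: {path!r}")
    return f"{python_path} {wrapper_path}"
```

Passing `"python"` as the first argument raises a `ValueError` instead of silently producing the first, problematic form.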
In the Relion GUI I have set Reference -> Use Blush regularisation? -> No, and the job technically runs to the end, generating an mrc output file.
Here is my command line:
srun --mpi=pmi2 `which relion_refine_mpi` \
--o Refine3D/job001/run --auto_refine --split_random_halves --i job025_tutorial.star --ref HA_reference.mrc --firstiter_cc --ini_high 10 --dont_combine_weights_via_disc --pool 3 --pad 2 --ctf --particle_diameter 170 --flatten_solvent --zero_mask --solvent_mask mask.mrc --oversampling 1 --healpix_order 2 --auto_local_healpix_order 3 --offset_range 5 --offset_step 2 --sym C3 --low_resol_join_halves 40 --norm --scale --j 1 --gpu "" --external_reconstruct --keep_lowres --pipeline_control Refine3D/job001/
Here are parts of the run.out:
Expectation iteration 1
7.45/7.43 min ............................................................~~(,_,">
Averaging half-reconstructions up to 40 Angstrom resolution to prevent diverging orientations ...
Note that only for higher resolutions the FSC-values are according to the gold-standard!
Calculating gold-standard FSC ...
Maximization ...
Making system call for external reconstruction: /gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/bin/python /gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/lib/python3.10/site-packages/spIsoNet/bin/relion_wrapper.py Refine3D/job001/run_it001_half1_class001_external_reconstruct.star
iter = 001
set CUDA_VISIBLE_DEVICES=0
set CONDA_ENV=spisonet-1.0.0
set ISONET_WHITENING=True
set ISONET_WHITENING_LOW=10
set ISONET_RETRAIN_EACH_ITER=True
set ISONET_BETA=0.5
set ISONET_ALPHA=1
set ISONET_START_HEALPIX=3
set ISONET_ACC_BATCHES=2
set ISONET_EPOCHS=5
set ISONET_KEEP_LOWRES=False
set ISONET_LOWPASS=True
set ISONET_ANGULAR_WHITEN=False
set ISONET_3DFSD=False
set ISONET_FSC_05=False
set ISONET_FSC_WEIGHTING=True
set ISONET_START_RESOLUTION=15.0
set ISONET_KEEP_LOWRES= True
healpix = 2
symmetry = C3
mask_file = mask.mrc
pixel size = 1.309998
resolution at 0.5 and 0.143 are 999.0 and 999.0
real limit resolution to 10.0
RELION version: 4.0.1
exiting with an error ...
And here is the run.err:
The following warnings were encountered upon command-line parsing:
WARNING: Option --keep_lowres is not a valid RELION argument
Traceback (most recent call last):
File "/gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/lib/python3.10/site-packages/spIsoNet/bin/relion_wrapper.py", line 362, in
shutil.copy(mrc_unfil, mrc_unfil_backup)
File "/gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/lib/python3.10/shutil.py", line 417, in copy
copyfile(src, dst, follow_symlinks=follow_symlinks)
File "/gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/lib/python3.10/shutil.py", line 254, in copyfile
with open(src, 'rb') as fsrc:
FileNotFoundError: [Errno 2] No such file or directory: 'Refine3D/job001/run_it001_half1_class001_unfil.mrc'
in: /gpfs/cssb/software/tmp/install/relion-4.0.1/src/backprojector.cpp, line 1294
ERROR:
ERROR: there was something wrong with system call: /gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/bin/python /gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/lib/python3.10/site-packages/spIsoNet/bin/relion_wrapper.py Refine3D/job001/run_it001_half1_class001_external_reconstruct.star
=== Backtrace ===
/gpfs/cssb/software/rhel9/x86_64/relion/4.0.1/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x69) [0x4c7eb9]
/gpfs/cssb/software/rhel9/x86_64/relion/4.0.1/bin/relion_refine_mpi() [0x44f710]
/gpfs/cssb/software/rhel9/x86_64/relion/4.0.1/bin/relion_refine_mpi(_ZN14MlOptimiserMpi12maximizationEv+0x17dc) [0x4ffb4c]
/gpfs/cssb/software/rhel9/x86_64/relion/4.0.1/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x482) [0x500b52]
/gpfs/cssb/software/rhel9/x86_64/relion/4.0.1/bin/relion_refine_mpi(main+0x59) [0x4b6a49]
/lib64/libc.so.6(+0x3feb0) [0x14a1bf43feb0]
/lib64/libc.so.6(__libc_start_main+0x80) [0x14a1bf43ff60]
/gpfs/cssb/software/rhel9/x86_64/relion/4.0.1/bin/relion_refine_mpi(_start+0x25) [0x4b9ba5]
ERROR: ERROR: there was something wrong with system call: /gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/bin/python /gpfs/cssb/software/rhel9/anaconda3/envs/spisonet-1.0.0/lib/python3.10/site-packages/spIsoNet/bin/relion_wrapper.py Refine3D/job001/run_it001_half1_class001_external_reconstruct.star
$ find Refine3D
Refine3D
Refine3D/job001
Refine3D/job001/run_it000_half2_class001_angdist.bild
Refine3D/job001/run.err
Refine3D/job001/default_pipeline.star
Refine3D/job001/run_it001_half2_class001_external_reconstruct.star
Refine3D/job001/run_it000_sampling.star
Refine3D/job001/run_it001_half1_class001_external_reconstruct_data_real.mrc
Refine3D/job001/run_it001_half2_class001_external_reconstruct_data_real.mrc
Refine3D/job001/run.out
Refine3D/job001/run_it001_half2_class001_external_reconstruct_weight.mrc
Refine3D/job001/run_it000_half1_model.star
Refine3D/job001/run_it000_optimiser.star
Refine3D/job001/run_it000_half1_class001.mrc
Refine3D/job001/run_it001_half1_class001_external_reconstruct.star
Refine3D/job001/run_it000_half2_class001.mrc
Refine3D/job001/.run.err.tail
Refine3D/job001/.run.out.tail
Refine3D/job001/run_submit.script
Refine3D/job001/job_pipeline.star
Refine3D/job001/job.star
Refine3D/job001/run_it000_half2_model.star
Refine3D/job001/run_it001_half1_class001_external_reconstruct_data_imag.mrc
Refine3D/job001/run_it000_data.star
Refine3D/job001/run_it001_half2_class001_external_reconstruct_data_imag.mrc
Refine3D/job001/run_it001_half1_class001_external_reconstruct_weight.mrc
Refine3D/job001/note.txt
Refine3D/job001/run_it000_half1_class001_angdist.bild
Refine3D/job001/RELION_JOB_EXIT_FAILURE
Refine3D/job001/run_it001_half1_class001_external_reconstruct.mrc
Refine3D/job001/run_it001_half2_class001_external_reconstruct.mrc
Refine3D/spisonet
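The traceback shows the wrapper copying `run_it001_half1_class001_unfil.mrc`, and the listing above contains no such file. A quick way to confirm this for a given job is sketched below. The naming rule (swap the `_external_reconstruct.star` suffix for `_unfil.mrc`) is inferred only from the pair of paths in this run.err, not from the spIsoNet source, and the function names are hypothetical. It does not cover the `corrected_` prefix seen in the iteration 5 error.

```python
import os

STAR_SUFFIX = "_external_reconstruct.star"


def expected_unfil_path(star_path):
    """Derive the unfiltered-map path that the wrapper appears to copy.

    Assumption: the name is the star file name with its suffix replaced
    by "_unfil.mrc", as seen in the traceback of this thread.
    """
    if not star_path.endswith(STAR_SUFFIX):
        raise ValueError(f"not an external_reconstruct star file: {star_path!r}")
    return star_path[: -len(STAR_SUFFIX)] + "_unfil.mrc"


def unfil_is_missing(star_path):
    """Return True if the derived unfiltered map does not exist on disk."""
    return not os.path.isfile(expected_unfil_path(star_path))
```

For the job above, `unfil_is_missing("Refine3D/job001/run_it001_half1_class001_external_reconstruct.star")` would return True, matching the FileNotFoundError.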