2D class exited with error on Relion/4.0-beta-2

forhadsaikot commented 2 years ago

This is a template for reporting bugs. Please fill in as much information as you can.

Describe your problem

I was trying to process some data collected with the EPU_Group_AFIS strategy on a Falcon 4 detector in EER format. I have created the optics_group.star file and used that for motioncorr and ctf. I wanted to run particle picking with Topaz, which is now integrated into Relion/4. Following the tutorial, I selected a few micrographs to autopick from, then extracted the particles and ran 2D classification. I had tried this with the dataset on Relion/3.1.3 without any problem. I was trying Relion/4 for a better picking experience. Suggestions would be highly appreciated.

Environment:

OS: [Centos Linux 7]
MPI runtime: [openmpi-x86_64]
RELION version [RELION-4.0-beta-2-commit-9b23e5]
Memory: [503.5 GB]
GPU: [RTX A6000]

Dataset:

Box size: [450 px]
Pixel size: [e.g. 0.878 Å/px]
Particles: 2145

Job options:

Type of job: [2D classification]
Number of MPI processes: [3]
Number of threads: [8]

Full command (see note.txt in the job directory):

`which relion_refine_mpi` --o Class2D/job007/run --iter 25 --i Extract/job005/particles.star --dont_combine_weights_via_disc --preread_images  --pool 30 --pad 2  --ctf  --tau2_fudge 2 --particle_diameter 220 --K 50 --flatten_solvent  --zero_mask  --center_classes  --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm --scale  --j 8 --gpu ""  --pipeline_control Class2D/job007/

Error message:

Please cite the full error message as the example below.

in: /data/uqimorti/rpmbuild/BUILD/relion-4.0/src/acc/cuda/cuda_fft.h, line 133
ERROR: 
ERROR: 
You are changing the dimension of a CUFFT-transform (plan)
This is a developer error message which you cannot fix 
through changing the run config. Either your data is broken or
an unforseen combination of options was encountered. Please report
this error, the command used and a brief description to
the relion developers at 

 github.com/3dem/relion/issues 

follower 2 encountered error: === Backtrace  ===
/usr/local/relion/4.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKSsS1_l+0x41) [0x4534b1]
/usr/local/relion/4.0/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesPvi+0x84) [0x62d134]
/usr/local/relion/4.0/bin/relion_refine_mpi() [0x62d1d5]
/lib64/libgomp.so.1(+0x16405) [0x7f9e2ff16405]
/lib64/libpthread.so.0(+0x7ea5) [0x7f9e3093fea5]
/lib64/libc.so.6(clone+0x6d) [0x7f9e2fa1ab0d]
==================
ERROR: 
You are changing the dimension of a CUFFT-transform (plan)
This is a developer error message which you cannot fix 
through changing the run config. Either your data is broken or
an unforseen combination of options was encountered. Please report
this error, the command used and a brief description to
the relion developers at 

 github.com/3dem/relion/issues 

==================

biochem-fan commented 2 years ago

(I fixed your Markdown markup)

Hmm, this is a very unusual crash. I have never seen this.

As a sanity check of your particles, can you run Class2D without GPU?

Box size: [450 px] Pixel size: [e.g. 0.878 Å/px]

This is very inefficient. For Class2D, you can (and should) down-sample to ~4 Å/px. I would use 128 px in this case, which would give about 3.09 Å/px (450 × 0.878 / 128).
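The pixel size after down-sampling follows from keeping the physical box width fixed: new pixel size = original pixel size × original box / new box. A minimal Python sketch of that arithmetic, using the box and pixel size reported in the issue; the function names are illustrative and not part of RELION:

```python
def rescaled_pixel_size(pixel_size, box, new_box):
    """Pixel size (Å/px) after rescaling a box from `box` to `new_box` pixels."""
    if box <= 0 or new_box <= 0:
        raise ValueError("box sizes must be positive")
    return pixel_size * box / new_box


def box_for_target_pixel_size(pixel_size, box, target):
    """Even box size (px) whose rescaled pixel size is closest to `target` Å/px."""
    ideal = pixel_size * box / target
    return max(2 * round(ideal / 2), 2)


if __name__ == "__main__":
    original_px, original_box = 0.878, 450  # values reported in the issue
    for new_box in (96, 128):
        px = rescaled_pixel_size(original_px, original_box, new_box)
        print(f"{new_box} px box -> {px:.2f} Å/px")
    print("box for ~4 Å/px:", box_for_target_pixel_size(original_px, original_box, 4.0))
```

The second helper rounds to an even box size only because even sizes are a common convention for FFT-based processing; that choice is an assumption of this sketch.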

forhadsaikot commented 2 years ago

Hi, thanks for your reply. 450 px was a typo, sorry. I actually used a particle box size of 256 px during extraction. I am trying with 128 now.

2D classification (with particles extracted using a 256 px box size) without GPU got stuck with the following error:

[scmb-9x1vsk3:200116] Process received signal
[scmb-9x1vsk3:200116] Signal: Aborted (6)
[scmb-9x1vsk3:200116] Signal code: (-6)
[scmb-9x1vsk3:200116] [ 0] /lib64/libpthread.so.0(+0xf630)[0x7ff7478b0630]
[scmb-9x1vsk3:200116] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7ff7468bb387]
[scmb-9x1vsk3:200116] [ 2] /lib64/libc.so.6(abort+0x148)[0x7ff7468bca78]
[scmb-9x1vsk3:200116] [ 3] /lib64/libc.so.6(+0x78f67)[0x7ff7468fdf67]
[scmb-9x1vsk3:200116] [ 4] /lib64/libc.so.6(+0x7f474)[0x7ff746904474]
[scmb-9x1vsk3:200116] [ 5] /usr/local/relion/4.0/bin/relion_refine_mpi(_ZN9Projector14initialiseDataEi+0x7d9)[0x49c9c9]
[scmb-9x1vsk3:200116] [ 6] /usr/local/relion/4.0/bin/relion_refine_mpi(_ZN9Projector9initZerosEi+0x9)[0x49d4b9]
[scmb-9x1vsk3:200116] [ 7] /usr/local/relion/4.0/bin/relion_refine_mpi(_ZN9Projector26computeFourierTransformMapER13MultidimArrayIdES2_iibbiPKS1_b+0x88c)[0x49dd5c]
[scmb-9x1vsk3:200116] [ 8] /usr/local/relion/4.0/bin/relion_refine_mpi(_ZN7MlModel23setFourierTransformMapsEbidPK13MultidimArrayIdE+0x743)[0x5cb9a3]
[scmb-9x1vsk3:200116] [ 9] /usr/local/relion/4.0/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5e)[0x5e6f5e]
[scmb-9x1vsk3:200116] [10] /usr/local/relion/4.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi11expectationEv+0x45f)[0x4725cf]
[scmb-9x1vsk3:200116] [11] /usr/local/relion/4.0/bin/relion_refine_mpi(_ZN14MlOptimiserMpi7iterateEv+0x1a3)[0x47ecd3]
[scmb-9x1vsk3:200116] [12] /usr/local/relion/4.0/bin/relion_refine_mpi(main+0x5f)[0x43deaf]
[scmb-9x1vsk3:200116] [13] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7ff7468a7555]
[scmb-9x1vsk3:200116] [14] /usr/local/relion/4.0/bin/relion_refine_mpi[0x441b5e]
[scmb-9x1vsk3:200116] End of error message

Is this an issue with our workstation? Please comment. I'm a novice in this field and have been stuck for a while with this dataset, which was collected with the EPU group AFIS strategy. After I created the optics_group.star file, I ran the following commands for motioncorr and ctffind, respectively:

`which relion_run_motioncorr_mpi` --i movies.star --o MotionCorr/job001/ --first_frame_sum 1 --last_frame_sum -1 --use_own --j 1 --bin_factor 1 --bfactor 150 --dose_per_frame 0.887 --preexposure 0 --patch_x 5 --patch_y 5 --eer_grouping 36 --gainref 20220215_155605_EER_GainReference.gain --gain_rot 0 --gain_flip 0 --dose_weighting --save_noDW --grouping_for_ps 5 --pipeline_control MotionCorr/job001/

`which relion_run_ctffind_mpi` --i MotionCorr/job001/corrected_micrographs.star --o CtfFind/job002/ --Box 512 --ResMin 30 --ResMax 5 --dFMin 5000 --dFMax 50000 --FStep 500 --dAst 100 --use_noDW --ctffind_exe ../../../../usr/local/bin/ctffind --ctfWin -1 --is_ctffind4 --fast_search --use_given_ps --pipeline_control CtfFind/job002/

Could you kindly comment if I'm doing this properly? I really need help.

Many thanks, Forhad.

Okay, so after extracting with a 128 px box, I ran 2D classification again, with and without GPU. Both runs stopped after the first iteration with errors similar to the ones I reported.
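The mangled C++ symbols in the CPU backtrace above can be turned into readable names to see where the run aborted. The sketch below is a deliberately partial reader for Itanium-style mangled names (it only decodes the length-prefixed class and function names); the helper names are hypothetical, and a complete tool such as `c++filt` would be the usual choice where it is installed:

```python
import re

# Frames look like ".../relion_refine_mpi(_ZN9Projector9initZerosEi+0x9)[0x49d4b9]".
FRAME = re.compile(r"\((_Z\w+)\+0x[0-9a-f]+\)")


def qualified_name(mangled):
    """Return the 'Class::method' part of an Itanium-mangled symbol.

    Only plain length-prefixed names are read; argument types, templates
    and constructors are ignored, so this is not a full demangler.
    """
    body = mangled[2:]              # drop the leading "_Z"
    if body.startswith("N"):        # "N ... E" wraps a nested (qualified) name
        body = body[1:]
    parts, pos = [], 0
    while pos < len(body) and body[pos].isdigit():
        end = pos
        while end < len(body) and body[end].isdigit():
            end += 1
        length = int(body[pos:end])
        parts.append(body[end:end + length])
        pos = end + length
    return "::".join(parts)


def relion_frames(trace):
    """List the readable names of all mangled frames, innermost first."""
    return [qualified_name(symbol) for symbol in FRAME.findall(trace)]


if __name__ == "__main__":
    # Three frames copied from the CPU backtrace above.
    trace = (
        "[ 5] /usr/local/relion/4.0/bin/relion_refine_mpi(_ZN9Projector14initialiseDataEi+0x7d9)[0x49c9c9] "
        "[ 7] /usr/local/relion/4.0/bin/relion_refine_mpi(_ZN9Projector26computeFourierTransformMapER13MultidimArrayIdES2_iibbiPKS1_b+0x88c)[0x49dd5c] "
        "[ 9] /usr/local/relion/4.0/bin/relion_refine_mpi(_ZN11MlOptimiser16expectationSetupEv+0x5e)[0x5e6f5e]"
    )
    for name in relion_frames(trace):
        print(name)
```

Read this way, the innermost RELION frames are `Projector::initialiseData` and `Projector::computeFourierTransformMap`, reached from `MlOptimiser::expectationSetup`.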

biochem-fan commented 2 years ago

I have created the optics_group.star file

How did you make it? I suspect you made some mistakes here.

As long as AFIS is calibrated properly, you don't have to group micrographs by beam shift directions. Can you try without grouping? That is, import all movies by the Import job and use movies.star as is.

--save_noDW

This is a waste of time; use power spectra from motion correction instead.

I'm a novice in this field

Did you go through the RELION 4.0 tutorial? Even if you have used RELION 3.1 before, please re-do RELION 4.0's tutorial because it explains new features and new recommendations. If you can complete the tutorial, your machine is probably OK.

forhadsaikot commented 2 years ago

How did you make it? I suspect you made some mistakes here.

I made it using Python scripts written by Dustin Morado (https://github.com/DustinMorado/EPU_group_AFIS). People from where we collected this data do this regularly, so I believe the AFIS may not be calibrated (not sure though). I contacted Dustin about this, and he confirmed that movies.star with the beam shift info was written correctly.

This is a waste of time; use power spectra from motion correction instead.

Thanks for that, will do this next time.

Did you go through the RELION 4.0 tutorial?

Yes, in fact I was following this step by step, when I started getting this error :( However, I did these steps with Relion/3 without any issue. Particle picking was pretty awful, so I decided to use the integrated Topaz picking in Relion/4, and got stuck :(

What do you recommend next? Thank you for your support.

biochem-fan commented 2 years ago

Hmm, I have no idea what is happening. Both GPU and CPU versions are failing when setting up FFT. So my first guess is an inconsistent entry in the optics group table, such as the box size and image dimension. But the fact that the same STAR file worked fine in RELION 3.1 suggests something else...
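One way to test the "inconsistent optics table" guess is to read the `data_optics` block of the particle STAR file and compare each group's image size and dimensionality with what the Extract job should have written. The following is a minimal sketch with a hand-rolled parser; the sample table is made up for illustration (it is not the reporter's file), and `read_star_loop` and `check_optics` are hypothetical helper names:

```python
SAMPLE = """\
# version 30001

data_optics

loop_
_rlnOpticsGroupName #1
_rlnOpticsGroup #2
_rlnImagePixelSize #3
_rlnImageSize #4
_rlnImageDimensionality #5
opticsGroup1 1 0.878 128 2
opticsGroup2 2 0.878 256 2

data_particles
"""


def read_star_loop(text, block_name):
    """Return the rows of the loop_ table in `data_<block_name>` as dicts."""
    labels, rows = [], []
    in_block = in_loop = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("data_"):
            if in_block:            # the next block starts: stop reading
                break
            in_block = line == "data_" + block_name
            continue
        if not in_block:
            continue
        if line == "loop_":
            in_loop = True
        elif in_loop and line.startswith("_"):
            labels.append(line.split()[0].lstrip("_"))
        elif in_loop:
            rows.append(dict(zip(labels, line.split())))
    return rows


def check_optics(rows, expected_box):
    """Return (group, label, value) for entries that differ from the expectation."""
    problems = []
    for row in rows:
        group = row.get("rlnOpticsGroupName", "?")
        if row.get("rlnImageSize") != str(expected_box):
            problems.append((group, "rlnImageSize", row.get("rlnImageSize")))
        if row.get("rlnImageDimensionality") != "2":
            problems.append((group, "rlnImageDimensionality",
                             row.get("rlnImageDimensionality")))
    return problems


if __name__ == "__main__":
    for problem in check_optics(read_star_loop(SAMPLE, "optics"), expected_box=128):
        print(problem)
```

To run it on a real job, replace `SAMPLE` with the contents of the particle STAR file (for example `open("Extract/job019/particles.star").read()`) and pass the box size used at extraction. The sketch assumes every group from a single Extract job should carry that box size and a dimensionality of 2.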

People from where we collected this data do this regularly,

This does not necessarily mean AFIS calibration is bad. Ask them if they actually see different beam tilts per group.

forhadsaikot commented 2 years ago

But the fact that the same STAR file worked fine in RELION 3.1 suggests something else...

I didn't save the non-dose weighted on Relion/3. Also, how important is it to save the sum of power spectra during motioncorr?

Ask them if they actually see different beam tilts per group.

Sure thing. They just told me, the file names don't have any information to help correct higher order optics aberration, that's why they supplied the additional xml files so we can treat each beam shift group separately and I then wrote the optics_group.star file. It was collected on a Glacios with Falcon 4.

biochem-fan commented 2 years ago

I didn't save the non-dose weighted on Relion/3.

Good. Don't save in RELION 4, too.

Also, how important is it to save the sum of power spectra during motioncorr?

This does not make a big difference, but it makes CTFFIND faster.

forhadsaikot commented 2 years ago

Good. Don't save in RELION 4, too

Thanks, I tried. But that error persists.

These were my corresponding job commands. Please advise if I'm doing anything wrong?

Select: `which relion_star_handler` --i CtfFind/job016/micrographs_ctf.star --o Select/job017/micrographs.star --split --size_split 20 --pipeline_control Select/job017/

Autopick: `which relion_autopick_mpi` --i Select/job017/micrographs_split1.star --odir AutoPick/job018/ --pickname autopick --LoG --LoG_diam_min 180 --LoG_diam_max 200 --shrink 0 --lowpass 20 --LoG_adjust_threshold 0 --LoG_upper_threshold 1.5 --pipeline_control AutoPick/job018/

Extract: `which relion_preprocess_mpi` --i CtfFind/job016/micrographs_ctf.star --coord_list AutoPick/job018/autopick.star --part_star Extract/job019/particles.star --part_dir Extract/job019/ --extract --extract_size 128 --norm --bg_radius 48 --white_dust -1 --black_dust -1 --invert_contrast --pipeline_control Extract/job019/

Class 2D: `which relion_refine_mpi` --o Class2D/job020/run --iter 25 --i Extract/job019/particles.star --dont_combine_weights_via_disc --preread_images --pool 30 --pad 2 --ctf --tau2_fudge 2 --particle_diameter 220 --K 50 --flatten_solvent --zero_mask --center_classes --oversampling 1 --psi_step 12 --offset_range 5 --offset_step 2 --norm --scale --j 8 --gpu "" --pipeline_control Class2D/job020/
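A quick sanity check that can be done on the numbers in these commands is to compare the physical box width with the mask diameter. The sketch below assumes the particles are still at the 0.878 Å/px reported at the top of the issue, that is, that no rescaling was applied (no rescaling option appears in the Extract command shown). It is only arithmetic on values from this thread, not a diagnosis confirmed by the developers, and the function names are illustrative:

```python
def box_width_angstrom(box_px, pixel_size):
    """Physical width of a square box in Å."""
    return box_px * pixel_size


def mask_fits(box_px, pixel_size, particle_diameter):
    """True if a circular mask of `particle_diameter` Å fits inside the box."""
    return particle_diameter <= box_width_angstrom(box_px, pixel_size)


if __name__ == "__main__":
    pixel_size = 0.878         # Å/px, as reported in the issue
    particle_diameter = 220    # Å, from --particle_diameter
    for box_px in (128, 256):  # the two --extract_size values tried
        width = box_width_angstrom(box_px, pixel_size)
        verdict = "fits" if mask_fits(box_px, pixel_size, particle_diameter) else "does not fit"
        print(f"{box_px} px box = {width:.1f} Å wide: a {particle_diameter} Å mask {verdict}")
```

Under these assumptions a 128 px box is about 112 Å wide, which is smaller than the 220 Å mask, while a 256 px box (about 225 Å) only just contains it.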

biochem-fan commented 2 years ago

Yes, in fact I was following this step by step, when I started getting this error

Are you following the tutorial with the tutorial data (beta-galactosidase) or your own data? If the latter, please first test with the beta-galactosidase data.

biochem-fan commented 10 months ago

No response for many months. Closing.

3dem / relion

2D class exited with error on Relion/4.0-beta-2 #889