3dem / relion

Image-processing software for cryo-electron microscopy
https://relion.readthedocs.io/en/latest/
GNU General Public License v2.0

Copy to scratch #820

Open frozenfas opened 3 years ago

frozenfas commented 3 years ago

Hello. We are having a strange problem with a Refine3D job whose cause I cannot track down. We are running RELION 4.0 (RELION version: 4.0-beta-1-commit-39b2fb, Precision: BASE=double) with this command:

```
`which relion_refine_mpi` --o Refine3D/job091/run --auto_refine --split_random_halves --i Extract/job085/particles.star --ref new_run_it025_class001.mrc --ini_high 40 --dont_combine_weights_via_disc --scratch_dir /scratch/data-xhan/ --pool 30 --pad 2 --skip_gridding --auto_ignore_angles --auto_resol_angles --ctf --particle_diameter 250 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym C1 --low_resol_join_halves 40 --norm --scale --j 6 --gpu "" --pipeline_control Refine3D/job091/
```

And we get this error message:

```
No protocol specified

 fn_img= 593201@/scratch/data-xhan/relion_volatile/opticsgroup2_particles.mrcs
 img_id= 0 adaptive_fraction= 0.999
 min_diff2= 3.40282e+38
Dumped data: error_dump_pdf_orientation, error_dump_pdf_orientation and error_dump_unsorted.
in: /home/xhan/local/relion/src/acc/acc_ml_optimiser_impl.h, line 1858
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

in: /home/xhan/local/relion/src/acc/acc_ml_optimiser_impl.h, line 1858
ERROR: 
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

follower 1 encountered error: === Backtrace  ===
/home/xhan/local/relion-4.0/bin/relion_refine_mpi(_ZN11RelionErrorC1ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x7b) [0x560cc1c93adb]
/home/xhan/local/relion-4.0/bin/relion_refine_mpi(+0x78d60) [0x560cc1c76d60]
/home/xhan/local/relion-4.0/bin/relion_refine_mpi(+0x283a1d) [0x560cc1e81a1d]
/lib/x86_64-linux-gnu/libgomp.so.1(+0x1a78e) [0x7f096016178e]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x9609) [0x7f09604d6609]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7f096005a293]
==================
ERROR: 
No orientation was found as better than any other.

A particle image was compared to the reference and resulted in all-zero
weights (for all orientations). This should not happen, unless your data
has very special characteristics. This has historically happened for some 
lower-precision calculations, but multiple fallbacks have since been 
implemented. Please report this error to the relion developers at 

             github.com/3dem/relion/issues  

--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
```

If I view the indicated image it is blank, and relion_image_handler reports stats consistent with this (i.e. min/max value = 0). The original image in the Extract directory is fine. The weird thing is that I realized the scratch directory was 100% full, although there was no indication in run.log that there was an issue:

```
Running CPU instructions in double precision. 
 + On host slinky: free scratch space = 438.547 Gb.
 Copying particles to scratch directory: /scratch/data-xhan/relion_volatile/
000/??? sec ~~(,_,">                                                          [oo]

 For optics_group 1, there are 793035 particles on the scratch disk.
 For optics_group 2, there are 598009 particles on the scratch disk.
 Estimating initial noise spectra 
000/??? sec ~~(,_,">                                                          [oo]
```

The funny thing is that the log reports free scratch space = 438.547 Gb, but if I run `du -sh Extract/job085` I see that it is only 211 GB, so I should not be running out of space. Using mrcfile I checked the headers of the original images and of those copied to scratch: the files in the Extract directory have mode = 12 (float16) as expected, but the ones in the scratch directory have mode = 2 (float32). That would double the size of the stack on scratch, right, and explain why I am running out of space? Is it necessary to have float32 images in the scratch stack, or can I specify float16 there as well?
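
For reference, a minimal sketch of this header check using the `mrcfile` package; the Extract path is a hypothetical example, while the scratch path is the one from the error message above:

```python
# Compare the MRC mode of an original stack and its scratch copy.
import mrcfile

paths = [
    "Extract/job085/stack.mrcs",  # hypothetical original (mode 12 = float16)
    "/scratch/data-xhan/relion_volatile/opticsgroup2_particles.mrcs",
]
for path in paths:
    with mrcfile.open(path, permissive=True, header_only=True) as mrc:
        # MRC2014 data modes: 2 = float32, 12 = float16
        print(path, "mode =", int(mrc.header.mode))
```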

biochem-fan commented 3 years ago

Unfortunately the current version of RELION always writes scratch in float32, in case the input dataset is a mixture of float16 and float32 particles. Clearly this is suboptimal, but the fix has to wait for a more fundamental refactoring of the scratch system.
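
A back-of-envelope sketch of what that means for scratch usage; the particle counts come from the run.log above, while the box size is a made-up example:

```python
# Rough scratch-size estimate: float32 needs twice the bytes of float16.
n_particles = 793035 + 598009  # optics groups 1 and 2, from run.log above
box = 256                      # hypothetical box size in pixels
for dtype_name, itemsize in (("float16", 2), ("float32", 4)):
    gb = n_particles * box * box * itemsize / 1e9
    print(f"{dtype_name}: ~{gb:.0f} GB")
```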

biochem-fan commented 3 years ago

The program should have stopped copying when the free space went below 10 GB (`--keep_free_scratch 10` is the default). I don't know why that didn't work.

Note that this code assumes that (1) no other programs are writing to the disk simultaneously and (2) all particles have the same dimensions. Again, fixing this will be part of the scratch system refactoring.
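
For illustration only (this is not RELION's actual code), a one-shot check of this kind fails exactly when those two assumptions are violated:

```python
# Illustrative sketch of an up-front free-space check before copying.
import shutil

def fits_on_scratch(scratch_dir, n_particles, bytes_per_particle,
                    keep_free_gb=10):
    """Decide before copying whether the whole stack will fit."""
    free = shutil.disk_usage(scratch_dir).free
    # Assumes all particles have the same size, and that nothing else
    # writes to the disk while the copy runs; if either assumption breaks,
    # this estimate goes stale and the disk can still fill up.
    needed = n_particles * bytes_per_particle
    return needed + keep_free_gb * 1024**3 <= free
```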

frozenfas commented 3 years ago

Thanks so much for responding so quickly, biochem-fan. I am also not sure why it did not leave the remaining particles in place. I don't think anything else was running, but I will try again to see if it is reproducible.

I guess another short-term workaround would be to manually copy the input stacks to scratch and edit the STAR file to point to them?

biochem-fan commented 3 years ago

> I guess another short-term workaround would be to manually copy the input stacks to scratch and edit the STAR file to point to them?

Yes. This topic was discussed on the CCPEM mailing list recently.
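
One way to script that workaround, as a sketch assuming the third-party `starfile` Python package and a RELION-3.1-style STAR file with optics and particles blocks (all paths are examples):

```python
# Copy the .mrcs stacks to scratch yourself, then rewrite rlnImageName so
# the particle table points at the copies. Paths are hypothetical examples.
import starfile

star = starfile.read("Extract/job085/particles.star")
particles = star["particles"]
particles["rlnImageName"] = particles["rlnImageName"].str.replace(
    "Extract/job085/", "/scratch/data-xhan/Extract/job085/", regex=False
)
starfile.write(star, "particles_on_scratch.star")
```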

frozenfas commented 3 years ago

I remember. Thanks for the help/clarifications today :)