3D auto-refine: mpirun exited on signal 11 (Segmentation fault)

bforsbe commented 8 years ago

Originally reported by: Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown)

Hi!

It seems that my problems with MPI are still not over. Now I have a segmentation fault when I try to run a 3D auto-refine job. I have attached a screenshot for you to see, in case it helps.

I am probably completely wrong, but since I am using a single GTX 980 with only 4 GB of RAM, could it be that the program is pointing to a position in memory that does not exist?

Thanks!

Alberto

Bitbucket: https://bitbucket.org/tcblab/relion2-beta/issue/59

bforsbe commented 7 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

Hi Bjoern,

It is working fine in my hands now :)

bforsbe commented 7 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

@iiciieii Is this issue resolvable or is there anything left of it?

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

The great news is that everything is documented now :)

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

I think this might be the same as issue #53, for which I believe I just found the solution. It was tricky to get a reasonable indication of where the error was encountered. This is often the case with mpirun, since multiple processes are in flight at the same time, and they can kill and silence each other through kill signals in very messy ways.

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

Working now :) fingers crossed!

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

try --gpu 0:0 (note the colon, not comma)
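For context, RELION's `--gpu` device string uses colons to separate MPI ranks and commas to separate threads within a rank. The sketch below only illustrates that reading of the string. `describe_gpu_arg` is a hypothetical helper, not a RELION tool, and the slave numbering it prints assumes that the master (rank 0) takes no device, as the setup output in this thread suggests.

```shell
# Hypothetical helper (not part of RELION): show how a --gpu device string
# splits into per-rank groups (colon-separated) and per-thread devices
# (comma-separated). Assumes the first group belongs to slave 1.
describe_gpu_arg() {
    local arg="$1"
    local rank=1
    local group
    local old_ifs="$IFS"
    IFS=':'                      # split the string on colons only
    for group in $arg; do
        echo "slave $rank: devices $(echo "$group" | tr ',' ' ')"
        rank=$((rank + 1))
    done
    IFS="$old_ifs"               # restore normal word splitting
}
```

Calling `describe_gpu_arg "0:0"` lists device 0 once for each of the two slaves, whereas a comma (`0,0`) would be read as two threads of a single slave.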

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

Oops! I have used the command now (I hope it is fine this time!)

#!c++

[iiciieii@rekiem-t5400 betagal]$ mpirun --mca orta_base_help_aggregate 0 -n 3 relion_refine_mpi --o Refine3D/job071/run --auto_refine --split_random_halves --i Select/Selected_3d_classes/particles.star --ref 3i3e_lp50A.mrc --firstiter_cc --ini_high 50 --dont_combine_weights_via_disc --pool 3 --ctf --ctf_corrected_ref --particle_diameter 200 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym D2 --low_resol_join_halves 40 --norm --scale  --j 3 --gpu 0
 === RELION MPI setup ===
 + Number of MPI processes             = 3
 + Number of threads per MPI process  = 3
 + Total number of threads therefore  = 9
 + Master  (0) runs on host            = rekiem-t5400.cscdom.csc.mrc.ac.uk
 + Slave     1 runs on host            = rekiem-t5400.cscdom.csc.mrc.ac.uk
 + Slave     2 runs on host            = rekiem-t5400.cscdom.csc.mrc.ac.uk
 =================
 Running CPU instructions in double precision. 
 Estimating initial noise spectra 
000/??? sec ~~(,_,">                                                          [o   2/   2 sec ...........................................................~~(,_,"   2/   2 sec ............................................................~~(,_,">
 uniqueHost rekiem-t5400.cscdom.csc.mrc.ac.uk has 2 ranks.
 Slave 1 will distribute threads over devices  0
 Thread 0 on slave 1 mapped to device 0
 Thread 1 on slave 1 mapped to device 0
 Thread 2 on slave 1 mapped to device 0
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 17530 on node rekiem-t5400 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

Unless you use the command I provided before, the errors will not show up, so please use that to generate output to post.

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

I checked that and fixed the problem, but it still does not run :/ This time I used 2 MPI processes and 8 threads in total, but the result is the same regardless of the number of threads I use.

#!c++

`which relion_refine_mpi` --o Refine3D/job070/run --auto_refine --split_random_halves --i Select/2d_classes_after_big_extraction/particles.star --ref Class3D/second_3d_clasification/run_it002_class004.mrc --firstiter_cc --ini_high 50 --dont_combine_weights_via_disc --pool 3 --ctf --ctf_corrected_ref --particle_diameter 200 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym D2 --low_resol_join_halves 40 --norm --scale  --j 4 --gpu 0

#!c++

=== RELION MPI setup ===
 + Number of MPI processes             = 2
 + Number of threads per MPI process  = 4
 + Total number of threads therefore  = 8
 + Master  (0) runs on host            = rekiem-t5400.cscdom.csc.mrc.ac.uk
 + Slave     1 runs on host            = rekiem-t5400.cscdom.csc.mrc.ac.uk
 =================
 Running CPU instructions in double precision. 
+++ RELION: command line arguments (with defaults for optional ones between parantheses) +++
====== General options ===== 
                                --i : Input images (in a star-file or a stack)
                                --o : Output rootname
                        --iter (50) : Maximum number of iterations to perform
                      --angpix (-1) : Pixel size (in Angstroms)
                   --tau2_fudge (1) : Regularisation parameter (values higher than 1 give more weight to the data)
                            --K (1) : Number of references to be refined
           --particle_diameter (-1) : Diameter of the circular mask that will be applied to the experimental images (in Angstroms)
                --zero_mask (false) : Mask surrounding background in particles to zero (by default the solvent area is filled with random noise)
          --flatten_solvent (false) : Perform masking on the references as well?
              --solvent_mask (None) : User-provided mask for the references (default is to use spherical mask with particle_diameter)
             --solvent_mask2 (None) : User-provided secondary mask (with its own average density)
               --multibody_masks () : STAR file with binary masks for multi-body refinement
                       --tau (None) : STAR file with input tau2-spectrum (to be kept constant)
      --split_random_halves (false) : Refine two random halves of the data completely separately
       --low_resol_join_halves (-1) : Resolution (in Angstrom) up to which the two random half-reconstructions will not be independent to prevent diverging orientations
====== Initialisation ===== 
                       --ref (None) : Image, stack or star-file with the reference(s). (Compulsory for 3D refinement!)
                       --offset (3) : Initial estimated stddev for the origin offsets
             --firstiter_cc (false) : Perform CC-calculation in the first iteration (use this if references are not on the absolute intensity scale)
                    --ini_high (-1) : Resolution (in Angstroms) to which to limit refinement in the first iteration 
====== Orientations ===== 
                 --oversampling (1) : Adaptive oversampling order to speed-up calculations (0=no oversampling, 1=2x, 2=4x, etc)
                --healpix_order (2) : Healpix order for the angular sampling (before oversampling) on the (3D) sphere: hp2=15deg, hp3=7.5deg, etc
                    --psi_step (-1) : Sampling rate (before oversampling) for the in-plane angle (default=10deg for 2D, hp sampling for 3D)
                 --limit_tilt (-91) : Limited tilt angle: positive for keeping side views, negative for keeping top views
                         --sym (c1) : Symmetry group
                 --offset_range (6) : Search range for origin offsets (in pixels)
                  --offset_step (2) : Sampling rate (before oversampling) for origin offsets (in pixels)
         --helical_offset_step (-1) : Sampling rate (before oversampling) for offsets along helical axis (in pixels)
                    --perturb (0.5) : Perturbation factor for the angular sampling (0=no perturb; 0.5=perturb)
              --auto_refine (false) : Perform 3D auto-refine procedure?
     --auto_local_healpix_order (4) : Minimum healpix order (before oversampling) from which autosampling procedure will use local searches
                   --sigma_ang (-1) : Stddev on all three Euler angles for local angular searches (of +/- 3 stddev)
                   --sigma_rot (-1) : Stddev on the first Euler angle for local angular searches (of +/- 3 stddev)
                  --sigma_tilt (-1) : Stddev on the second Euler angle for local angular searches (of +/- 3 stddev)
                   --sigma_psi (-1) : Stddev on the in-plane angle for local angular searches (of +/- 3 stddev)
               --skip_align (false) : Skip orientational assignment (only classify)?
              --skip_rotate (false) : Skip rotational assignment (only translate and classify)?
              --bimodal_psi (false) : Do bimodal searches of psi angle?
====== Helical symmetry (in development...) ===== 
                    --helix (false) : Perform 3D classification or refinement for helices?
               --helical_nr_asu (1) : Number of new helical asymmetric units (asu) per box (1 means no helical symmetry is present)
       --helical_twist_initial (0.) : Helical twist (in degrees, positive values for right-handedness)
           --helical_twist_min (0.) : Minimum helical twist (in degrees, positive values for right-handedness)
           --helical_twist_max (0.) : Maximum helical twist (in degrees, positive values for right-handedness)
       --helical_twist_inistep (0.) : Initial step of helical twist search (in degrees)
        --helical_rise_initial (0.) : Helical rise (in Angstroms)
            --helical_rise_min (0.) : Minimum helical rise (in Angstroms)
            --helical_rise_max (0.) : Maximum helical rise (in Angstroms)
        --helical_rise_inistep (0.) : Initial step of helical rise search (in Angstroms)
       --helical_z_percentage (0.3) : This box length along the center of Z axis contains good information of the helix. Important in imposing and refining symmetry
     --helical_inner_diameter (-1.) : Inner diameter of helical tubes in Angstroms (for masks of helical references and particles)
     --helical_outer_diameter (-1.) : Outer diameter of helical tubes in Angstroms (for masks of helical references and particles)
  --helical_symmetry_search (false) : Perform local refinement of helical symmetry?
     --helical_sigma_distance (-1.) : Sigma of distance along the helical tracks
====== Corrections ===== 
                      --ctf (false) : Perform CTF correction?
    --ctf_intact_first_peak (false) : Ignore CTFs until their first peak?
        --ctf_corrected_ref (false) : Have the input references been CTF-amplitude corrected?
        --ctf_phase_flipped (false) : Have the data been CTF phase-flipped?
           --ctf_multiplied (false) : Have the data been premultiplied with their CTF?
         --only_flip_phases (false) : Only perform CTF phase-flipping? (default is full amplitude-correction)
                     --norm (false) : Perform normalisation-error correction?
                    --scale (false) : Perform intensity-scale corrections on image groups?
====== Computation ===== 
                         --pool (1) : Number of images to pool for each thread task
                            --j (1) : Number of threads to run in parallel (only useful on multi-core machines)
  --dont_combine_weights_via_disc (false) : Send the large arrays of summed weights through the MPI network, instead of writing large files to disc
          --onthefly_shifts (false) : Calculate shifted images on-the-fly, do not store precalculated ones in memory
      --no_parallel_disc_io (false) : Do NOT let parallel (MPI) processes access the disc simultaneously (use this option with NFS)
           --preread_images (false) : Use this to let the master process read all particles into memory. Be careful you have enough RAM for large data sets!
                   --scratch_dir () : If provided, particle stacks will be copied to this local scratch disk prior to refinement.
           --keep_free_scratch (10) : Space available for copying particle stacks (in Gb)
                      --gpu (false) : Use available gpu resources for some calculations
              --free_gpu_memory (0) : GPU device memory (in Mb) to leave free after allocation.
====== Expert options ===== 
                          --pad (2) : Oversampling factor for the Fourier transforms of the references
                       --NN (false) : Perform nearest-neighbour instead of linear Fourier-space interpolation?
                    --r_min_nn (10) : Minimum number of Fourier shells to perform linear Fourier-space interpolation
                         --verb (1) : Verbosity (1=normal, 0=silent)
                 --random_seed (-1) : Number for the random seed generator
                 --coarse_size (-1) : Maximum image size for the first pass of the adaptive sampling approach
        --adaptive_fraction (0.999) : Fraction of the weights to be considered in the first pass of adaptive oversampling 
                     --maskedge (5) : Width of the soft edge of the spherical mask (in pixels)
          --fix_sigma_noise (false) : Fix the experimental noise spectra?
         --fix_sigma_offset (false) : Fix the stddev in the origin offsets?
                   --incr_size (10) : Number of Fourier shells beyond the current resolution to be included in refinement
    --print_metadata_labels (false) : Print a table with definitions of all metadata labels, and exit
       --print_symmetry_ops (false) : Print all symmetry transformation matrices, and exit
          --strict_highres_exp (-1) : Resolution limit (in Angstrom) to restrict probability calculations in the expectation step
          --dont_check_norm (false) : Skip the check whether the images are normalised correctly
                --always_cc (false) : Perform CC-calculation in all iterations (useful for faster denovo model generation?)
      --solvent_correct_fsc (false) : Correct FSC curve for the effects of the solvent mask?
====== MPI options ===== 
  --only_do_unfinished_movies (false) : When processing movies on a per-micrograph basis, ignore those movies for which the output STAR file already exists.
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

"Experiment::write: Cannot write file: Refine3D/job069/run_test_it000_data.star"

My best guess is that you don't have write access to the output location, or that it does not exist.
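Both possibilities can be checked before relaunching the job. The following is a minimal sketch, assuming that the file is written into the directory part of the `--o` rootname. `check_output_dir` is a hypothetical helper name, not something RELION provides.

```shell
# Hypothetical helper (not part of RELION): check that the directory part of
# a --o rootname exists and is writable by the current user.
check_output_dir() {
    local rootname="$1"
    local dir
    dir="$(dirname "$rootname")"
    if [ ! -d "$dir" ]; then
        echo "missing: $dir"
        return 1
    fi
    if [ ! -w "$dir" ]; then
        echo "not writable: $dir"
        return 2
    fi
    echo "ok: $dir"
}
```

For the command above this would be `check_output_dir Refine3D/job069/run_test`. If it reports a missing directory, `mkdir -p Refine3D/job069` should create it.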

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

This is what I have:

#!c++

[iiciieii@rekiem-t5400 betagal]$ mpirun --mca orte_base_help_aggregate 0 -n 3 relion_refine_mpi --o Refine3D/job069/run_test --auto_refine --split_random_halves --i Select/2d_classes_after_big_extraction/particles.star --ref Class3D/second_3d_clasification/run_it002_class004.mrc --firstiter_cc --ini_high 50 --dont_combine_weights_via_disc --pool 50 --ctf --ctf_corrected_ref --particle_diameter 200 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym D2 --low_resol_join_halves 40 --norm --scale  --j 1 –gpu 0
 === RELION MPI setup ===
 + Number of MPI processes             = 3
 + Master  (0) runs on host            = rekiem-t5400.cscdom.csc.mrc.ac.uk
 + Slave     1 runs on host            = rekiem-t5400.cscdom.csc.mrc.ac.uk
 + Slave     2 runs on host            = rekiem-t5400.cscdom.csc.mrc.ac.uk
 =================
 Running CPU instructions in double precision. 
 Estimating initial noise spectra 
000/??? sec ~~(,_,">                                                          [o   0/   0 sec ...~~(,_,">+++ RELION: command line arguments (with defaults for optional ones between parantheses) +++
====== General options ===== 
                                --i : Input images (in a star-file or a stack)
                                --o : Output rootname
                        --iter (50) : Maximum number of iterations to perform
                      --angpix (-1) : Pixel size (in Angstroms)
                   --tau2_fudge (1) : Regularisation parameter (values higher than 1 give more weight to the data)
                            --K (1) : Number of references to be refined
           --particle_diameter (-1) : Diameter of the circular mask that will be applied to the experimental images (in Angstroms)
                --zero_mask (false) : Mask surrounding background in particles to zero (by default the solvent area is filled with random noise)
          --flatten_solvent (false) : Perform masking on the references as well?
              --solvent_mask (None) : User-provided mask for the references (default is to use spherical mask with particle_diameter)
             --solvent_mask2 (None) : User-provided secondary mask (with its own average density)
               --multibody_masks () : STAR file with binary masks for multi-body refinement
                       --tau (None) : STAR file with input tau2-spectrum (to be kept constant)
      --split_random_halves (false) : Refine two random halves of the data completely separately
       --low_resol_join_halves (-1) : Resolution (in Angstrom) up to which the two random half-reconstructions will not be independent to prevent diverging orientations
====== Initialisation ===== 
                       --ref (None) : Image, stack or star-file with the reference(s). (Compulsory for 3D refinement!)
                       --offset (3) : Initial estimated stddev for the origin offsets
             --firstiter_cc (false) : Perform CC-calculation in the first iteration (use this if references are not on the absolute intensity scale)
                    --ini_high (-1) : Resolution (in Angstroms) to which to limit refinement in the first iteration 
====== Orientations ===== 
                 --oversampling (1) : Adaptive oversampling order to speed-up calculations (0=no oversampling, 1=2x, 2=4x, etc)
                --healpix_order (2) : Healpix order for the angular sampling (before oversampling) on the (3D) sphere: hp2=15deg, hp3=7.5deg, etc
                    --psi_step (-1) : Sampling rate (before oversampling) for the in-plane angle (default=10deg for 2D, hp sampling for 3D)
                 --limit_tilt (-91) : Limited tilt angle: positive for keeping side views, negative for keeping top views
                         --sym (c1) : Symmetry group
                 --offset_range (6) : Search range for origin offsets (in pixels)
                  --offset_step (2) : Sampling rate (before oversampling) for origin offsets (in pixels)
         --helical_offset_step (-1) : Sampling rate (before oversampling) for offsets along helical axis (in pixels)
                    --perturb (0.5) : Perturbation factor for the angular sampling (0=no perturb; 0.5=perturb)
              --auto_refine (false) : Perform 3D auto-refine procedure?
     --auto_local_healpix_order (4) : Minimum healpix order (before oversampling) from which autosampling procedure will use local searches
                   --sigma_ang (-1) : Stddev on all three Euler angles for local angular searches (of +/- 3 stddev)
                   --sigma_rot (-1) : Stddev on the first Euler angle for local angular searches (of +/- 3 stddev)
                  --sigma_tilt (-1) : Stddev on the second Euler angle for local angular searches (of +/- 3 stddev)
                   --sigma_psi (-1) : Stddev on the in-plane angle for local angular searches (of +/- 3 stddev)
               --skip_align (false) : Skip orientational assignment (only classify)?
              --skip_rotate (false) : Skip rotational assignment (only translate and classify)?
              --bimodal_psi (false) : Do bimodal searches of psi angle?
====== Helical symmetry (in development...) ===== 
                    --helix (false) : Perform 3D classification or refinement for helices?
               --helical_nr_asu (1) : Number of new helical asymmetric units (asu) per box (1 means no helical symmetry is present)
       --helical_twist_initial (0.) : Helical twist (in degrees, positive values for right-handedness)
           --helical_twist_min (0.) : Minimum helical twist (in degrees, positive values for right-handedness)
           --helical_twist_max (0.) : Maximum helical twist (in degrees, positive values for right-handedness)
       --helical_twist_inistep (0.) : Initial step of helical twist search (in degrees)
        --helical_rise_initial (0.) : Helical rise (in Angstroms)
            --helical_rise_min (0.) : Minimum helical rise (in Angstroms)
            --helical_rise_max (0.) : Maximum helical rise (in Angstroms)
        --helical_rise_inistep (0.) : Initial step of helical rise search (in Angstroms)
       --helical_z_percentage (0.3) : This box length along the center of Z axis contains good information of the helix. Important in imposing and refining symmetry
     --helical_inner_diameter (-1.) : Inner diameter of helical tubes in Angstroms (for masks of helical references and particles)
     --helical_outer_diameter (-1.) : Outer diameter of helical tubes in Angstroms (for masks of helical references and particles)
  --helical_symmetry_search (false) : Perform local refinement of helical symmetry?
     --helical_sigma_distance (-1.) : Sigma of distance along the helical tracks
====== Corrections ===== 
                      --ctf (false) : Perform CTF correction?
    --ctf_intact_first_peak (false) : Ignore CTFs until their first peak?
        --ctf_corrected_ref (false) : Have the input references been CTF-amplitude corrected?
        --ctf_phase_flipped (false) : Have the data been CTF phase-flipped?
           --ctf_multiplied (false) : Have the data been premultiplied with their CTF?
         --only_flip_phases (false) : Only perform CTF phase-flipping? (default is full amplitude-correction)
                     --norm (false) : Perform normalisation-error correction?
                    --scale (false) : Perform intensity-scale corrections on image groups?
====== Computation ===== 
                         --pool (1) : Number of images to pool for each thread task
                            --j (1) : Number of threads to run in parallel (only useful on multi-core machines)
  --dont_combine_weights_via_disc (false) : Send the large arrays of summed weights through the MPI network, instead of writing large files to disc
          --onthefly_shifts (false) : Calculate shifted images on-the-fly, do not store precalculated ones in memory
      --no_parallel_disc_io (false) : Do NOT let parallel (MPI) processes access the disc simultaneously (use this option with NFS)
           --preread_images (false) : Use this to let the master process read all particles into memory. Be careful you have enough RAM for large data sets!
                   --scratch_dir () : If provided, particle stacks will be copied to this local scratch disk prior to refinement.
           --keep_free_scratch (10) : Space available for copying particle stacks (in Gb)
                      --gpu (false) : Use available gpu resources for some calculations
              --free_gpu_memory (0) : GPU device memory (in Mb) to leave free after allocation.
====== Expert options ===== 
                          --pad (2) : Oversampling factor for the Fourier transforms of the references
                       --NN (false) : Perform nearest-neighbour instead of linear Fourier-space interpolation?
                    --r_min_nn (10) : Minimum number of Fourier shells to perform linear Fourier-space interpolation
                         --verb (1) : Verbosity (1=normal, 0=silent)
                 --random_seed (-1) : Number for the random seed generator
                 --coarse_size (-1) : Maximum image size for the first pass of the adaptive sampling approach
        --adaptive_fraction (0.999) : Fraction of the weights to be considered in the first pass of adaptive oversampling 
                     --maskedge (5) : Width of the soft edge of the spherical mask (in pixels)
          --fix_sigma_noise (false) : Fix the experimental noise spectra?
         --fix_sigma_offset (false) : Fix the stddev in the origin offsets?
                   --incr_size (10) : Number of Fourier shells beyond the current resolution to be included in refinement
    --print_metadata_labels (false) : Print a table with definitions of all metadata labels, and exit
       --print_symmetry_ops (false) : Print all symmetry transformation matrices, and exit
          --strict_highres_exp (-1) : Resolution limit (in Angstrom) to restrict probability calculations in the expectation step
          --dont_check_norm (false) : Skip the check whether the images are normalised correctly
                --always_cc (false) : Perform CC-calculation in all iterations (useful for faster denovo model generation?)
      --solvent_correct_fsc (false) : Correct FSC curve for the effects of the solvent mask?
====== MPI options ===== 
  --only_do_unfinished_movies (false) : When processing movies on a per-micrograph basis, ignore those movies for which the output STAR file already exists.
Experiment::write: Cannot write file: Refine3D/job069/run_test_it000_data.star
File: /home/iiciieii/bin/relion2-beta/src/exp_model.cpp line: 1178
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

Try this:

#!bash

mpirun --mca orte_base_help_aggregate 0 -n 3 relion_refine_mpi --o Refine3D/job067/run_test --auto_refine --split_random_halves --i Select/2d_classes_after_big_extraction/particles.star --ref Class3D/second_3d_clasification/run_it002_class004.mrc --firstiter_cc --ini_high 50 --dont_combine_weights_via_disc --pool 50 --ctf --ctf_corrected_ref --particle_diameter 200 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym D2 --low_resol_join_halves 40 --norm --scale  --j 1 --gpu

If you have 2 GPUs and just want to use one of them, only then should you use --gpu 0.

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

The exact command line was:

#!c++

[relion_refine_mpi --o Refine3D/job067/run --auto_refine --split_random_halves --i Select/2d_classes_after_big_extraction/particles.star --ref Class3D/second_3d_clasification/run_it002_class004.mrc --firstiter_cc --ini_high 50 --dont_combine_weights_via_disc --pool 3 --ctf --ctf_corrected_ref --particle_diameter 200 --flatten_solvent --zero_mask --oversampling 1 --healpix_order 2 --auto_local_healpix_order 4 --offset_range 5 --offset_step 2 --sym D2 --low_resol_join_halves 40 --norm --scale  --j 1 --gpu 0

If I enter this command manually in a terminal, I get the following output:

#!c++

=== RELION MPI setup ===
 + Number of MPI processes             = 2
 + Master  (0) runs on host            = rekiem-t5400.cscdom.csc.mrc.ac.uk
 + Slave     1 runs on host            = rekiem-t5400.cscdom.csc.mrc.ac.uk
 =================
 Running CPU instructions in double precision. 
+++ RELION: command line arguments (with defaults for optional ones between parantheses) +++
====== General options ===== 
                                --i : Input images (in a star-file or a stack)
                                --o : Output rootname
                        --iter (50) : Maximum number of iterations to perform
                      --angpix (-1) : Pixel size (in Angstroms)
                   --tau2_fudge (1) : Regularisation parameter (values higher than 1 give more weight to the data)
                            --K (1) : Number of references to be refined
           --particle_diameter (-1) : Diameter of the circular mask that will be applied to the experimental images (in Angstroms)
                --zero_mask (false) : Mask surrounding background in particles to zero (by default the solvent area is filled with random noise)
          --flatten_solvent (false) : Perform masking on the references as well?
              --solvent_mask (None) : User-provided mask for the references (default is to use spherical mask with particle_diameter)
             --solvent_mask2 (None) : User-provided secondary mask (with its own average density)
               --multibody_masks () : STAR file with binary masks for multi-body refinement
                       --tau (None) : STAR file with input tau2-spectrum (to be kept constant)
      --split_random_halves (false) : Refine two random halves of the data completely separately
       --low_resol_join_halves (-1) : Resolution (in Angstrom) up to which the two random half-reconstructions will not be independent to prevent diverging orientations
====== Initialisation ===== 
                       --ref (None) : Image, stack or star-file with the reference(s). (Compulsory for 3D refinement!)
                       --offset (3) : Initial estimated stddev for the origin offsets
             --firstiter_cc (false) : Perform CC-calculation in the first iteration (use this if references are not on the absolute intensity scale)
                    --ini_high (-1) : Resolution (in Angstroms) to which to limit refinement in the first iteration 
====== Orientations ===== 
                 --oversampling (1) : Adaptive oversampling order to speed-up calculations (0=no oversampling, 1=2x, 2=4x, etc)
                --healpix_order (2) : Healpix order for the angular sampling (before oversampling) on the (3D) sphere: hp2=15deg, hp3=7.5deg, etc
                    --psi_step (-1) : Sampling rate (before oversampling) for the in-plane angle (default=10deg for 2D, hp sampling for 3D)
                 --limit_tilt (-91) : Limited tilt angle: positive for keeping side views, negative for keeping top views
                         --sym (c1) : Symmetry group
                 --offset_range (6) : Search range for origin offsets (in pixels)
                  --offset_step (2) : Sampling rate (before oversampling) for origin offsets (in pixels)
         --helical_offset_step (-1) : Sampling rate (before oversampling) for offsets along helical axis (in pixels)
                    --perturb (0.5) : Perturbation factor for the angular sampling (0=no perturb; 0.5=perturb)
              --auto_refine (false) : Perform 3D auto-refine procedure?
     --auto_local_healpix_order (4) : Minimum healpix order (before oversampling) from which autosampling procedure will use local searches
                   --sigma_ang (-1) : Stddev on all three Euler angles for local angular searches (of +/- 3 stddev)
                   --sigma_rot (-1) : Stddev on the first Euler angle for local angular searches (of +/- 3 stddev)
                  --sigma_tilt (-1) : Stddev on the second Euler angle for local angular searches (of +/- 3 stddev)
                   --sigma_psi (-1) : Stddev on the in-plane angle for local angular searches (of +/- 3 stddev)
               --skip_align (false) : Skip orientational assignment (only classify)?
              --skip_rotate (false) : Skip rotational assignment (only translate and classify)?
              --bimodal_psi (false) : Do bimodal searches of psi angle?
====== Helical symmetry (in development...) ===== 
                    --helix (false) : Perform 3D classification or refinement for helices?
               --helical_nr_asu (1) : Number of new helical asymmetric units (asu) per box (1 means no helical symmetry is present)
       --helical_twist_initial (0.) : Helical twist (in degrees, positive values for right-handedness)
           --helical_twist_min (0.) : Minimum helical twist (in degrees, positive values for right-handedness)
           --helical_twist_max (0.) : Maximum helical twist (in degrees, positive values for right-handedness)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
       --helical_twist_inistep (0.) : Initial step of helical twist search (in degrees)
        --helical_rise_initial (0.) : Helical rise (in Angstroms)
            --helical_rise_min (0.) : Minimum helical rise (in Angstroms)
            --helical_rise_max (0.) : Maximum helical rise (in Angstroms)
        --helical_rise_inistep (0.) : Initial step of helical rise search (in Angstroms)
       --helical_z_percentage (0.3) : This box length along the center of Z axis contains good information of the helix. Important in imposing and refining symmetry
     --helical_inner_diameter (-1.) : Inner diameter of helical tubes in Angstroms (for masks of helical references and particles)
     --helical_outer_diameter (-1.) : Outer diameter of helical tubes in Angstroms (for masks of helical references and particles)
  --helical_symmetry_search (false) : Perform local refinement of helical symmetry?
     --helical_sigma_distance (-1.) : Sigma of distance along the helical tracks
====== Corrections ===== 
                      --ctf (false) : Perform CTF correction?
    --ctf_intact_first_peak (false) : Ignore CTFs until their first peak?
        --ctf_corrected_ref (false) : Have the input references been CTF-amplitude corrected?
        --ctf_phase_flipped (false) : Have the data been CTF phase-flipped?
           --ctf_multiplied (false) : Have the data been premultiplied with their CTF?
         --only_flip_phases (false) : Only perform CTF phase-flipping? (default is full amplitude-correction)
                     --norm (false) : Perform normalisation-error correction?
                    --scale (false) : Perform intensity-scale corrections on image groups?
====== Computation ===== 
                         --pool (1) : Number of images to pool for each thread task
                            --j (1) : Number of threads to run in parallel (only useful on multi-core machines)
  --dont_combine_weights_via_disc (false) : Send the large arrays of summed weights through the MPI network, instead of writing large files to disc
          --onthefly_shifts (false) : Calculate shifted images on-the-fly, do not store precalculated ones in memory
      --no_parallel_disc_io (false) : Do NOT let parallel (MPI) processes access the disc simultaneously (use this option with NFS)
           --preread_images (false) : Use this to let the master process read all particles into memory. Be careful you have enough RAM for large data sets!
                   --scratch_dir () : If provided, particle stacks will be copied to this local scratch disk prior to refinement.
           --keep_free_scratch (10) : Space available for copying particle stacks (in Gb)
                      --gpu (false) : Use available gpu resources for some calculations
              --free_gpu_memory (0) : GPU device memory (in Mb) to leave free after allocation.
====== Expert options ===== 
                          --pad (2) : Oversampling factor for the Fourier transforms of the references
                       --NN (false) : Perform nearest-neighbour instead of linear Fourier-space interpolation?
                    --r_min_nn (10) : Minimum number of Fourier shells to perform linear Fourier-space interpolation
                         --verb (1) : Verbosity (1=normal, 0=silent)
                 --random_seed (-1) : Number for the random seed generator
                 --coarse_size (-1) : Maximum image size for the first pass of the adaptive sampling approach
        --adaptive_fraction (0.999) : Fraction of the weights to be considered in the first pass of adaptive oversampling 
                     --maskedge (5) : Width of the soft edge of the spherical mask (in pixels)
          --fix_sigma_noise (false) : Fix the experimental noise spectra?
         --fix_sigma_offset (false) : Fix the stddev in the origin offsets?
                   --incr_size (10) : Number of Fourier shells beyond the current resolution to be included in refinement
    --print_metadata_labels (false) : Print a table with definitions of all metadata labels, and exit
       --print_symmetry_ops (false) : Print all symmetry transformation matrices, and exit
          --strict_highres_exp (-1) : Resolution limit (in Angstrom) to restrict probability calculations in the expectation step
          --dont_check_norm (false) : Skip the check whether the images are normalised correctly
                --always_cc (false) : Perform CC-calculation in all iterations (useful for faster denovo model generation?)
      --solvent_correct_fsc (false) : Correct FSC curve for the effects of the solvent mask?
====== MPI options ===== 
  --only_do_unfinished_movies (false) : When processing movies on a per-micrograph basis, ignore those movies for which the output STAR file already exists.

I hope it helps! :) In the middle of the output, it says:

#!c++

MPI_ABORT was invoked on rank 1 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
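
Because the help text and the abort message come from different processes, they end up interleaved in the log. A minimal sketch of a filter that hides the option listing so the remaining messages are easier to spot. It assumes every option line looks like `--name (default) : description` and every section header like `====== Title =====`, as in the paste above; the function name `strip_help_text` is made up for illustration.

```bash
#!/usr/bin/env bash
# Sketch: drop relion_refine help-text lines from a log and keep everything else.
# Assumes the line shapes seen in the pasted output; adjust the patterns if yours differ.
strip_help_text() {
    grep -Ev \
        -e '^[[:space:]]*--[A-Za-z0-9_]+ \([^)]*\) : ' \
        -e '^=+ .* =+ *$'
}

# Usage (reads stdin, writes the remaining lines to stdout):
#   strip_help_text < run.out
```

What is left should be the MPI_ABORT banner plus whatever the processes printed before they were killed.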

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

What is the exact command line used to produce mpi_2.png? And what is the exact screen output?

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

It is loaded, yes.

#!c++

[iiciieii@rekiem-t5400 betagal]$ module list
Currently Loaded Modulefiles:
  1) mpi/openmpi-x86_64

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

When you say that it is available, do you mean loaded? If it shows up with

#!bash

module avail

then that just means that you have it on your computer. You still need to load it. If you don't see it by doing

#!bash

module list

then it is not loaded. But it must be loaded if it even starts, I think.
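
The difference between available and loaded can be checked in one go. A rough sketch, assuming a classic Environment Modules setup where `module list` prints to stderr; the helper name `ensure_module_loaded` is hypothetical, and the module name in the usage line is the one from the `module list` output quoted in this thread.

```bash
#!/usr/bin/env bash
# Sketch: load a module only if it is not already in the loaded list.
ensure_module_loaded() {
    local name="$1"
    if ! command -v module >/dev/null 2>&1; then
        echo "module command not found in this shell" >&2
        return 2
    fi
    # Classic Environment Modules writes `module list` to stderr, hence 2>&1.
    if module list 2>&1 | grep -qF "$name"; then
        echo "$name is already loaded"
    else
        module load "$name" && echo "$name loaded"
    fi
}

# Usage:
#   ensure_module_loaded mpi/openmpi-x86_64
```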

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

Nope, if I try anything with MPI I still have the same error "mpirun exited on signal 11 (Segmentation fault)" :(. However, the module is available and in principle everything seems OK...

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

Is there a run you could do with MPI (2D or 3D) that is not auto-refine? Just to make sure it works? Or do you know this works already?

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

I tried adding --j 1 under additional arguments, with 1 or 8 threads, but the result was the same. It says: Cannot split data into random halves without using MPI.
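
That message suggests auto-refine cannot run as a single non-MPI process, so the two launch styles are worth comparing side by side. A small dry-run sketch that only prints the command line. It assumes the MPI build is called `relion_refine_mpi` (that name does not appear in this thread) and that `mpirun -n` sets the process count; `build_refine_cmd` is a hypothetical helper, and the flags in the usage lines are taken from the help output above.

```bash
#!/usr/bin/env bash
# Sketch: print (do not run) a refinement command for a given number of
# MPI processes and threads. All remaining arguments are passed through.
build_refine_cmd() {
    local nr_mpi="$1" nr_threads="$2"
    shift 2
    if [ "$nr_mpi" -gt 1 ]; then
        # Assumed name of the MPI binary; check your installation.
        echo "mpirun -n $nr_mpi relion_refine_mpi --j $nr_threads $*"
    else
        echo "relion_refine --j $nr_threads $*"
    fi
}

# Usage:
#   build_refine_cmd 3 1 --auto_refine --gpu
#   build_refine_cmd 1 8 --ref reference.mrc
```

As far as I understand, with 3 MPI processes one acts as master and the other two each take one random half, which would be why two processes end up sharing a single card.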

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

Well, not really. RELION uses some space for semi-permanent objects, then it grabs the rest for "casual use", i.e. it hogs as much of the GPU memory as possible to use its own personal memory management. How many threads was this (--j)?
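
To see whether the card's memory actually fills up while the job starts, the numbers from `nvidia-smi` can be watched directly. A minimal sketch, assuming the NVIDIA driver tools are installed; `summarize_gpu_mem` is a made-up helper that formats the CSV which `nvidia-smi` prints for the query shown in the usage comment.

```bash
#!/usr/bin/env bash
# Sketch: turn "index, used, total" CSV rows (values in MiB) into a readable line per GPU.
summarize_gpu_mem() {
    awk -F', *' '{ printf "GPU %s: %s/%s MiB used\n", $1, $2, $3 }'
}

# Usage on a machine with the NVIDIA driver installed:
#   nvidia-smi --query-gpu=index,memory.used,memory.total \
#              --format=csv,noheader,nounits | summarize_gpu_mem
```

If the used value barely moves after launch, the job probably died before allocating anything on the device. The help output also lists `--free_gpu_memory` (in Mb) for leaving some device memory unallocated.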

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

Sorry! I had it wrong. With -j 1 nothing happens; I still get the random-halves error. As for the other jobs, I get the same error, exiting on signal 11.

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

Sorry, I did not see the relion_refine -j 1 command!! I will check now!

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

If I try to skip the MPI there is another error, because the data cannot be split into random halves...

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

Of course! No error is reported, so run.err is actually empty. This is a fresh job I have just run with 8 threads and 3 MPI processes. I was watching the GPU while the job tried to run, and the memory usage stayed at just 544 the whole time. It seems that the card did not start working at all.

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

what happens if you skip mpi? That is

relion_refine -j 1

bforsbe commented 8 years ago

Original comment by Alberto Riera (Bitbucket: iiciieii, GitHub: Unknown):

It is not a lot of RAM, very true... I was trying to run just 1 MPI process on the card with different numbers of threads, but I had no luck either :(

bforsbe commented 8 years ago

Original comment by Bjoern Forsberg (Bitbucket: bforsbe, GitHub: bforsbe):

Yes, but this dies much too quickly. Could you attach the run.out and run.err files?

bforsbe commented 8 years ago

Original comment by Sjors Scheres (Bitbucket: scheres, GitHub: scheres):

Running 2 MPI processes on 1 card is probably just too much. 4 GB is not much memory...
