relion_refine_mpi runs dog slow with stack input

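The latest relion_refine_mpi runs 10x-20x slower when given an image stack as input rather than a STAR file. In the runs below, RELION's own estimate for the first expectation iteration is 1.02 hrs with the stack versus 5.98 min with the STAR file; RELION v2.0 estimates 6.65 min for the same stack. All tests use CUDA 8.0 and a Tesla K80. The latest build with the stack as input: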

[ec2-user@ip-172-31-8-78 ~]$ mpirun -np 2 /home/ec2-user/relion/build/bin/relion_refine_mpi --i preprocess_test.mrcs --o testdir --angpix 0.85 --K 20 --gpu
 === RELION MPI setup ===
 + Number of MPI processes             = 2
 + Master  (0) runs on host            = ip-172-31-8-78
 + Slave     1 runs on host            = ip-172-31-8-78
 =================
 uniqueHost ip-172-31-8-78 has 1 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 1 mapped to device 0
 Running CPU instructions in double precision. 
 + WARNING: Changing psi sampling rate (before oversampling) to 5.625 degrees, for more efficient GPU calculations
 Estimating initial noise spectra 
  21/  21 sec ............................................................~~(,_,">
 Estimating accuracies in the orientational assignment ... 
   0/   0 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 30.1 degrees; offsets= 10.1 pixels
 CurrentResolution= 11.6571 Angstroms, which requires orientationSampling of at least 17.1429 degrees for a particle of diameter 77.35 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 37120
 OrientationalSampling= 5.625 NrOrientations= 64
 TranslationalSampling= 2 NrTranslations= 29
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1187840
 OrientationalSampling= 2.8125 NrOrientations= 512
 TranslationalSampling= 1 NrTranslations= 116
=============================
 Expectation iteration 1 of 50
0.02/1.02 hrs .~~(,_,">

The latest build with a STAR file:

[ec2-user@ip-172-31-8-78 ~]$ mpirun -np 2 /home/ec2-user/relion/build/bin/relion_refine_mpi --i qstack10-complete_relion_stack.star --o testdir --angpix 0.85 --dont_check_norm --K 20 --gpu
 === RELION MPI setup ===
 + Number of MPI processes             = 2
 + Master  (0) runs on host            = ip-172-31-8-78
 + Slave     1 runs on host            = ip-172-31-8-78
 =================
 uniqueHost ip-172-31-8-78 has 1 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 1 mapped to device 0
 Running CPU instructions in double precision. 
 + WARNING: Changing psi sampling rate (before oversampling) to 5.625 degrees, for more efficient GPU calculations
 Estimating initial noise spectra 
  21/  21 sec ............................................................~~(,_,">
 Estimating accuracies in the orientational assignment ... 
   0/   0 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.4 degrees; offsets= 0.3 pixels
 CurrentResolution= 11.6571 Angstroms, which requires orientationSampling of at least 17.1429 degrees for a particle of diameter 77.35 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 37120
 OrientationalSampling= 5.625 NrOrientations= 64
 TranslationalSampling= 2 NrTranslations= 29
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1187840
 OrientationalSampling= 2.8125 NrOrientations= 512
 TranslationalSampling= 1 NrTranslations= 116
=============================
 Expectation iteration 1 of 50

0.10/5.98 min .~~(,_,">^C[ec2-user@ip-172-31-8-78 ~]$                         [oo]

And with RELION v2.0, using the same stack:

[ec2-user@ip-172-31-8-78 ~]$ mpirun -np 2 /home/ec2-user/relion-2-cuda8/build/bin/relion_refine_mpi --i preprocess_test.mrcs --o testdir --angpix 0.85 --K 20 --gpu
 === RELION MPI setup ===
 + Number of MPI processes             = 2
 + Master  (0) runs on host            = ip-172-31-8-78
 + Slave     1 runs on host            = ip-172-31-8-78
 =================
 Running CPU instructions in double precision. 
 + WARNING: Changing psi sampling rate (before oversampling) to 5.625 degrees, for more efficient GPU calculations
 Estimating initial noise spectra 
  13/  13 sec ............................................................~~(,_,">
 uniqueHost ip-172-31-8-78 has 1 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 1 mapped to device 0
 Estimating accuracies in the orientational assignment ... 
   1/   1 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 30.1 degrees; offsets= 10.1 pixels
 CurrentResolution= 11.6571 Angstroms, which requires orientationSampling of at least 17.1429 degrees for a particle of diameter 77.35 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 37120
 OrientationalSampling= 5.625 NrOrientations= 64
 TranslationalSampling= 2 NrTranslations= 29
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1187840
 OrientationalSampling= 2.8125 NrOrientations= 512
 TranslationalSampling= 1 NrTranslations= 116
=============================
 Expectation iteration 1 of 50
0.50/6.65 min ....~~(,_,">^C[ec2-user@ip-172-31-8-78 ~]$                      [oo]

With RELION v2.0 and the STAR file:

[ec2-user@ip-172-31-8-78 ~]$ mpirun -np 2 /home/ec2-user/relion-2-cuda8/build/bin/relion_refine_mpi --i qstack10-complete_relion_stack.star --o testdir --angpix 0.85 --K 20 --gpu --dont_check_norm
 === RELION MPI setup ===
 + Number of MPI processes             = 2
 + Master  (0) runs on host            = ip-172-31-8-78
 + Slave     1 runs on host            = ip-172-31-8-78
 =================
 Running CPU instructions in double precision. 
 + WARNING: Changing psi sampling rate (before oversampling) to 5.625 degrees, for more efficient GPU calculations
 Estimating initial noise spectra 
  11/  11 sec ............................................................~~(,_,">
 uniqueHost ip-172-31-8-78 has 1 ranks.
GPU-ids not specified for this rank, threads will automatically be mapped to available devices.
 Thread 0 on slave 1 mapped to device 0
 Estimating accuracies in the orientational assignment ... 
   1/   1 sec ............................................................~~(,_,">
 Auto-refine: Estimated accuracy angles= 0.3 degrees; offsets= 0.15 pixels
 CurrentResolution= 11.6571 Angstroms, which requires orientationSampling of at least 17.1429 degrees for a particle of diameter 77.35 Angstroms
 Oversampling= 0 NrHiddenVariableSamplingPoints= 37120
 OrientationalSampling= 5.625 NrOrientations= 64
 TranslationalSampling= 2 NrTranslations= 29
=============================
 Oversampling= 1 NrHiddenVariableSamplingPoints= 1187840
 OrientationalSampling= 2.8125 NrOrientations= 512
 TranslationalSampling= 1 NrTranslations= 116
=============================
 Expectation iteration 1 of 50
0.52/4.12 min .......~~(,_,">[ec2-user@ip-172-31-8-78 ~]$                     [oo]
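
To put numbers on the comparison, here is a minimal sketch that turns the four progress lines above into slowdown ratios. The function and dictionary names are made up for illustration and are not part of RELION. The values are RELION's early estimates of total iteration time, taken a few seconds into each run, so treat the ratios as rough.

```python
import re

# Progress lines copied from the four runs above: "<elapsed>/<estimated total> <unit>".
# The keys are labels chosen for this sketch only.
PROGRESS = {
    ("latest", "stack"): "0.02/1.02 hrs",
    ("latest", "star"): "0.10/5.98 min",
    ("v2.0", "stack"): "0.50/6.65 min",
    ("v2.0", "star"): "0.52/4.12 min",
}

# Units as they appear in the logs, converted to minutes.
UNIT_TO_MINUTES = {"sec": 1.0 / 60.0, "min": 1.0, "hrs": 60.0}


def estimated_minutes(progress: str) -> float:
    """Return the estimated total time of a progress line, in minutes."""
    match = re.fullmatch(r"\s*([\d.]+)/\s*([\d.]+)\s+(sec|min|hrs)\s*", progress)
    if match is None:
        raise ValueError(f"unrecognised progress line: {progress!r}")
    return float(match.group(2)) * UNIT_TO_MINUTES[match.group(3)]


def slowdown(slow: tuple, fast: tuple) -> float:
    """Ratio of the estimated iteration times of two runs."""
    return estimated_minutes(PROGRESS[slow]) / estimated_minutes(PROGRESS[fast])


if __name__ == "__main__":
    pairs = [
        (("latest", "stack"), ("latest", "star")),
        (("latest", "stack"), ("v2.0", "stack")),
        (("v2.0", "stack"), ("v2.0", "star")),
    ]
    for slow, fast in pairs:
        print(f"{slow} vs {fast}: {slowdown(slow, fast):.1f}x")
```

On these estimates the latest build is about 10x slower with the stack than with the STAR file, and about 9x slower than v2.0 on the same stack, while v2.0 itself shows well under a 2x difference between the two inputs.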

3dem/relion issue #297