In general, having as many executors as the number of subjects in the pipeline makes the most sense. Many parts of the pipeline work on a stage for each input file in parallel, after which a single stage combines that information before the pipeline can continue with each individual file in parallel again.
Given that, it makes sense that the number of running executors ended up being the number of input files. When one input file takes much longer than the others for a particular (non-linear) registration stage, it also makes sense for the remaining executors to die, because the whole pipeline can only continue once that stage finishes (for instance, to create the next average to be used as the target for the following registration stage).
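To make that fan-out/fan-in structure concrete, here is a tiny sketch (hypothetical stage names, not pydpiper's actual stage objects) of why idle executors run out of work while one slow registration is still going:

# Illustrative only: fan-out/fan-in stage dependencies as described above.
per_file_registrations = ["register_file_%d" % i for i in range(25)]  # one stage per input file
# The averaging stage depends on *all* per-file registrations, so it only
# becomes runnable once the slowest registration has finished; until then
# there are no runnable stages left and the idle executors shut down.
average_stage = {"name": "create_average", "depends_on": per_file_registrations}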
Generally, when new stages become available, the server process will spawn new executors, so that should have happened once the longer-running registration stage completed. Did you run the check_pipeline_status.py program with the server URI to find out what the state of the pipeline was when it was stranded?
----- Original Message -----
From: "Gabriel A. Devenyi" notifications@github.com
Sent: Tuesday, November 18, 2014 9:51:52 AM
Subject: [pydpiper] BUG: pydpiper SGE executors die for long processing steps, leave pipeline stranded (#153)
I'm running MBM on an SGE cluster. My run submission was: MBM.py --queue=sge --num-executors=240 --proc=1 --mem=4 --verbose --pipeline-name=rat --lsq6-large-rotations --registration-method=mincANTS --calc-stats --init-model=../MBM-input/model/A10FISP.mnc ../MBM-input/*mnc
The job executors seem to get pared down to the number of input files (25) and process away.
For some reason, one of the non-linear registrations took much longer than the others to complete; in that time, it looks like the other executors died off.
When the longer-running registration completed, no new executors were spawned, and the pipeline was left stranded with nothing to do the work.
Yes, I did. Here's the status of the stranded pipeline:
$ check_pipeline_status.py uri
Total number of stages in the pipeline: 3459
Number of stages already processed: 2252
Number of active clients: 0
Number of clients waiting in the queue: 0
Currently running stages:
Number of runnable stages: 25
Hmmm... I'll go through the code's logic to see why no new executors were launched in this case. When the sum of active and waiting clients is smaller than the number of runnable stages, new executors should be launched.
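For reference, the condition described is roughly the following (a minimal sketch with illustrative names, not pydpiper's actual variables):

# Rough sketch of the spawning condition described above (illustrative only).
def should_launch_executors(active_clients, waiting_clients, runnable_stages):
    # Launch new executors if fewer clients exist than there are runnable stages.
    return (active_clients + waiting_clients) < runnable_stages

# With the stranded state reported above (0 active, 0 waiting, 25 runnable),
# the check would say that new executors are needed:
print(should_launch_executors(0, 0, 25))  # -> True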
Did you Ctrl+C the pipeline when it was stuck and rerun the same command? If so, did that pick up where the pipeline had left off?
Yup, when I Ctrl-C'd and restarted the pipeline continued.
Incidentally, it just happened again after finishing another round of nonlinear registration. I had to restart again to get it to continue with xfmconcat.
A couple of questions. In the pipelines that stranded, did you see any warning messages like this:
"Warning: there has been no contact with ..., for ... seconds. Considering the executor as dead!"
And did the following line occur in one of the pipeline logs:
"Too many executors lost to spawn new ones"
The output to screen is long gone, so I can't comment on the first one, but for the second:
pipeline_executor-2014-11-19-at-09:23:42-pid-13956.log:[2014-11-19 09:42:20.792,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
pipeline_executor-2014-11-19-at-09:23:42-pid-13956.log:[2014-11-19 09:47:21.095,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
pipeline_executor-2014-11-19-at-09:23:42-pid-13956.log:[2014-11-19 09:47:21.095,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
pipeline_executor-2014-11-19-at-09:23:42-pid-13956.log:[2014-11-19 10:32:23.846,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
pipeline_executor-2014-11-19-at-09:23:42-pid-13956.log:[2014-11-19 10:32:23.847,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
pipeline_executor-2014-11-19-at-09:23:42-pid-13956.log:[2014-11-19 12:37:01.405,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
pipeline_executor-2014-11-19-at-09:23:42-pid-13956.log:[2014-11-19 14:33:08.562,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
Based on your line of questioning, is this maybe due to my submitting so many executors?
This warning occurs when the server loses contact with an executor (e.g., due to an executor bug, a machine crash, or a network partition), not on normal termination. Your problem occurs because there's a limit (controlled by --max-failed-executors, default 2) to the number of such failures allowed, to prevent infinite spawning of executors if, e.g., pipeline_executor.py has a bug or things aren't installed properly. It's hard to tell what happened here ... are there any clues in the executor logs as to whether the machine failed or an exception was raised? (One possibility is that the executors are neglecting to unregister for some reason.)
Admittedly, the server should be slightly smarter about noticing this situation.
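A minimal sketch of the kind of guard being described (the names are illustrative and the real pydpiper bookkeeping differs, but the threshold corresponds to the --max-failed-executors default of 2 mentioned above):

# Illustrative sketch only -- not the actual pydpiper server code.
class ServerState(object):
    def __init__(self, max_failed_executors=2):
        self.max_failed_executors = max_failed_executors
        self.failed_executors = 0

    def note_executor_lost(self):
        # Called when the server loses contact with an executor.
        self.failed_executors += 1

    def may_spawn_new_executors(self):
        # Once too many executors have been lost, refuse to spawn new ones;
        # this is what produces the "Too many executors lost to spawn new ones"
        # warning seen in the logs below.
        return self.failed_executors <= self.max_failed_executors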
From what I can tell, nothing went wrong with any of the executors. Here's the context of "Too many executors lost" from pipeline_executor-2014-11-19-at-09:23:42-pid-13956.log:
[2014-11-19 09:37:35.500,pydpiper.pipeline,WARNING] Executor at PYRO:obj_c176f9305ca94affb4ec6391dcab3250@172.16.67.225:43948 has died!
[2014-11-19 09:37:35.505,pydpiper.pipeline,WARNING] Executor at PYRO:obj_05dd7225a6be42b1a9050f86bc53517c@172.16.67.223:49272 has died!
[2014-11-19 09:42:20.791,pydpiper.pipeline,WARNING] Executor at PYRO:obj_3979daf0c6eb4ac29e457d5c16d7afc6@172.16.67.225:50443 has died!
[2014-11-19 09:42:20.792,pydpiper.pipeline,WARNING] Executor at PYRO:obj_39aae0d4565e4d579f91fb35541838e1@172.16.67.223:33962 has died!
[2014-11-19 09:42:20.792,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
[2014-11-19 09:47:21.094,pydpiper.pipeline,WARNING] Executor at PYRO:obj_b9934b9f92b94fa2b77bdcb54f897ec2@172.16.67.223:36835 has died!
[2014-11-19 09:47:21.095,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
[2014-11-19 09:47:21.095,pydpiper.pipeline,WARNING] Executor at PYRO:obj_c37328dfb4ab4166bb4d71e3f2d80f2c@172.16.67.225:40858 has died!
[2014-11-19 09:47:21.095,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
[2014-11-19 10:32:23.846,pydpiper.pipeline,WARNING] Executor at PYRO:obj_6eae34c5c4af4bd1a35a1e51aba7d1b4@172.16.67.225:34929 has died!
[2014-11-19 10:32:23.846,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
[2014-11-19 10:32:23.846,pydpiper.pipeline,WARNING] Executor at PYRO:obj_21990f8d3d204c7880fde5e75b04bd4b@172.16.67.223:42766 has died!
[2014-11-19 10:32:23.847,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
[2014-11-19 12:37:01.404,pydpiper.pipeline,WARNING] Executor at PYRO:obj_b5630c1ddb08407bac4175129faa4a7e@172.16.67.222:42054 has died!
[2014-11-19 12:37:01.405,pydpiper.pipeline,WARNING] Too many executors lost to spawn new ones
And here is the log for the executor at port 42054 (pipeline_executor-2014-11-19-at-09:34:35-pid-3254.log):
[2014-11-19 09:34:35.863,Pyro4,INFO] Pyro log configured using built-in defaults, level=INFO
[2014-11-19 09:34:36.918,Pyro4.multiplexserver,INFO] starting multiplexed socketserver
[2014-11-19 09:34:36.928,__main__,INFO] Connected to PYRO:obj_4a855a28532b46ed90feb793142ee215@172.16.67.229:33768
[2014-11-19 09:34:36.929,__main__,INFO] Client URI is PYRO:obj_b5630c1ddb08407bac4175129faa4a7e@172.16.67.222:42054
[2014-11-19 09:34:36.929,__main__,INFO] Connected to the server at: 2014-11-19 09:34:36.929620
[2014-11-19 09:34:37.443,Pyro4.core,INFO] daemon 172.16.67.222:42054 entering requestloop
[2014-11-19 09:34:37.448,__main__,INFO] Running stage 10962:
[2014-11-19 09:34:37.459,__main__,INFO] mincANTS 3 --number-of-affine-iterations 0 -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/resampled/A30FISP-resampled-lsq12.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1.mnc,1,3]' -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_fwhm0.2_dxyz.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1/tmp/avgmodel-nlin-1_fwhm0.2_dxyz.mnc,1,3]' -t 'SyN[0.4]' -r 'Gauss[5,1]' -i 100x100x100x20 -o /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_to_avgmodel-nlin-1_nlin_0.xfm -x /data/chamal/projects/gabriel/transgentic_rat_build_model/average_model/average_mask.mnc
[2014-11-19 14:00:58.145,__main__,INFO] Stage 10962 finished, return was: 0
[2014-11-19 14:00:59.416,__main__,ERROR] Error communicating to server in runStage. Error raised to calling thread in launchExecutor.
Traceback (most recent call last):
File "/opt/quarantine/pydpiper/1.10/build/bin/pipeline_executor.py", line 197, in runStage
client.notifyStageTerminated(i, ret)
File "/opt/quarantine/anaconda/2.1.0/build/lib/python2.7/site-packages/Pyro4/core.py", line 168, in __call__
return self.__send(self.__name, args, kwargs)
File "/opt/quarantine/anaconda/2.1.0/build/lib/python2.7/site-packages/Pyro4/core.py", line 373, in _pyroInvoke
raise data
ValueError: list.remove(x): x not in list
[2014-11-19 14:00:59.513,__main__,INFO] Executor shutting down.
Is this a stage crash, or a bug?
This is an executor bug ... we're currently fixing the error messages and adding some logic so the server will shut down in this situation. The cause of the bug is not clear - does the pipeline log show stage 10962 being started multiple times? We'll shortly release 1.11-alpha, which has some logic to detect this situation (and has different client-tracking internals) - please consider upgrading. (Note: it requires the ConfigArgParse library, a very light dependency.)
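To illustrate the failure mode (a toy reproduction only, not pydpiper's actual data structures): if the server removes a stage from its list of currently running stages when the first termination notification arrives, a second notification for the same stage hits exactly the exception shown in the traceback above.

# Toy reproduction of the symptom (illustrative, not pydpiper's bookkeeping).
currently_running_stages = [10961, 10962, 10963]

def notify_stage_terminated(stage):
    # list.remove raises ValueError if the element is no longer present.
    currently_running_stages.remove(stage)

notify_stage_terminated(10962)  # first notification: fine
notify_stage_terminated(10962)  # second notification for the same stage:
                                # ValueError: list.remove(x): x not in list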
Yup, "Finshed" multiple times, though the logs only say "Running" once,
pipeline_executor-2014-11-19-at-09:34:35-pid-3254.log:[2014-11-19 09:34:37.448,__main__,INFO] Running stage 10962:
pipeline_executor-2014-11-19-at-09:23:42-pid-13956.log:[2014-11-19 14:00:58.146,pydpiper.pipeline,INFO] Finished Stage 10962: mincANTS 3 --number-of-affine-iterations 0 -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/resampled/A30FISP-resampled-lsq12.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1.mnc,1,3]' -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_fwhm0.2_dxyz.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1/tmp/avgmodel-nlin-1_fwhm0.2_dxyz.mnc,1,3]' -t 'SyN[0.4]' -r 'Gauss[5,1]' -i 100x100x100x20 -o /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_to_avgmodel-nlin-1_nlin_0.xfm -x /data/chamal/projects/gabriel/transgentic_rat_build_model/average_model/average_mask.mnc
pipeline_executor-2014-11-19-at-15:45:57-pid-13551.log:[2014-11-19 15:46:11.207,pydpiper.pipeline,INFO] Finished Stage 10962: mincANTS 3 --number-of-affine-iterations 0 -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/resampled/A30FISP-resampled-lsq12.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1.mnc,1,3]' -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_fwhm0.2_dxyz.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1/tmp/avgmodel-nlin-1_fwhm0.2_dxyz.mnc,1,3]' -t 'SyN[0.4]' -r 'Gauss[5,1]' -i 100x100x100x20 -o /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_to_avgmodel-nlin-1_nlin_0.xfm -x /data/chamal/projects/gabriel/transgentic_rat_build_model/average_model/average_mask.mnc
pipeline_executor-2014-11-20-at-09:02:41-pid-31772.log:[2014-11-20 09:03:01.854,pydpiper.pipeline,INFO] Finished Stage 10962: mincANTS 3 --number-of-affine-iterations 0 -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/resampled/A30FISP-resampled-lsq12.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1.mnc,1,3]' -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_fwhm0.2_dxyz.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1/tmp/avgmodel-nlin-1_fwhm0.2_dxyz.mnc,1,3]' -t 'SyN[0.4]' -r 'Gauss[5,1]' -i 100x100x100x20 -o /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_to_avgmodel-nlin-1_nlin_0.xfm -x /data/chamal/projects/gabriel/transgentic_rat_build_model/average_model/average_mask.mnc
pipeline_executor-2014-11-20-at-09:25:43-pid-8238.log:[2014-11-20 09:25:59.324,pydpiper.pipeline,INFO] Finished Stage 10962: mincANTS 3 --number-of-affine-iterations 0 -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/resampled/A30FISP-resampled-lsq12.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1.mnc,1,3]' -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_fwhm0.2_dxyz.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1/tmp/avgmodel-nlin-1_fwhm0.2_dxyz.mnc,1,3]' -t 'SyN[0.4]' -r 'Gauss[5,1]' -i 100x100x100x20 -o /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_to_avgmodel-nlin-1_nlin_0.xfm -x /data/chamal/projects/gabriel/transgentic_rat_build_model/average_model/average_mask.mnc
As soon as 1.11-alpha is tagged I can give it a try.
In this pipeline, are there any messages in the log files that contain "WARN" or "ERR"?
Also, could you give the content of the log file for Stage 10962? It should be in this directory:
/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/log/
And I think it should end in something along the lines of: ...-nlin-1_nlin_0.log
I think this is the log file you're referring to; it doesn't list Stage 10962 in its name or its contents, but it's the only thing that matches your naming suggestion:
Running on: ciccs03 at 2014-11-19 09:34:37.483371
mincANTS 3 --number-of-affine-iterations 0 -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/resampled/A30FISP-resampled-lsq12.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1.mnc,1,3]' -m 'CC[/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_fwhm0.2_dxyz.mnc,/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1/tmp/avgmodel-nlin-1_fwhm0.2_dxyz.mnc,1,3]' -t 'SyN[0.4]' -r 'Gauss[5,1]' -i 100x100x100x20 -o /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_to_avgmodel-nlin-1_nlin_0.xfm -x /data/chamal/projects/gabriel/transgentic_rat_build_model/average_model/average_mask.mnc
Run Reg
values 2
Fixed image file: /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_fwhm0.2_dxyz.mnc
Moving image file: /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1/tmp/avgmodel-nlin-1_fwhm0.2_dxyz.mnc
Metric 0: Not a Point-set
Fixed image file: /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_fwhm0.2_dxyz.mnc
Moving image file: /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1/tmp/avgmodel-nlin-1_fwhm0.2_dxyz.mnc
similarity metric weight: 1
Radius: [3, 3, 3]
Radius: [3, 3, 3]
Fixed image file: /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/resampled/A30FISP-resampled-lsq12.mnc
Moving image file: /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1.mnc
Metric 1: Not a Point-set
Fixed image file: /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/resampled/A30FISP-resampled-lsq12.mnc
Moving image file: /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_nlin/avgmodel-nlin-1.mnc
similarity metric weight: 1
Radius: [3, 3, 3]
Radius: [3, 3, 3]
Use identity affine transform as initial affine para.
aff_init.IsNull()==1
Use identity affine transform as initial fixed affine para.
fixed_aff_init.IsNull()==1
Continue affine registration from the input
affine_opt.use_rotation_header = 0
affine_opt.ignore_void_orgin = 0
transform_initial: IsNotNull():0
OptAffine: metric_type=AffineWithMutualInformation
MI_bins=32 MI_samples=32000
number_of_seeds=0 time_seed=1416407687
number_of_levels=1
number_of_iteration_list=[0]
graident_scales=[1,1,1,1,1,1,1,1,1,1,0.0001,0.0001,0.0001]
is_rigid = 0
mask null: 0
maximum_step_length=0.1
relaxation_factor=0.5
minimum_step_length=0.0001
translation_scales=0.0001
opt.transform_initial.IsNull(): 1
opt.use_rotation_header: 0
opt.ignore_void_orgin: 0
input affine center: [18.2299, 20.6236, 16.6811]
input affine para: [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, -0.403262, 0.474548, 0.702787]
level 0, iter 0, size: fix[180, 180, 180]-mov[180, 180, 180], affine para: [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, -0.403262, 0.474548, 0.702787]
does not reach oscillation, current step: 0.1>0.0001
final [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, -0.403262, 0.474548, 0.702787]
outputput affine center: [18.2299, 20.6236, 16.6811]
output affine para: [0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
initial measure value (MMI): rval = -0.0264113
final measure value (MMI): rval = -0.0418403
finish affine registeration...
Requested Transformation Model: SyN : Using
SyN diffeomorphic model for transformation.
Grad Step 0.4 total-smoothing 1 gradient-smoothing 5
setting N-TimeSteps = 1 trunc 256
this->m_ScaleFactor 8 nlev 4 curl 0
allocated def field 1 0 0
0 1 0
0 0 1
Its at this level 100
Allocating
Allocating Done
iteration 1 energy 0 : -3972.41 energy 1 : -5162.58
iteration 2 energy 0 : -4370.52 energy 1 : -5581.19
iteration 3 energy 0 : -4519.98 energy 1 : -5686.71
iteration 4 energy 0 : -4568.17 energy 1 : -5688.45
iteration 5 energy 0 : -4611.24 energy 1 : -5760.56
iteration 6 energy 0 : -4632.74 energy 1 : -5760.82
iteration 7 energy 0 : -4668.67 energy 1 : -5823.96
iteration 8 energy 0 : -4667.18 energy 1 : -5798.26
iteration 9 energy 0 : -4742.37 energy 1 : -5905.58
iteration 10 energy 0 : -4689.94 energy 1 : -5833.66
iteration 11 energy 0 : -4789.78 energy 1 : -5958.92
iteration 12 energy 0 : -4698.43 energy 1 : -5846.55
E-Slope 0.00371078 iteration 13 energy 0 : -4814.54 energy 1 : -5982.38
E-Slope 0.00249081 iteration 14 energy 0 : -4702.26 energy 1 : -5851.88
E-Slope 0.00230401 iteration 15 energy 0 : -4830.99 energy 1 : -5996.11
E-Slope 0.00164519 iteration 16 energy 0 : -4710.79 energy 1 : -5856.54
E-Slope 0.00164425 iteration 17 energy 0 : -4847.06 energy 1 : -6010.22
E-Slope 0.00108699 iteration 18 energy 0 : -4723.06 energy 1 : -5867.14
E-Slope 0.0011884 iteration 19 energy 0 : -4858.23 energy 1 : -6018.4
E-Slope 0.000646874 iteration 20 energy 0 : -4730.31 energy 1 : -5871.54
E-Slope 0.000965595 iteration 21 energy 0 : -4866.46 energy 1 : -6024
E-Slope 0.00042627 iteration 22 energy 0 : -4735.22 energy 1 : -5874.66
E-Slope 0.000872478 iteration 23 energy 0 : -4872.61 energy 1 : -6027.59
E-Slope 0.000292666 iteration 24 energy 0 : -4738.7 energy 1 : -5876.24
E-Slope 0.000802256 iteration 25 energy 0 : -4882.3 energy 1 : -6036.12
E-Slope 0.000170343 iteration 26 energy 0 : -4740.52 energy 1 : -5876.84
E-Slope 0.000701241 iteration 27 energy 0 : -4887.14 energy 1 : -6038.65
E-Slope 5.01052e-05 iteration 28 energy 0 : -4741.78 energy 1 : -5876.1
Converged due to oscillation in optimization metric 0 bad 11 metric 1 bad 11
this->m_ScaleFactor 4 nlev 4 curl 1
Its at this level 100
iteration 1 energy 0 : -16277.4 energy 1 : -18819.4
iteration 2 energy 0 : -16500.4 energy 1 : -19216.8
iteration 3 energy 0 : -16852.7 energy 1 : -19412.4
iteration 4 energy 0 : -16818.5 energy 1 : -19352.7
iteration 5 energy 0 : -16949.8 energy 1 : -19459.7
iteration 6 energy 0 : -16871.8 energy 1 : -19385.1
iteration 7 energy 0 : -16964.4 energy 1 : -19477.2
iteration 8 energy 0 : -16879.9 energy 1 : -19387.2
iteration 9 energy 0 : -16951.8 energy 1 : -19473.9
iteration 10 energy 0 : -16867.5 energy 1 : -19378.1
iteration 11 energy 0 : -16928.6 energy 1 : -19463
iteration 12 energy 0 : -16853.8 energy 1 : -19369.7
E-Slope 0.000548586 iteration 13 energy 0 : -16908.6 energy 1 : -19453.8
E-Slope 5.51343e-06 iteration 14 energy 0 : -16844.1 energy 1 : -19365.5
Converged due to oscillation in optimization metric 0 bad 6 metric 1 bad 6
this->m_ScaleFactor 2 nlev 4 curl 2
Its at this level 100
iteration 1 energy 0 : -117254 energy 1 : -138282
iteration 2 energy 0 : -132214 energy 1 : -152635
iteration 3 energy 0 : -136829 energy 1 : -155155
iteration 4 energy 0 : -138151 energy 1 : -156396
iteration 5 energy 0 : -139701 energy 1 : -157167
iteration 6 energy 0 : -140441 energy 1 : -158845
iteration 7 energy 0 : -140724 energy 1 : -158149
iteration 8 energy 0 : -141399 energy 1 : -160186
iteration 9 energy 0 : -140938 energy 1 : -158272
iteration 10 energy 0 : -141842 energy 1 : -160646
iteration 11 energy 0 : -141060 energy 1 : -158695
iteration 12 energy 0 : -141958 energy 1 : -160851
E-Slope 0.00249425 iteration 13 energy 0 : -141267 energy 1 : -159071
E-Slope 0.00148697 iteration 14 energy 0 : -142025 energy 1 : -160982
E-Slope 0.000953488 iteration 15 energy 0 : -141431 energy 1 : -159270
E-Slope 0.00063476 iteration 16 energy 0 : -142094 energy 1 : -161079
E-Slope 0.000407696 iteration 17 energy 0 : -141523 energy 1 : -159374
E-Slope 0.000344731 iteration 18 energy 0 : -142134 energy 1 : -161142
E-Slope 0.000214309 iteration 19 energy 0 : -141570 energy 1 : -159430
E-Slope 0.000251487 iteration 20 energy 0 : -142162 energy 1 : -161187
E-Slope 0.000121891 iteration 21 energy 0 : -141599 energy 1 : -159459
E-Slope 0.000197537 iteration 22 energy 0 : -142176 energy 1 : -161214
E-Slope 6.56694e-05 iteration 23 energy 0 : -141617 energy 1 : -159490
Converged due to oscillation in optimization metric 0 bad 8 metric 1 bad 9
this->m_ScaleFactor 1 nlev 4 curl 3
Its at this level 20
iteration 1 energy 0 : -676035 energy 1 : -906499
iteration 2 energy 0 : -760290 energy 1 : -972330
iteration 3 energy 0 : -790541 energy 1 : -1.00296e+06
iteration 4 energy 0 : -804945 energy 1 : -1.01094e+06
iteration 5 energy 0 : -810286 energy 1 : -1.01914e+06
iteration 6 energy 0 : -817562 energy 1 : -1.02207e+06
iteration 7 energy 0 : -817989 energy 1 : -1.02649e+06
iteration 8 energy 0 : -823568 energy 1 : -1.02776e+06
iteration 9 energy 0 : -822525 energy 1 : -1.03071e+06
iteration 10 energy 0 : -826403 energy 1 : -1.0307e+06
iteration 11 energy 0 : -824915 energy 1 : -1.03299e+06
iteration 12 energy 0 : -828248 energy 1 : -1.03257e+06
E-Slope 0.00314144 iteration 13 energy 0 : -826693 energy 1 : -1.03485e+06
E-Slope 0.00194723 iteration 14 energy 0 : -829541 energy 1 : -1.03388e+06
E-Slope 0.00131755 iteration 15 energy 0 : -827668 energy 1 : -1.03603e+06
E-Slope 0.00101458 iteration 16 energy 0 : -830313 energy 1 : -1.03471e+06
E-Slope 0.000717409 iteration 17 energy 0 : -828226 energy 1 : -1.03684e+06
E-Slope 0.000610508 iteration 18 energy 0 : -830754 energy 1 : -1.03526e+06
E-Slope 0.000405386 iteration 19 energy 0 : -828171 energy 1 : -1.03711e+06
E-Slope 0.00037577 iteration 20 energy 0 : -830898 energy 1 : -1.03557e+06
tired convergence: reached max iterations
Registration Done
begin writing /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_to_avgmodel-nlin-1_nlin_0.xfm
writing /data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_to_avgmodel-nlin-1_nlin_0 def
Writing:/data/chamal/projects/gabriel/transgentic_rat_build_model/MBM-test-average/avgmodel_processed/A30FISP/tmp/A30FISP-resampled-lsq12_to_avgmodel-nlin-1_nlin_0_inverse_grid_0.mnc
That log file does not indicate anything went wrong while processing that stage.
We've created a pre-release (Rizzo-alpha) which I would suggest you move over to. The changes made to the code should give us a better indication as to why this is happening.
Is this issue still affecting anyone? Or has it been fixed in the development branch?
I think that the fix for this bug introduced bug #172
Closing for now; reopen if necessary.