Mouse-Imaging-Centre / pydpiper

Python code for flexible pipeline control

Local pipeline randomly stalls #441

Open jamesu opened 4 years ago

jamesu commented 4 years ago

I've been trying to run MBM.py as part of the test_MBM_and_MAGeT.py test script and also separately with some other mouse brain images I have.

What seems to happen is that the pipeline will randomly stall completely. Often I will see a set of MBM.py processes still running but doing nothing, even past the default 1-minute idle time, and I then need to restart the pipeline to make further progress. Sometimes this even happens right at the start of a run (after ~8 stages). The problem seems to get worse the more executors I use.

Is there any easy way to debug this, or are there better default config options I can try to mitigate the problem?

Usually I run with some variant of this (with csv files from test-data.tar.gz):

MBM.py --pipeline-name=MBM_latest_test --num-executors=3 --verbose --init-model=/execute/test-data/Pydpiper-init-model-basket-may-2014/basket_mouse_brain.mnc --config-file=/execute/test-data/sample.cfg --lsq6-large-rotations --no-run-maget --maget-no-mask --lsq12-protocol=/execute/test-data/default_linear_MAGeT_prot.csv --files <mnc file list>

But I've also noticed this issue with full runs (with MAGeT). My configuration is just the sample local test config, i.e.:

[sample]
queue-type sge
queue-name all.q
proc 1
mem  10
time-to-accept-jobs 180
local
gdevenyi commented 4 years ago

Hi, have you confirmed there's nothing in top/htop spinning on CPU?

What's the last entry in pipeline.log?

jamesu commented 4 years ago

The only things still "spinning" are a series of forked MBM.py processes, which appear to be doing nothing, i.e.:

 3027 docker    20   0 2646476  68436   4884 S   0.0  0.4   0:00.00 MBM.py
 3501 docker    20   0  931948 102324  34028 S   0.0  0.6   0:03.91 MBM.py
 3524 docker    20   0 2947928  84476  14504 S   0.0  0.5   0:56.97 MBM.py
 3525 docker    20   0  931948  73036   4740 S   0.0  0.5   0:02.15 MBM.py
 3531 docker    20   0 1267860  72824   4248 S   0.0  0.5   0:00.00 MBM.py
 3532 docker    20   0 1267860  72824   4248 S   0.0  0.5   0:00.00 MBM.py
 3537 docker    20   0 1562820  73640   4960 S   0.0  0.5   0:00.05 MBM.py
 3538 docker    20   0 1562820  73640   4960 S   0.0  0.5   0:00.03 MBM.py
 3539 docker    20   0 1562820  73644   4960 S   0.0  0.5   0:00.05 MBM.py
 3541 docker    20   0 1571016  73648   4960 S   0.0  0.5   0:00.01 MBM.py
 3549 docker    20   0 2152692  73396   4560 S   0.0  0.5   0:00.00 MBM.py
 3550 docker    20   0 2152692  73812   4960 S   0.0  0.5   0:00.25 MBM.py

Last few entries in the pipeline.log are:

[2019-10-28 14:36:10.526,pydpiper.execution.pipeline,INFO] Running stage 550 (on PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115). Memory requested: 1.75
[2019-10-28 14:36:10.526,pydpiper.execution.pipeline,INFO] mincresample -clobber -2 -nearest_neighbour -invert -keep_real_range -transform mbm_new_images_processed/DW_Recon.8_dti/masking/transforms/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5.xfm -like mbm_new_images_processed/DW_Recon.8_dti/resampled/DW_Recon.8_dti_N_I_lsq6.mnc /execute/test-data/ex-vivo-atlases/Dorr_2008_labels.mnc mbm_new_images_atlases/Dorr_2008_labels/tmp/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5-resampled_labels.mnc
[2019-10-28 14:36:17.474,pydpiper.execution.pipeline,INFO] Stage 548 finished, return was: 0 (on PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115)
FIN: Finished Stage 548
FIN: Starting Stage 442
[2019-10-28 14:36:17.481,pydpiper.execution.pipeline,INFO] Running stage 442 (on PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115). Memory requested: 1.75
[2019-10-28 14:36:17.481,pydpiper.execution.pipeline,INFO] mincresample -clobber -2 -nearest_neighbour -invert -keep_real_range -labels -transform mbm_new_images_processed/DW_Recon.8_dti/masking/transforms/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5.xfm -like mbm_new_images_processed/DW_Recon.8_dti/resampled/DW_Recon.8_dti_N_I_lsq6.mnc /execute/test-data/ex-vivo-atlases/Dorr_2008_mask.mnc mbm_new_images_atlases/Dorr_2008_mask/tmp/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5-resampled.mnc
[2019-10-28 14:36:57.423,pydpiper.execution.pipeline,INFO] Stage 550 finished, return was: 0 (on PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115)
FIN: Finished Stage 550
[2019-10-28 14:37:04.065,pydpiper.execution.pipeline,INFO] Stage 442 finished, return was: 0 (on PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115)
FIN: Finished Stage 442
[2019-10-28 14:38:04.074,pydpiper.execution.pipeline,WARNING] Exceeded allowed idle time ... bye!
[2019-10-28 14:38:04.075,pydpiper.execution.pipeline,INFO] Main loop finished
[2019-10-28 14:38:04.075,pydpiper.execution.pipeline,INFO] Unsetting the registered-with-the-server flag for executor: PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115
[2019-10-28 14:38:04.076,pydpiper.execution.pipeline,INFO] Client un-registered (Cheers!): PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115
[2019-10-28 14:38:04.076,pydpiper.execution.pipeline,INFO] Done calling unregisterClient
[2019-10-28 14:38:04.083,pydpiper.execution.pipeline,INFO] Executor shutting down.

With the command in this case being:

/usr/local/bin/MBM.py --pipeline-name mbm_new_images --config-file=/execute/test-data/sample.cfg --init-model=/execute/test-data/Pydpiper-init-model-basket-may-2014/basket_mouse_brain.mnc --registration-method=minctracc --maget-registration-method=minctracc --num-executors=5 --lsq6-large-rotations --lsq6-large-rotations-parameters=mousebrain --maget-atlas-library=/execute/test-data/ex-vivo-atlases --lsq12-protocol=/execute/test-data/default_linear_MAGeT_prot.csv --maget-nlin-protocol=/execute/test-data/default_nlin_MAGeT_minctracc_prot.csv --nlin-protocol=/execute/test-data/default_nlin_MAGeT_minctracc_prot.csv --maget-masking-nlin-protocol=/execute/test-data/default_nlin_MAGeT_minctracc_prot.csv --proc 2 --files /execute/test-data/dti-images/DW_Recon.15_dti.mnc /execute/test-data/dti-images/DW_Recon.5_dti.mnc /execute/test-data/dti-images/DW_Recon.6_dti.mnc /execute/test-data/dti-images/DW_Recon.7_dti.mnc /execute/test-data/dti-images/DW_Recon.8_dti.mnc /execute/test-data/dti-images/DW_Recon.9_dti.mnc

Checking the whole run up to that point, a count of the completion messages suggests stage 2 never completed. This seems to have prevented stage 3 and beyond from running at all, while other stages, presumably not dependent on stage 3, have run.

Also, here is the stalled pipeline state:

Total number of stages in the pipeline:  1761
Number of stages already processed:      344 

Number of active clients:                2
Number of clients waiting in the queue:  0 

Currently running stages (3): 
0   mincblur -clobber -no_apodize -fwhm 0.28 /execute/test-data/dti-images/DW_Recon.15_dti.mnc mbm_new_images_processed/DW_Recon.15_dti/tmp/DW_Recon.15_dti_fwhm0.28 -gradient

3   mincblur -clobber -no_apodize -fwhm 0.28 /execute/test-data/dti-images/DW_Recon.5_dti.mnc mbm_new_images_processed/DW_Recon.5_dti/tmp/DW_Recon.5_dti_fwhm0.28 -gradient

110 mincblur -clobber -no_apodize -fwhm 0.25 /execute/test-data/ex-vivo-atlases/Dorr_2008_on_VVW_Sert_v1_average.mnc mbm_new_images_atlases/Dorr_2008_on_VVW_Sert_v1_average/tmp/Dorr_2008_on_VVW_Sert_v1_average_fwhm0.25 -gradient

Number of runnable stages:                0 

Number of failed stages:                  0
Number of failed/lost/dead executors:     0 

Memory requirement of runnable stages: []
Memory available in registered clients: [3.5, 3.5] 
gdevenyi commented 4 years ago

Hi,

If you're running locally, why does your config have "queue-type sge"? You need to run MBM.py with "--local".

jamesu commented 4 years ago

I was under the impression that the "local" option in the config set the run mode to local. In this particular case I am just using the sample configuration to evaluate the pipeline. Should I be using a different queue type here?

Many thanks for your assistance so far.

gdevenyi commented 4 years ago

All these settings:

queue-type sge
queue-name all.q
proc 1
mem  10
time-to-accept-jobs 180

refer to a cluster configuration and should be removed for your (local) use case.
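For example (just a sketch, and the section name is arbitrary; the config keys simply mirror the long command-line options, as in your sample.cfg), a local-only config can be as small as:

[sample]
local
num-executors 3
mem 10

Everything else can be passed on the MBM.py command line.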

This:

[2019-10-28 14:38:04.074,pydpiper.execution.pipeline,WARNING] Exceeded allowed idle time ... bye!

may be a bug triggered by time-to-accept-jobs 180, which shouldn't apply in the local case; I'll leave that for the developers to determine.

jamesu commented 4 years ago

Looking at the code, is_time_to_drain uses time-to-accept-jobs and seems to be checked here during the main loop: https://github.com/Mouse-Imaging-Centre/pydpiper/blob/master/pydpiper/execution/pipeline_executor.py#L542 - in which case it will always return True, which causes the loop to continue running (assuming I followed it through correctly). But this check seems to happen before grabbing the next command, so once time-to-accept-jobs has been exceeded the executor will never exit.

There is also another check before that against max-idle-time which actually kills the executor if it has been exceeded.
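For illustration only, here is a rough sketch of that control flow as I read it (this is not the actual pydpiper source; get_next_command, run_stage and the timing bookkeeping are stand-ins):

import time

def run_stage(cmd):
    """Placeholder: run one pipeline stage (the real executor launches a shell command)."""
    print("running:", cmd)

def executor_main_loop(get_next_command, max_idle_time, time_to_accept_jobs):
    start = time.time()
    idle_since = time.time()
    while True:
        # Idle check: if nothing has run for longer than max-idle-time,
        # the executor shuts itself down ("Exceeded allowed idle time ... bye!").
        if time.time() - idle_since > max_idle_time:
            return

        # Drain check: once time-to-accept-jobs has elapsed the executor stops
        # asking the server for new stages. Note that this happens before
        # fetching the next command, so from this point on the executor only idles.
        if time.time() - start > time_to_accept_jobs:
            time.sleep(1)
            continue

        cmd = get_next_command()  # ask the server for a runnable stage, if any
        if cmd is not None:
            run_stage(cmd)
            idle_since = time.time()
        else:
            time.sleep(1)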

I'll see if removing time-to-accept-jobs and the like from my config improves the situation, though I am still a little suspicious there may be some other variable at play here as sometimes the stall occurs well before 180 minutes has passed.

Thanks again.

bcdarwin commented 4 years ago

Note that the 180 is seconds, not minutes.

We rarely use this configuration so it's certainly possible some other issue is at play.



dorkylever commented 4 years ago

Hi @jamesu,

I've had this problem when launching MBM.py myself (for example, the pipeline misses some early mincblur stages). For local runs, I just use a config file containing only the [string] section header and specify --mem and --num-executors in the MBM.py command-line call, as sketched below.
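For example (paths and names here are just placeholders, not my real ones), the config file contains nothing but:

[local]

and the call then looks something like:

MBM.py --pipeline-name my_run --config-file local.cfg --mem 16 --num-executors 4 --init-model /path/to/init_model.mnc --files /path/to/images/*.mnc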

As Pydpiper automatically saves which stage you are at, I've just been relaunching the command, and that fixes the problem (I'm sorry that it's tedious).

Kind Regards, Kyle Drover

jamesu commented 4 years ago

Ah, good to hear someone else is having this issue. Indeed, you can keep restarting the pipeline, but it seems a little silly to me that we have to resort to working around the problem in a tool that should be doing the job for us.

For reference, on my end removing time-to-accept-jobs didn't seem to fix the problem, so I can only assume there is some edge-case issue being triggered here. I'll see if I can get more info on the state of the processes on a future run.

dorkylever commented 4 years ago

How long are your filenames?

I think that might have an effect on whether stages get missed or not, but I haven't tested it properly (I stumbled upon this while trying to correct an incorrect filename being written).

jamesu commented 4 years ago

@Dorky-Lever No longer than the filenames used in the test brain files, in this case.

dorkylever commented 4 years ago

Hi James,

See if specifying more RAM (if possible) and reducing the number of executors makes it stall less; I've noticed that reduces some of the stalling.

So when I'm doing a local run with MBM.py:

I have h_vmem set to 60gb in my shell script and --mem set to 300gb (I'm running the code on a subnode with 1000gb of memory).

#!/bin/bash
#$ -N "test_kat_crop"
#$ -cwd
#$ -V
#$ -S /bin/bash
#$ -q "hugemem.q"
#$ -l h_vmem=60g,virtual_free=59.9g
#$ -pe threads 40
#$ -m e
#$ -M "kyle.drover@anu.edu.au"

MBM.py --pipeline-name Hen_shrt --mem 300 --num-executors 40 \
  --init-model /home/kyle_arkell/Henrik/TCP-1700003F12Rik/initial-model/E15.5_mouse_model_final_image.mnc \
  --registration-method minctracc --registration-strategy pairwise_and_build_model \
  --nlin-protocol /home/kyle_arkell/embryo_minctracc_protocol.csv \
  --no-run-maget --maget-no-mask --no-common-space-registration --verbose --config-file /home/kyle_arkell/arkell.cfg \
  --lsq12-protocol /home/kyle_arkell/Henrik/TCP-1700003F12Rik/MBM_lsq12_embryo_protocol.csv \
  --output-dir /home/kyle_arkell/Hen_files/ --files /home/kylearkell/Henrik/TCP-1700003F12Rik/m*mnc