Open jamesu opened 4 years ago
Hi, have you confirmed there's nothing in top/htop spinning on CPU?
What's the last entry in pipeline.log?
The only things still "spinning" are a series of forked MBM.py processes, which appear to be doing nothing, i.e.:
3027 docker 20 0 2646476 68436 4884 S 0.0 0.4 0:00.00 MBM.py
3501 docker 20 0 931948 102324 34028 S 0.0 0.6 0:03.91 MBM.py
3524 docker 20 0 2947928 84476 14504 S 0.0 0.5 0:56.97 MBM.py
3525 docker 20 0 931948 73036 4740 S 0.0 0.5 0:02.15 MBM.py
3531 docker 20 0 1267860 72824 4248 S 0.0 0.5 0:00.00 MBM.py
3532 docker 20 0 1267860 72824 4248 S 0.0 0.5 0:00.00 MBM.py
3537 docker 20 0 1562820 73640 4960 S 0.0 0.5 0:00.05 MBM.py
3538 docker 20 0 1562820 73640 4960 S 0.0 0.5 0:00.03 MBM.py
3539 docker 20 0 1562820 73644 4960 S 0.0 0.5 0:00.05 MBM.py
3541 docker 20 0 1571016 73648 4960 S 0.0 0.5 0:00.01 MBM.py
3549 docker 20 0 2152692 73396 4560 S 0.0 0.5 0:00.00 MBM.py
3550 docker 20 0 2152692 73812 4960 S 0.0 0.5 0:00.25 MBM.py
Last few entries in the pipeline.log are:
[2019-10-28 14:36:10.526,pydpiper.execution.pipeline,INFO] Running stage 550 (on PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115). Memory requested: 1.75
[2019-10-28 14:36:10.526,pydpiper.execution.pipeline,INFO] mincresample -clobber -2 -nearest_neighbour -invert -keep_real_range -transform mbm_new_images_processed/DW_Recon.8_dti/masking/transforms/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5.xfm -like mbm_new_images_processed/DW_Recon.8_dti/resampled/DW_Recon.8_dti_N_I_lsq6.mnc /execute/test-data/ex-vivo-atlases/Dorr_2008_labels.mnc mbm_new_images_atlases/Dorr_2008_labels/tmp/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5-resampled_labels.mnc
[2019-10-28 14:36:17.474,pydpiper.execution.pipeline,INFO] Stage 548 finished, return was: 0 (on PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115)
FIN: Finished Stage 548
FIN: Starting Stage 442
[2019-10-28 14:36:17.481,pydpiper.execution.pipeline,INFO] Running stage 442 (on PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115). Memory requested: 1.75
[2019-10-28 14:36:17.481,pydpiper.execution.pipeline,INFO] mincresample -clobber -2 -nearest_neighbour -invert -keep_real_range -labels -transform mbm_new_images_processed/DW_Recon.8_dti/masking/transforms/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5.xfm -like mbm_new_images_processed/DW_Recon.8_dti/resampled/DW_Recon.8_dti_N_I_lsq6.mnc /execute/test-data/ex-vivo-atlases/Dorr_2008_mask.mnc mbm_new_images_atlases/Dorr_2008_mask/tmp/DW_Recon.8_dti_N_I_lsq6_mt_to_Dorr_2008_average_nlin_5-resampled.mnc
[2019-10-28 14:36:57.423,pydpiper.execution.pipeline,INFO] Stage 550 finished, return was: 0 (on PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115)
FIN: Finished Stage 550
[2019-10-28 14:37:04.065,pydpiper.execution.pipeline,INFO] Stage 442 finished, return was: 0 (on PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115)
FIN: Finished Stage 442
[2019-10-28 14:38:04.074,pydpiper.execution.pipeline,WARNING] Exceeded allowed idle time ... bye!
[2019-10-28 14:38:04.075,pydpiper.execution.pipeline,INFO] Main loop finished
[2019-10-28 14:38:04.075,pydpiper.execution.pipeline,INFO] Unsetting the registered-with-the-server flag for executor: PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115
[2019-10-28 14:38:04.076,pydpiper.execution.pipeline,INFO] Client un-registered (Cheers!): PYRO:obj_51bd7590f43e432e9ae01c6e5ee2b582@127.0.0.1:43115
[2019-10-28 14:38:04.076,pydpiper.execution.pipeline,INFO] Done calling unregisterClient
[2019-10-28 14:38:04.083,pydpiper.execution.pipeline,INFO] Executor shutting down.
With the command in this case being:
/usr/local/bin/MBM.py --pipeline-name mbm_new_images --config-file=/execute/test-data/sample.cfg --init-model=/execute/test-data/Pydpiper-init-model-basket-may-2014/basket_mouse_brain.mnc --registration-method=minctracc --maget-registration-method=minctracc --num-executors=5 --lsq6-large-rotations --lsq6-large-rotations-parameters=mousebrain --maget-atlas-library=/execute/test-data/ex-vivo-atlases --lsq12-protocol=/execute/test-data/default_linear_MAGeT_prot.csv --maget-nlin-protocol=/execute/test-data/default_nlin_MAGeT_minctracc_prot.csv --nlin-protocol=/execute/test-data/default_nlin_MAGeT_minctracc_prot.csv --maget-masking-nlin-protocol=/execute/test-data/default_nlin_MAGeT_minctracc_prot.csv --proc 2 --files /execute/test-data/dti-images/DW_Recon.15_dti.mnc /execute/test-data/dti-images/DW_Recon.5_dti.mnc /execute/test-data/dti-images/DW_Recon.6_dti.mnc /execute/test-data/dti-images/DW_Recon.7_dti.mnc /execute/test-data/dti-images/DW_Recon.8_dti.mnc /execute/test-data/dti-images/DW_Recon.9_dti.mnc
Checking the whole run up to that point, a count of the completion messages suggests stage 2 didn't complete. This seems to have caused dependent stages like 3 and beyond not to run at all, while others presumably not dependent on it have run.
Also, here is the pipeline state at the time of the stall:
STALLED PIPELINE STATE:
Total number of stages in the pipeline: 1761
Number of stages already processed: 344
Number of active clients: 2
Number of clients waiting in the queue: 0
Currently running stages (3):
0 mincblur -clobber -no_apodize -fwhm 0.28 /execute/test-data/dti-images/DW_Recon.15_dti.mnc mbm_new_images_processed/DW_Recon.15_dti/tmp/DW_Recon.15_dti_fwhm0.28 -gradient
3 mincblur -clobber -no_apodize -fwhm 0.28 /execute/test-data/dti-images/DW_Recon.5_dti.mnc mbm_new_images_processed/DW_Recon.5_dti/tmp/DW_Recon.5_dti_fwhm0.28 -gradient
110 mincblur -clobber -no_apodize -fwhm 0.25 /execute/test-data/ex-vivo-atlases/Dorr_2008_on_VVW_Sert_v1_average.mnc mbm_new_images_atlases/Dorr_2008_on_VVW_Sert_v1_average/tmp/Dorr_2008_on_VVW_Sert_v1_average_fwhm0.25 -gradient
Number of runnable stages: 0
Number of failed stages: 0
Number of failed/lost/dead executors: 0
Memory requirement of runnable stages: []
Memory available in registered clients: [3.5, 3.5]
Hi,
If you're running locally, why is your config using "queue-type sge"? You need to run MBM.py with "--local".
I was under the impression that the "local" option in the config set the run mode to local. In this particular case to evaluate I am just using the sample configuration. Should I be using a different queue type in this case?
Many thanks for your assistance so far.
All these settings:
queue-type sge
queue-name all.q
proc 1
mem 10
time-to-accept-jobs 180
refer to a cluster configuration and should be removed for a local run.
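For reference, a stripped-down config for a local run might keep only the resource settings (a sketch using only the option keys shown in this thread; check your pydpiper version's docs for exact names), with --local passed on the command line instead:

```
# minimal local config (hypothetical): cluster/queue settings removed
proc 2
mem 10
```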
This:
[2019-10-28 14:38:04.074,pydpiper.execution.pipeline,WARNING] Exceeded allowed idle time ... bye!
may be a bug triggered by time-to-accept-jobs 180, which shouldn't apply in the local case; I'll leave that for the developers to determine.
Looking at the code, is_time_to_drain uses time-to-accept-jobs, and seems to be checked here during the main loop: https://github.com/Mouse-Imaging-Centre/pydpiper/blob/master/pydpiper/execution/pipeline_executor.py#L542 - once time-to-accept-jobs has been exceeded it will always return True, causing the loop to continue running (assuming I followed it correctly). But this check happens before grabbing the next command, so after time-to-accept-jobs is exceeded the executor spins without ever fetching work and never exits.
There is also another check before that, against max-idle-time, which actually kills the executor if it has been exceeded.
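To illustrate the behavior being described, here is a hypothetical sketch of the main-loop logic; names loosely follow pipeline_executor.py, but this is NOT the actual pydpiper source:

```python
import time

class ExecutorSketch:
    """Hypothetical model of the executor main loop described above."""

    def __init__(self, time_to_accept_jobs, max_idle_time):
        self.start = time.time()
        self.last_job = time.time()
        self.time_to_accept_jobs = time_to_accept_jobs  # seconds
        self.max_idle_time = max_idle_time              # seconds

    def is_time_to_drain(self):
        # Once the executor's lifetime exceeds time-to-accept-jobs,
        # this stays True forever.
        return time.time() - self.start > self.time_to_accept_jobs

    def step(self):
        """One main-loop iteration; returns False when the executor should exit."""
        if time.time() - self.last_job > self.max_idle_time:
            return False  # the max-idle-time check: executor actually dies here
        if self.is_time_to_drain():
            # Draining: the loop keeps spinning but never fetches another
            # command, so (absent the idle check) the executor never exits.
            return True
        # ...otherwise fetch and run the next stage...
        self.last_job = time.time()
        return True
```

Under this reading, only the max-idle-time branch ever terminates the loop once draining has begun, which would match the "Exceeded allowed idle time ... bye!" message in the log.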
I'll see if removing time-to-accept-jobs and the like from my config improves the situation, though I'm still a little suspicious that some other variable is at play here, as sometimes the stall occurs well before 180 minutes have passed.
Thanks again.
Note that the 180 is seconds, not minutes.
We rarely use this configuration so it's certainly possible some other issue is at play.
Hi @jamesu,
I've had this problem when I've been launching MBM.py myself (for example the pipeline misses some early mincblur stages). For local runs, I just have a config file containing only [string] and specify --mem and --num-executors in the MBM.py command line call.
As Pydpiper automatically saves which stage you are at, I've just been relaunching the command, which fixes the problem (I'm sorry that it's tedious).
Kind Regards, Kyle Drover
Ah, good to hear someone else is having this issue. Indeed, you can keep restarting the pipeline, but it seems a little silly that we have to work around the problem in a tool that should be doing the job for us.
For reference, on my end removing time-to-accept-jobs didn't seem to fix the problem, so I can only assume some edge case is being triggered here. I'll see if I can get more info on the state of the processes on a future run.
How long are your filenames?
I think that might have an effect on whether stages get missed or not, but I haven't tested it properly (I stumbled upon this while trying to correct an incorrect filename being written).
@Dorky-Lever no longer than the filenames used in the test brain files in this case
Hi James,
See if specifying more RAM (if possible) and reducing the number of executors makes it stall less; I've noticed that reduces some of the stalling.
So when I'm doing a local run with MBM.py:
I have h_vmem set to 60gb in my shell and --mem set to 300 (I'm running the code on a subnode with 1000gb of memory).
MBM.py --pipeline-name Hen_shrt --mem 300 --num-executors 40 \
  --init-model /home/kyle_arkell/Henrik/TCP-1700003F12Rik/initial-model/E15.5_mouse_model_final_image.mnc \
  --registration-method minctracc --registration-strategy pairwise_and_build_model \
  --nlin-protocol /home/kyle_arkell/embryo_minctracc_protocol.csv \
  --no-run-maget --maget-no-mask --no-common-space-registration --verbose --config-file /home/kyle_arkell/arkell.cfg \
  --lsq12-protocol /home/kyle_arkell/Henrik/TCP-1700003F12Rik/MBM_lsq12_embryo_protocol.csv \
  --output-dir /home/kyle_arkell/Hen_files/ --files /home/kylearkell/Henrik/TCP-1700003F12Rik/m*mnc
I've been trying to run MBM.py as part of the test_MBM_and_MAGeT.py test script, and also separately with some other mouse brain images I have. What seems to happen is that the pipeline will randomly stall completely. Often I will see a set of MBM.py processes still running but doing nothing, even past the default 1-minute idle time. I then need to restart the pipeline to progress further. Sometimes this even happens right at the start of a run (after ~8 stages). The problem seems to get worse the more executors I use.
Is there any easy way to debug this, or are there better default config options I can try to mitigate the problem?
Usually I run with some variant of this (with csv files from test-data.tar.gz):
MBM.py --pipeline-name=MBM_latest_test --num-executors=3 --verbose --init-model=/execute/test-data/Pydpiper-init-model-basket-may-2014/basket_mouse_brain.mnc --config-file=/execute/test-data/sample.cfg --lsq6-large-rotations --no-run-maget --maget-no-mask --lsq12-protocol=/execute/test-data/default_linear_MAGeT_prot.csv --files <mnc file list>
But I've also noticed this issue with full runs (with MAGeT). Also my configuration is just the sample local test config, i.e.