Mouse-Imaging-Centre / pydpiper

Python code for flexible pipeline control

Filesystem-related scaling issues for large pipelines #461

Open gdevenyi opened 2 years ago

gdevenyi commented 2 years ago

We're currently trying to submit a MAGeT.py pipeline to Niagara for processing.

MAGeT.py ends up spinning for ~2h of CPU time doing something before Niagara kills it for misbehaving on a login node. No jobs ever get submitted, and no other work is done.

Run command

MAGeT.py --verbose --pipeline-name=ASYN-long-20220121 \
--subject-matter mousebrain \
--files inputs/*lsq6.mnc --config-file niagara-maget.cfg --queue-type slurm

Config

[Niagara]
queue-type=slurm
min-walltime=86400
max-walltime=86400
max-idle-time=3600
time-to-accept-jobs=1380
ppn=40
proc=40
mem=188
num-executors=50
greedy=True
subject-matter=mousebrain
lsq12-protocol=/project/m/mchakrav/quarantine/2019b/pydpiper/protocols/linear/Pydpiper_default_lsq12_protocol.csv
atlas-library=/home/m/mchakrav/tulste/scratch/maget-merged-long-202201/atlas/
masking-method=ANTS
registration-method=ANTS
masking-nlin-protocol=/project/m/mchakrav/quarantine/2019b/pydpiper/protocols/CIC/Pydpiper_mincANTS_SyN_0.1_Gauss_2_1_40_micron_MAGeT_one_level_MASKING.pl
nlin-protocol=/project/m/mchakrav/quarantine/2019b/pydpiper/protocols/CIC/Pydpiper_mincANTS_SyN_0.1_Gauss_2_1_40_micron_MAGeT_one_level.pl

The pipeline stages are generated:

-rw-r----- 1 gdevenyi mchakrav 237M 2022-01-24 13:19 ASYN-long-20220121_pipeline_stages.txt

However, the log never gets beyond

[2022-01-24 13:20:43.835,pydpiper.execution.pipeline,INFO] Starting pipeline daemon...

Before being killed.

bcdarwin commented 2 years ago

I am running some tests now so I'm not exactly sure what part of the code is responsible yet, but note that MAGeT scales (for total operations -- it's not as bad if you only consider registrations) at least like number of atlases × number of templates × number of subjects, so reducing the number of templates is probably the easiest way to bring down the overall cost. My guess is that the overall issue is some combination of redundant file accesses via pyminc and creation of the output directories, but there are some CPU-limited parts as well which could also be optimized.
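For a rough sense of scale, here is a back-of-the-envelope sketch (plain Python, not pydpiper code; the exact counts depend on the pipeline options) of how the operation count grows with the atlas, template, and subject counts:

# Illustrative only: rough operation counts for a MAGeT-style pipeline.
def registration_count(n_atlases, n_templates, n_subjects):
    # atlas -> template registrations plus template -> subject registrations
    return n_atlases * n_templates + n_templates * n_subjects

def label_propagation_count(n_atlases, n_templates, n_subjects):
    # one label resampling per (atlas, template, subject) triple --
    # the "atlases x templates x subjects" term mentioned above
    return n_atlases * n_templates * n_subjects

print(registration_count(5, 21, 200))       # 4305
print(label_propagation_count(5, 21, 200))  # 21000

Since every term above is proportional to the number of templates, shrinking the template library reduces both the registration and the label-propagation cost roughly linearly.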

bcdarwin commented 2 years ago

Indeed, the majority of time appears to be spent in the output_directories and create_directories utility functions.
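One way to confirm this (a sketch, not part of pydpiper) is to run the pipeline under cProfile and filter the stats for those two functions; the profile file name and the restriction pattern below are just examples:

# Collect a profile first, e.g.:
#   python -m cProfile -o maget.prof $(which MAGeT.py) --files inputs/*lsq6.mnc ...
# then inspect which functions dominate cumulative time:
import pstats

stats = pstats.Stats("maget.prof")
stats.sort_stats("cumulative").print_stats("output_directories|create_directories")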

bcdarwin commented 2 years ago

At some point I added --defer-directory-creation, which should help with the create_directories contribution but not output_directories -- the latter may be a case of os.path functions doing I/O ...
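Roughly, the idea is something like the sketch below. The function names mirror the utilities mentioned above, but the bodies are only an illustration (fold duplicates into a set, keep the os.path work to pure string manipulation, create each directory once), not pydpiper's actual implementation:

import os
from functools import lru_cache

@lru_cache(maxsize=None)
def parent_dir(path):
    # os.path.dirname is pure string manipulation -- no filesystem I/O
    return os.path.dirname(path)

def output_directories(output_files):
    # fold duplicates into a set instead of touching the filesystem per stage
    return {parent_dir(f) for f in output_files}

def create_directories(dirs):
    # exist_ok avoids a separate isdir()/exists() round trip per directory
    for d in dirs:
        os.makedirs(d, exist_ok=True)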

gdevenyi commented 2 years ago

--defer-directory-creation got the pipeline past this stage and on to job submission.

Job submission then failed with:

[2022-01-31 11:02:47.794,pydpiper.execution.pipeline,ERROR] Failed launching executors from the server.
Traceback (most recent call last):
  File "/project/m/mchakrav/quarantine/2019b/pydpiper/2.0.13/install/lib/python3.6/site-packages/pydpiper-2.0.13-py3.6.egg/pydpiper/execution/pipeline.py", line 825, in launchExecutorsFromServer
    mem_needed=memNeeded, uri_file=self.exec_options.urifile)
  File "/project/m/mchakrav/quarantine/2019b/pydpiper/2.0.13/install/lib/python3.6/site-packages/pydpiper-2.0.13-py3.6.egg/pydpiper/execution/pipeline.py", line 969, in launchPipelineExecutors
    pipelineExecutor.submitToQueue(number=number)
  File "/project/m/mchakrav/quarantine/2019b/pydpiper/2.0.13/install/lib/python3.6/site-packages/pydpiper-2.0.13-py3.6.egg/pydpiper/execution/pipeline_executor.py", line 440, in submitToQueue
    raise SubmitError({ 'return' : p.returncode, 'failed_command' : submit_cmd })
pydpiper.execution.pipeline_executor.SubmitError: {'return': 1, 'failed_command': ['qbatch', '--chunksize=1', '--cores=1', '--jobname=ASYN-long-20220121-executor-2022-01-31-at-11-02-47', '-b', 'slurm', '--walltime=23:59:59', '-']}
The terminal said (this should've been captured in the log, I think?):

sbatch error: Batch job submission failed: Pathname of a file, directory or other parameter too long

We're retrying with --csv-file
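In case it helps anyone else, a minimal sketch of building such a CSV instead of expanding inputs/*lsq6.mnc on the command line (the single "file" column is an assumption; check MAGeT.py --help for the exact format --csv-file expects):

import csv
import glob

# Write the input list to a CSV so the (potentially very long) file list is
# not passed on the command line.  Column name "file" is assumed, not verified.
with open("inputs.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file"])
    for path in sorted(glob.glob("inputs/*lsq6.mnc")):
        writer.writerow([path])

The run command then uses --csv-file inputs.csv in place of --files inputs/*lsq6.mnc.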