CoBrALab / antsRegistration-MAGeT

A reimplementation of MAGeTbrain using only ANTs tools.

Huge number of job submissions #95

Closed cfhammill closed 6 years ago

cfhammill commented 6 years ago

Hi Gabe, I'm trying to run antsRegistration MAGeT on the SickKids cluster, and I hit the 20,000 job hard limit on our system. Is this expected behaviour? It's ~1000 brains.

gdevenyi commented 6 years ago

Hi, this is expected. Unlike classic MAGeTbrain, I designed this pipeline to produce the smallest possible pieces of independent work. This means independent atlas-template jobs named according to template, and template-subject jobs split according to subject. Given 21 templates and 1000 subjects, there are 21,000 jobs at the template-subject stage.

The pipeline honours the qbatch environment variables on unknown clusters (that is, anything other than Compute Canada, where I hard-coded the settings).

In particular, QBATCH_CHUNKSIZE will pack the work into a smaller number of longer-running jobs, up to the splitting level of the stage (21 templates per subject job). You can set it higher, but a chunk will never hold more than that many commands.
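
The job-count arithmetic above can be sketched as follows. This is only an illustration under stated assumptions, not code from the pipeline: the function name `template_subject_jobs` is hypothetical, a chunk size of 1 is assumed to mean one command per job, and chunks are assumed never to span more than one subject (the "splitting level" described above).

```python
import math


def template_subject_jobs(n_subjects: int, n_templates: int, chunksize: int = 1) -> int:
    """Estimate the number of jobs submitted at the template-subject stage.

    Assumption: commands are chunked within a single subject only, so a
    chunk can hold at most n_templates commands regardless of chunksize.
    """
    if chunksize < 1:
        raise ValueError("chunksize must be at least 1")
    # Chunks larger than the per-subject command count have no extra effect.
    effective_chunk = min(chunksize, n_templates)
    jobs_per_subject = math.ceil(n_templates / effective_chunk)
    return n_subjects * jobs_per_subject


# 1000 subjects and 21 templates, as in this thread.
for chunksize in (1, 7, 21, 50):
    print(chunksize, template_subject_jobs(1000, 21, chunksize))
```

Under these assumptions, a chunk size of 1 gives the 21,000 jobs that hit the limit, while a chunk size of 21 or more gives 1000 jobs, one per subject.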

cfhammill commented 6 years ago

Thanks, that makes sense. I'm wondering now how I was able to use MAGeTBrain for such large runs. The new version is requesting 46 hours and ~27G per job, which will take an eternity. Is it just the speed difference between ANTs and minctracc?

gdevenyi commented 6 years ago

Be careful of the interaction between PPJ, CORES and CHUNKSIZE. They respectively define the number of CPUs per job, the number of commands to run in parallel per job, and the number of commands to pack into a job.
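
A rough sketch of how those three settings interact, under simplifying assumptions that are not taken from the pipeline or from qbatch itself: every command takes the same time, commands run in waves of CORES at a time, and each running command gets an equal share of the job's CPUs. The function name `job_shape` and the numbers in the example are hypothetical.

```python
import math


def job_shape(ppj: int, cores: int, chunksize: int, hours_per_command: float) -> dict:
    """Approximate the shape of one job from PPJ, CORES and CHUNKSIZE.

    Assumptions (illustrative only): uniform command run time, commands
    executed in waves of `cores`, CPUs split evenly between running commands.
    """
    if min(ppj, cores, chunksize) < 1:
        raise ValueError("ppj, cores and chunksize must all be at least 1")
    # Number of sequential waves needed to work through the chunk.
    waves = math.ceil(chunksize / cores)
    # CPUs available to each command that is running at the same time.
    cpus_per_command = max(1, ppj // cores)
    return {
        "waves": waves,
        "cpus_per_command": cpus_per_command,
        "walltime_hours": waves * hours_per_command,
    }


# Hypothetical settings: 8 CPUs per job, 2 commands in parallel,
# 21 commands packed per job, 2 hours per command.
print(job_shape(ppj=8, cores=2, chunksize=21, hours_per_command=2.0))
```

The point of the sketch is that raising CHUNKSIZE without raising CORES lengthens each job, while raising CORES without raising PPJ leaves fewer CPUs for each command.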

Those estimates do seem a bit high. Are you running the latest release (or HEAD)? There may have been some... math errors in earlier time/memory estimations...

cfhammill commented 6 years ago

So far I haven't touched those variables; this is just a naive run. I'm not sure whether Ben has tweaked our qbatch config, but I doubt it.

And yes, I cloned yesterday. I'm using Python 3.6 with some mild hacking to prevent the system python2 from being used (I symlinked python3 to maget/bin/python).

gdevenyi commented 6 years ago

Are there some python version issues? If so please open another issue :)

As for the job size and count bits, you can see exactly how things are estimated at https://github.com/CobraLab/antsRegistration-MAGeT/blob/master/bin/stages.sh#L48-L50

cfhammill commented 6 years ago

I'm not sure the Python issues are general enough to warrant an issue, but I'll make one anyway; feel free to close it if it's too site-specific. Looking through the ABIDE files, it looks like there are ~100 subjects with files 2-5x larger (in number of voxels). Maybe these are throwing off the estimates.

gdevenyi commented 6 years ago

Ah, that could definitely be an issue :)

The cutneck stage is really important: it limits the number of voxels, which improves processing times.

cfhammill commented 6 years ago

Estimates are down to 8G and 21h :ok_hand:

gdevenyi commented 6 years ago

Great.

Will try and update README to be more explicit.