Need option to serialize SIM and FIT jobs to avoid batch queue overflow

RickKessler commented 3 years ago

If we allocate 80 cores per sim job and have 10 GENVERSIONs, the resulting 800 jobs overflow the 500-job maximum on Midway, so 300 of the 800 jobs are never launched. It would be really useful to have a Pippin flag that runs each SIM job serially to avoid the overflow problem.

djbrout commented 3 years ago

This is supposed to be avoided by default, and I know Sam had some issues dealing with this over the past year. I will take a crack at it but can’t guarantee I’ll be able to fix it quickly. In the meantime, can you just request 50 cores per sim job? I think that is technically more efficient and will finish faster anyway.
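
The arithmetic in this exchange can be sketched as follows. This is a minimal illustration, not Pippin code: `plan_waves` is a hypothetical helper, and it assumes every GENVERSION uses the same number of batch jobs and that the queue limit counts jobs across all GENVERSIONs. It groups GENVERSIONs into waves that each fit under the limit, which is one way the requested serialization could work.

```python
def plan_waves(n_versions, jobs_per_version, max_jobs):
    """Group GENVERSION indices into waves that each fit under the queue limit.

    Hypothetical helper for illustration only; not part of Pippin or SNANA.
    """
    if jobs_per_version > max_jobs:
        raise ValueError("a single GENVERSION already exceeds the queue limit")
    per_wave = max_jobs // jobs_per_version  # GENVERSIONs that fit at once
    return [
        list(range(start, min(start + per_wave, n_versions)))
        for start in range(0, n_versions, per_wave)
    ]


MAX_JOBS = 500      # Midway limit quoted in this thread
N_VERSIONS = 10     # number of GENVERSIONs quoted in this thread

for cores in (80, 50):
    total = N_VERSIONS * cores
    never_launched = max(0, total - MAX_JOBS)
    waves = plan_waves(N_VERSIONS, cores, MAX_JOBS)
    print(f"{cores} cores: {total} jobs, {never_launched} over the limit, "
          f"{len(waves)} wave(s) if serialized")
```

With 80 cores per sim job, the 10 GENVERSIONs split into two waves (six, then four); with 50 cores, all 500 jobs fit in a single wave, which is why the workaround avoids the overflow.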

Samreay commented 3 years ago

Just chiming in: I feel the real issue here is the SNANA limitation of 1 CPU = 1 job. With the new batch submission in Python, you could put in a simple MPI job so that it behaves exactly as it does right now, just with one layer of abstraction. (An example of using Open MPI in Python to distribute tasks across single CPU cores can be found in my code here: https://github.com/Samreay/Barry/blob/master/barry/precompute_mpi.py.) A better idea would be to use MPI to divvy up each of the tasks (generating 1000 light curves at a time, or similar, until the jobs are done), so that we use all the cores instead of having some cores (like those given the Ia sample) finish in a third of the time taken by the CPUs assigned the CC sample.
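
To illustrate the load-balancing argument, here is a small scheduling simulation. It is a sketch under stated assumptions, not MPI code and not anything from SNANA or Pippin: the function names and the cost numbers are hypothetical, costs are in arbitrary time units, and the "Ia-like" tasks are simply assumed to cost a third of the "CC-like" ones, as described above. It compares one whole task per core against handing small chunks to whichever worker is free next.

```python
import heapq


def static_makespan(task_costs):
    """One whole task per core: the run ends when the slowest core finishes."""
    return max(task_costs)


def chunked_makespan(task_costs, n_workers, chunk_cost):
    """Split every task into chunks and give each chunk to the next free worker."""
    chunks = []
    for cost in task_costs:
        full, rest = divmod(cost, chunk_cost)
        chunks.extend([chunk_cost] * full)
        if rest:
            chunks.append(rest)

    # Min-heap of the times at which each worker next becomes free.
    free_at = [0] * n_workers
    heapq.heapify(free_at)
    for chunk in chunks:
        start = heapq.heappop(free_at)         # earliest available worker
        heapq.heappush(free_at, start + chunk)
    return max(free_at)


# Hypothetical costs: two Ia-like tasks at a third the cost of two CC-like tasks.
costs = [10, 10, 30, 30]
print("static :", static_makespan(costs))
print("chunked:", chunked_makespan(costs, n_workers=4, chunk_cost=1))
```

In this toy case the static layout takes 30 time units because two cores sit idle after 10, while the chunked layout spreads the same 80 units of work evenly and finishes in 20.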

dessn / Pippin

Need option to serialize SIM and FIT jobs to avoid batch queue overflow #26