ReproNim / reproman

ReproMan (AKA NICEMAN, AKA ReproNim TRD3)
https://reproman.readthedocs.io

Separate simultaneous reproman jobs on same dataset? #580

Open jbwexler opened 3 years ago

jbwexler commented 3 years ago

Is there a recommended way to run two (or more) separate reproman jobs simultaneously on the same dataset? From a small bit of testing, it seems a second job can only be started while the first one is still waiting in the queue, which gives you a chance to 'datalad save' and end up with a clean dataset. But that isn't always possible. I ask because TACC launcher jobs are billed as if all the sub-jobs ran for as long as the longest sub-job, so we try not to have too many sub-jobs (usually a max of 10) in case one runs really long. So if a dataset has more than 10 subjects, I would like to be able to run multiple reproman jobs simultaneously...
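
For context, a rough sketch of the sequencing I seem to be stuck with today (resource name, submitter, orchestrator, and commands are placeholders, not a tested recipe):

```sh
# Sketch only: a second top-level job can apparently be submitted only after
# the first one is queued and `datalad save` has left the working tree clean.
# Resource, submitter, and paths are placeholders.
reproman run --resource my-cluster --submitter slurm \
    --orchestrator datalad-pair-run \
    --input 'sub-01' ./pipeline.sh sub-01
datalad save -m "Record state after submitting the first job"
reproman run --resource my-cluster --submitter slurm \
    --orchestrator datalad-pair-run \
    --input 'sub-02' ./pipeline.sh sub-02
```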

kyleam commented 3 years ago

There's no support for concurrent top-level jobs for a given working tree. I think moving away from that would be a substantial rework of the current logic. One possibility would be to add an additional level of clones (probably with datalad's ephemeral clone option, --reckless ephemeral).
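
For reference, a minimal sketch of what such an extra clone could look like, assuming a reasonably recent DataLad (paths are placeholders):

```sh
# Minimal sketch, assuming a reasonably recent DataLad: an "ephemeral" clone
# shares the annex of its origin rather than copying it, so a throwaway
# per-job clone stays cheap. Paths are placeholders.
datalad clone --reckless ephemeral /path/to/dataset /tmp/reproman-job-clone
```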

A potential workaround that might be worth trying is pointing the working_directory job parameter to a different remote directory for each job. Depending on how separated the input data is for each split job, that could be pretty lightweight. However, I think at least one reason that won't work at the moment is that, with the automatic naming scheme used for remotes, both jobs would try to use the resource name as the remote name. I don't think addressing that would be too involved, though.
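
If the remote-naming issue were addressed, splitting along those lines might look roughly like this (a sketch only, not something that works today; resource name, submitter, and directories are placeholders):

```sh
# Rough sketch (not currently workable, per the remote-name collision noted
# above): give each top-level job its own working_directory on the resource,
# so the two runs don't share a remote checkout. Names and paths are
# placeholders.
reproman run --resource my-cluster --submitter slurm \
    --orchestrator datalad-pair-run \
    --jp working_directory=/scratch/me/split-a \
    --input 'sub-0[1-5]' ./pipeline.sh split-a

reproman run --resource my-cluster --submitter slurm \
    --orchestrator datalad-pair-run \
    --jp working_directory=/scratch/me/split-b \
    --input 'sub-0[6-9]' ./pipeline.sh split-b
```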

jbwexler commented 3 years ago

Thanks! To clarify, would the solution involving changing the working_directory also require ephemeral clones, or are those two separate solutions? I think generally the input data is pretty separate between jobs; the only overlap is some small metadata files. I'm happy to help address the automatic naming issue, unless you think it would just be easier to do yourself, whichever you prefer.

kyleam commented 3 years ago

> would the solution involving changing the working_directory also require ephemeral clones, or are those two separate solutions?

Separate (or at least I hope).

> I'm happy to help address the automatic naming issue [...]

Thanks, that'd be great. The logic is in PrepareRemoteDataladMixin.prepare_remote, though I haven't given any thought to how specifically to handle this. I guess it would probably involve the ability to specify the remote name through a job parameter.
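
If that route were taken, usage might look roughly like this (the remote_name parameter below is purely made up for illustration; nothing of the sort exists yet):

```sh
# Purely hypothetical: a job parameter (name invented here) that overrides the
# automatically chosen remote name, so jobs with different working directories
# on the same resource no longer collide on the remote name.
reproman run --resource my-cluster --submitter slurm \
    --orchestrator datalad-pair-run \
    --jp working_directory=/scratch/me/split-a \
    --jp remote_name=my-cluster-split-a \
    ./pipeline.sh sub-01
```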

jbwexler commented 3 years ago

Sounds good, I'll probably take a look at it next week.

jbwexler commented 3 years ago

I'm finally coming back to this. One question: I'm currently using datalad-no-remote as an orchestrator - will the proposed solution be compatible with this?

yarikoptic commented 3 years ago

Sorry, I am coming late to this discussion... I do not yet have a good idea of what would happen -- it might just be easier to give it a shot and see how it fails (or succeeds) ;) Note that maybe in your case you don't need anything "datalad": you could just point working_directory to the desired output location, run them all in parallel on different "sub-arrays" of jobs, and do one single final datalad save?
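
A very rough sketch of that idea (orchestrator, resource, and paths are guesses/placeholders): run each "sub-array" without datalad orchestration into its own output directory, and only record the results with datalad once everything is done.

```sh
# Very rough sketch; orchestrator, resource, and paths are placeholders. Each
# "sub-array" gets its own remote working directory and no datalad
# bookkeeping during the runs.
for batch in a b c; do
    reproman run --resource my-cluster --submitter slurm --orchestrator plain \
        --jp working_directory=/scratch/me/batch-$batch \
        ./pipeline.sh "batches/$batch"
done
# After all sub-arrays have finished and results have been brought back into
# the dataset, record everything in one go:
datalad save -m "Results from all parallel sub-arrays"
```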

I wonder, though -- instead of running multiple niceman instances in parallel (which, like any "parallel anything", would add extra complexity to avoid ways to screw things up), maybe it would be feasible to provide some generic reproman batching mechanism, where all jobs would be batched into "sub-arrays" and reproman would schedule those "sub-arrays" individually, "finishing up" only after all of them are complete. Not yet sure whether that would be easier/better -- just an idea that may have been discussed already and not worth talking about again?

jbwexler commented 3 years ago

> maybe in your case you don't need anything "datalad"

Do you mean we don't need anything "reproman"? I think one of the main reasons we're using reproman is that it solves a lot of the issues with running things in parallel with datalad. If that's not the case, then perhaps we don't need reproman?

> maybe it would be feasible to provide some generic reproman batching mechanism

This would be great, though I wonder how difficult it would be to implement. This seems like a feature that could be built once we solve the issue we're discussing (running multiple reproman commands simultaneously on one dataset). Or do you think your idea would make solving this issue easier?