emews / EQ-SQL


Formalize Swift-T dual ME / job submission? #38

Open ncollier opened 1 year ago

ncollier commented 1 year ago

Being able to submit an ME and a worker pool in a single Swift/T job submission, where the submission script launches two srun calls in the background (with &) and then waits, has been useful for the malaria model on Midway. Do we want to formalize this?

jozik commented 1 year ago

Great idea. With hybrid resources, could the srun calls be differently configured too?

j-woz commented 1 year ago

Is this a matter of injecting something in the submit script? Could be a good use of the Swift/T/PSI/J integration.

ncollier commented 1 year ago

> Is this a matter of injecting something in the submit script? Could be a good use of the Swift/T/PSI/J integration.

Yes, I think so. For the malaria model on Midway3 that I mentioned above, I made alternate submission scripts that ultimately called srun twice in a single sbatch file, a bit like this: https://www.hlrn.de/doc/display/PUB/Multiple+concurrent+programs+on+a+single+node. The difference is that I allocated multiple nodes and a large PPN for the Swift code that did the model runs, and a single node with a PPN of 1 for the Python ME / analysis code.
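For anyone picking this up, here is a rough sketch of what "formalizing" the two-srun pattern could look like: a small Python helper that generates the sbatch text. It is not the script used on Midway3. The function names, node counts, and command names (`run_worker_pool.sh`, `me.py`) are placeholders I made up, and site-specific options such as partition, memory, or account are left out.

```python
import shlex


def srun_line(nodes, ppn, command):
    """Build one backgrounded srun step; the trailing '&' lets steps run concurrently."""
    args = [
        "srun",
        f"--nodes={nodes}",
        f"--ntasks={nodes * ppn}",
        f"--ntasks-per-node={ppn}",
    ]
    return " ".join(args) + " " + " ".join(shlex.quote(c) for c in command) + " &"


def dual_submit_script(pool_nodes, pool_ppn, pool_cmd, me_cmd):
    """Return sbatch script text: a worker pool step plus a one-node, one-process ME step."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --nodes={pool_nodes + 1}",  # pool nodes plus one node for the ME
        "",
        srun_line(pool_nodes, pool_ppn, pool_cmd),  # Swift worker pool (hypothetical command)
        srun_line(1, 1, me_cmd),  # Python ME / analysis (hypothetical command)
        "wait",  # hold the allocation until both background steps exit
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(dual_submit_script(4, 32, ["./run_worker_pool.sh"], ["python3", "me.py"]))
```

Generating the text from parameters would also make it straightforward to configure the two srun calls differently, which is what the hybrid-resources question earlier in the thread is asking about.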

The issue I was trying to solve was that the model runs produce very large SQLite databases containing the results. These are large enough that we can't keep hundreds of them around on the Midway3 file systems without blowing past the PI's disk quota. Consequently, we produce the summarized statistics from them online as part of the workflow, and then delete the SQLite file. However, the analysis code requires a large amount of memory to read in the entire database. So, we wanted to run the analysis code on its own node, with an appropriate configuration that was different from that of the Swift work.
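To make the summarize-then-delete step concrete, here is a minimal sketch using Python's standard `sqlite3` module. It is not the actual analysis code: the table name `results` and column `value` are hypothetical, and it aggregates inside SQLite rather than reading the entire database into memory as the real analysis does.

```python
import os
import sqlite3


def summarize_and_delete(db_path, table="results", column="value"):
    """Compute summary statistics for one run's database, then remove the file."""
    # Table and column names are hypothetical and are interpolated directly,
    # so they must come from trusted code, not from user input.
    query = (
        f"SELECT COUNT({column}), AVG({column}), MIN({column}), MAX({column}) "
        f"FROM {table}"
    )
    conn = sqlite3.connect(db_path)
    try:
        count, mean, lo, hi = conn.execute(query).fetchone()
    finally:
        conn.close()
    os.remove(db_path)  # only reached if the query succeeded
    return {"count": count, "mean": mean, "min": lo, "max": hi}
```

Deleting only after the query succeeds means a failed analysis leaves the database in place for a retry, at the cost of temporarily holding on to the disk space.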

This does use an earlier version of the EQ/SQL code to communicate between the Swift and Python ME code. Running them in the same job is convenient: the Python code doesn't need to wait for the worker pool to start in the queue, etc.