adjtomo / seisflows

An automated workflow tool for full waveform inversion and adjoint tomography
http://seisflows.readthedocs.org
BSD 2-Clause "Simplified" License
179 stars 122 forks

New system sub-class that prioritizes long queue times and large jobs #118

Open bch0w opened 2 years ago

bch0w commented 2 years ago

Following discussions with the Princeton group, it would be great to create a system class that prioritizes long queue times and large jobs over arrayed jobs. SeisFlows currently submits N array jobs (where N is the number of events used) to the system, and each of these jobs must be scheduled separately. If queue times on the system are long, the cumulative wait time can be high.

One approach to fix this would be to submit one large job where each of the N tasks is doled out on the compute node itself (as opposed to distributing jobs as arrays from the master job). This could be contained within a separate 'qcluster' (q for queue) system module which has some internal logic to dole out these tasks after job submission, perhaps taking advantage of asyncio or a ThreadPoolExecutor from concurrent.futures.
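
As a rough illustration of the idea above, here is a minimal sketch of doling out N tasks from inside a single submitted job using a ThreadPoolExecutor. The function names `run_task` and `run_all` are hypothetical and not part of SeisFlows, and the subprocess command is only a placeholder for whatever solver call a real 'qcluster' module would launch.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_task(task_id):
    """Run one event's task as a subprocess; return (task_id, return code).

    Hypothetical helper: the command below is a stand-in for the real
    solver call that a system module would make for each event.
    """
    cmd = [sys.executable, "-c", f"print('event {task_id:03d} done')"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return task_id, result.returncode


def run_all(ntask, max_workers):
    """Dole out `ntask` tasks within one job, `max_workers` at a time."""
    codes = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(run_task, i) for i in range(ntask)]
        # Collect results in completion order, not submission order
        for future in as_completed(futures):
            task_id, code = future.result()
            codes[task_id] = code

    # Fail loudly if any task returned a non-zero exit code
    failed = sorted(i for i, c in codes.items() if c != 0)
    if failed:
        raise RuntimeError(f"tasks failed: {failed}")
    return codes


if __name__ == "__main__":
    print(run_all(ntask=4, max_workers=2))
```

Threads are sufficient here because each worker only waits on an external process, so the Python side does no heavy computation. Setting `max_workers` to the number of events that fit in the allocation at once would bound how many tasks run concurrently.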

bch0w commented 1 year ago

The NUMBER_OF_SIMULTANEOUS_RUNS parameter is available in all versions of SPECFEM and would be a useful target for this issue. It allows a user to submit one large job for N events, with each event running on P processors. Rather than submitting N array jobs, each running on P cores, the user submits one job on NxP cores, and SPECFEM distributes the work internally.

I need to test this capability and see what the finer details are, but I think SeisFlows can take advantage of this capability to submit large, long queue time, high core-number jobs.
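
To make the NxP bookkeeping concrete, here is a small sketch of the resource layout for one large job. The function `simultaneous_run_layout` is hypothetical. The `run0001`, `run0002`, ... directory names follow my understanding of the SPECFEM documentation, and the assignment of contiguous blocks of P ranks to each event is an assumption for illustration that still needs to be checked against the code.

```python
def simultaneous_run_layout(n_events, nproc):
    """Return (total MPI ranks, {run directory: ranks}) for one large job.

    Assumptions (not verified against SPECFEM): run directories are named
    run0001, run0002, ... and each event gets a contiguous block of ranks.
    """
    if n_events < 1 or nproc < 1:
        raise ValueError("n_events and nproc must both be >= 1")

    # One job requests N x P cores in total
    total = n_events * nproc

    # Event i (0-based) is assumed to own ranks [i * P, (i + 1) * P)
    layout = {
        f"run{i + 1:04d}": range(i * nproc, (i + 1) * nproc)
        for i in range(n_events)
    }
    return total, layout


if __name__ == "__main__":
    total, layout = simultaneous_run_layout(n_events=3, nproc=4)
    print(f"request {total} cores in a single job")
    for run_dir, ranks in layout.items():
        print(run_dir, list(ranks))
```

The total is what the system module would request from the scheduler in place of N separate P-core array jobs.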

bch0w commented 1 year ago

Notes on the NUMBER_OF_SIMULTANEOUS_RUNS parameter (developing with the global code, SPECFEM3D_GLOBE)

https://specfem3d-globe.readthedocs.io/en/latest/04_running_the_solver/#note-on-the-simultaneous-simulation-of-several-earthquakes

Outline of what will need to be changed: