bch0w opened 2 years ago
The `NUMBER_OF_SIMULTANEOUS_RUNS` parameter is available in all versions of SPECFEM and would be a useful target for this issue. It allows a user to submit one large job for N events, with each event running on P processors. Rather than submitting N array jobs, each running on P cores, the user submits one job on NxP cores, and SPECFEM distributes the work internally.
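For context, the relevant Par_file entries look roughly like the following (a sketch assuming the SPECFEM3D_GLOBE parameter names `NUMBER_OF_SIMULTANEOUS_RUNS` and `BROADCAST_SAME_MESH_AND_MODEL`; values are illustrative, check your version's Par_file):

```
# run N=4 events simultaneously, each on the usual P processors
NUMBER_OF_SIMULTANEOUS_RUNS     = 4
# share the mesh/model from run0001 with the other simultaneous runs
BROADCAST_SAME_MESH_AND_MODEL   = .true.
```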
I need to test this capability and work out the finer details, but I think SeisFlows can take advantage of it to submit one large, high-core-count job rather than many smaller ones, paying the (potentially long) queue wait only once.
Notes on the `NUMBER_OF_SIMULTANEOUS_RUNS` parameter (developing with the Global code):

- `run????` directories other than `run0001` do not require a `Par_file` (when the `broadcast_mesh_and_model` parameter is enabled); only `run0001` requires the `Par_file` and the actual mesh and model files.
- This sharing does not extend to the `CMTSOLUTION` or `STATIONS` files.

Outline of what will need to be changed:
Following discussions with the Princeton group, it would be great to create a system class that favors single large jobs over arrayed jobs on clusters with long queue times. SeisFlows currently submits N array jobs (where N is the number of events used) on the system, which can take an appreciable amount of time since each job must be scheduled separately; if queue times on the system are long, total wait times may be high.
One approach would be to submit a single large job in which each of the N tasks is doled out on the compute nodes themselves (as opposed to distributing jobs as arrays from the master job). This could be contained within a separate 'qcluster' (q for queue) system module with internal logic to farm out these tasks after job submission, perhaps taking advantage of asyncio or a ThreadPoolExecutor from concurrent.futures.
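As a rough sketch of that internal farming-out, assuming each task can be expressed as a shell command run inside the existing allocation (e.g. a per-event `srun`/`mpirun` call; the commands below are placeholders, not SeisFlows API):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_task(cmd: str) -> int:
    """Run a single task (e.g. one event's solver) and return its exit code."""
    return subprocess.run(cmd, shell=True).returncode


def dole_out(commands: list[str], max_concurrent: int) -> list[int]:
    """Distribute N task commands across workers inside one large job.

    Exit codes are returned in the same order as `commands`, so failed
    tasks can be matched back to their events.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(run_task, commands))


# Hypothetical usage: inside the one big job, each command would target
# its own run???? directory; `echo` stands in for the real solver call.
exit_codes = dole_out([f"echo event {i:04d}" for i in range(1, 5)],
                      max_concurrent=2)
```

Because the solver calls are I/O-bound subprocesses rather than Python-level computation, threads (or asyncio subprocesses) are sufficient here and a process pool is not needed.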