TIGRLab / air-tigrs

Orchestration tools for TIGRLab's data management infrastructure

[ENH] SlurmOperator base-class #18

Open jerdra opened 2 years ago

jerdra commented 2 years ago

A SlurmOperator base-class would be essential for our more intensive computing jobs (i.e. bids-apps). However, there are a couple of things we'd need to manage when working with this class:

  1. Monitoring the submitted job to determine task success/failure. This could be done with an Operator that just submits the job and returns success once submission to Slurm succeeds.
  2. A downstream sensor would then monitor task progress, using PySlurm to communicate with the scheduler. We'd probably want to use Airflow SmartSensors here to avoid tying up a worker.
  3. This could also be a mix-in class so that we can generalize Slurm submission to any other operators we build out.
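A minimal sketch of the submit-then-monitor split described above, using `sbatch` output parsing and Slurm job states rather than PySlurm; all function names, the partition default, and the exact state sets are illustrative assumptions, not existing air-tigrs code:

```python
import re
import subprocess


def submit_job(script_path, partition="general"):
    """Operator-side logic (hypothetical): submit a batch script with
    sbatch and return the Slurm job id for a downstream sensor to poll.
    Assumes a standard sbatch installation on the worker."""
    out = subprocess.run(
        ["sbatch", "--partition", partition, script_path],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_job_id(out)


def parse_job_id(sbatch_output):
    """Extract the job id from sbatch's 'Submitted batch job <id>' line."""
    match = re.search(r"Submitted batch job (\d+)", sbatch_output)
    if match is None:
        raise ValueError(f"Unexpected sbatch output: {sbatch_output!r}")
    return int(match.group(1))


# Terminal Slurm states the sensor would map to task outcomes
_SUCCESS = {"COMPLETED"}
_FAILURE = {"FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL", "OUT_OF_MEMORY"}


def poke_state(state):
    """Sensor-side 'poke' logic (hypothetical): True means the job
    succeeded, an exception means it failed, False means keep waiting."""
    if state in _SUCCESS:
        return True
    if state in _FAILURE:
        raise RuntimeError(f"Slurm job ended in state {state}")
    return False
```

Splitting submission and monitoring this way keeps the Operator short-lived, while the state-mapping logic could live in a mix-in shared by other Slurm-aware operators.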
josephmje commented 2 years ago

This could be useful: https://git.astron.nl/eosc/slurmexecutorplugin

jerdra commented 2 years ago

I saw this earlier while I was researching a solution for this problem. I think the one thing that bothered me about it was a lack of flexibility in how jobs are submitted.

For example, having some low-compute jobs run locally on the Airflow server while others are submitted to the queue when they are compute-intensive.

Actually, now that I think about it a bit more, a possible solution would be a Slurm partition that specifically submits back to tigrsrv for those very low-compute jobs. We could pass an executor_config to modify how the jobs are submitted.
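A rough sketch of that routing idea: a helper that builds a per-task executor_config, sending low-compute tasks back to a tigrsrv-style partition and everything else to the main compute partition. The threshold, key names, and partition names here are illustrative assumptions, not actual cluster settings:

```python
def build_executor_config(est_cpu_hours, local_threshold=0.1):
    """Choose a submission target per task (hypothetical helper).

    Tasks estimated below the threshold are routed to a partition that
    runs back on the Airflow host ('tigrsrv'); heavier tasks go to the
    regular compute partition.
    """
    if est_cpu_hours <= local_threshold:
        return {"slurm_options": {"partition": "tigrsrv"}}
    return {"slurm_options": {"partition": "compute"}}
```

The returned dict could then be passed as a task's `executor_config`, keeping the routing decision declarative at the DAG level.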

The one downside is that it's not as portable as a hybrid model (i.e. extending the actual LocalExecutor with Slurm submission capability).

The plugin you shared is pretty barebones (i.e. it doesn't implement executor_config), but that could be an opportunity to build off of it.

Thoughts from @DESm1th @kimjetwav?