celerity / slurmactiond

Schedule GitHub Actions jobs on a cluster through SLURM
MIT License

Poll GitHub API instead of using webhooks #14

Open · pzehner opened this issue 5 days ago

pzehner commented 5 days ago

This project sounds very interesting for the needs of my lab. Unfortunately, opening a port and running a reverse proxy on a machine that submits SLURM jobs is not really an option for me. As an alternative to using a webhook, would it be possible to poll the GitHub API instead?

I dug around a little bit; it sounds like you get the workflows of a repository with:

https://api.github.com/repos/USER/NAME/actions/workflows

then you get the latest runs of a workflow with:

https://api.github.com/repos/USER/NAME/actions/workflows/WORKFLOW_ID/runs

filter for the queued runs, then you get the jobs of a run with:

https://api.github.com/repos/USER/NAME/actions/runs/RUN_ID/jobs

filter for the self-hosted jobs, and you get the expected labels for the runner to spawn.

I'd love to contribute to this, but I'm not familiar with Rust.

fknorr commented 5 days ago

It would certainly be possible to poll GitHub periodically, but I would be concerned about running into API rate limits unless we allow a rather high latency between a CI job triggering and slurmactiond launching a job.

Have you considered running the reverse proxy on a separate, publicly reachable server and forwarding a port to the SLURM headnode / slurmactiond host through your local network?
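For illustration, the forwarding on the public host could look roughly like the following minimal nginx site. This is only a sketch: the hostname, certificate paths, internal host and port are placeholders, and the actual listen address/path depend on your slurmactiond configuration.

```nginx
# Sketch only: ci-proxy.example.org, headnode.internal and port 8080 are
# placeholders; the real values come from your network and slurmactiond setup.
server {
    listen 443 ssl;
    server_name ci-proxy.example.org;

    ssl_certificate     /etc/ssl/ci-proxy/fullchain.pem;
    ssl_certificate_key /etc/ssl/ci-proxy/privkey.pem;

    # Forward GitHub webhook deliveries into the local network.
    location / {
        proxy_pass http://headnode.internal:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
```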

pzehner commented 5 days ago

> It would certainly be possible to poll GitHub periodically, but I would be concerned about running into API rate limits unless we allow a rather high latency between a CI job triggering and slurmactiond launching a job.

I checked the documentation: unauthenticated fetches are limited to 60 per hour, authenticated fetches to 5,000 per hour, and authenticated fetches with Enterprise Cloud to 15,000 per hour.

As you need 3 fetches per poll, in the worst (unauthenticated) case you are limited to 60 / 3 = 20 polls per hour, i.e. one poll every 3 minutes, which sounds reasonable.

> Have you considered running the reverse proxy on a separate, publicly reachable server and forwarding a port to the SLURM headnode / slurmactiond host through your local network?

I have to check this option, but I'm afraid it may not be possible in my case.

fknorr commented 5 days ago

I took another look at the code this afternoon, and doing what you propose is unfortunately a bit more complex than adding a timer loop around the update function (which I had hoped would suffice), because we also need to know about jobs completing / failing, not just new ones arriving. I do not have the capacity at the moment to make this happen myself.

Nonetheless, having a polling feature would be neat even when a webhook is present, because webhook deliveries can be dropped spuriously by a bad network connection. For such setups, a polling period of ~10 minutes or so would suffice.

For reference: slurmactiond maintains a Scheduler state machine which keeps track of the active jobs and runners and triggers state transitions when a webhook event arrives to signal a job update. For a polling-based approach, we need to periodically poll each job (and trigger state updates where changes are detected) and also query for new and unassigned jobs. There would probably need to be a separate module next to webhook.rs for these periodic queries.
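To make the shape of that module a little more concrete, here is a very rough sketch of one polling pass over the endpoints listed above. Everything in it is a placeholder rather than the real interface: the Scheduler interaction is only hinted at in a comment, OWNER / REPO / GITHUB_TOKEN would come from the configuration, and the crate choices (reqwest with the blocking and json features, serde_json, anyhow) are just for illustration.

```rust
// poll.rs -- rough sketch of a polling module living next to webhook.rs.
// All names here are placeholders: OWNER/REPO/GITHUB_TOKEN would come from
// the configuration, and the Scheduler call is only hinted at in a comment
// because this sketch does not know the real Scheduler interface.

use std::{thread, time::Duration};

use anyhow::Result;
use serde_json::Value;

const API: &str = "https://api.github.com";

fn get_json(client: &reqwest::blocking::Client, url: &str, token: &str) -> Result<Value> {
    Ok(client
        .get(url)
        .header("User-Agent", "slurmactiond-poll-sketch")
        .header("Accept", "application/vnd.github+json")
        .bearer_auth(token)
        .send()?
        .error_for_status()?
        .json()?)
}

/// One polling pass over the endpoints discussed above:
/// workflows -> queued runs -> jobs waiting for a self-hosted runner.
fn poll_once(client: &reqwest::blocking::Client, owner: &str, repo: &str, token: &str) -> Result<()> {
    let empty = Vec::new();

    let workflows = get_json(client, &format!("{API}/repos/{owner}/{repo}/actions/workflows"), token)?;
    for workflow in workflows["workflows"].as_array().unwrap_or(&empty) {
        let workflow_id = workflow["id"].as_u64().unwrap_or(0);

        let runs = get_json(
            client,
            &format!("{API}/repos/{owner}/{repo}/actions/workflows/{workflow_id}/runs"),
            token,
        )?;
        for run in runs["workflow_runs"].as_array().unwrap_or(&empty) {
            if run["status"] != "queued" {
                continue;
            }
            let run_id = run["id"].as_u64().unwrap_or(0);

            let jobs = get_json(
                client,
                &format!("{API}/repos/{owner}/{repo}/actions/runs/{run_id}/jobs"),
                token,
            )?;
            for job in jobs["jobs"].as_array().unwrap_or(&empty) {
                let labels: Vec<&str> = job["labels"]
                    .as_array()
                    .map(|l| l.iter().filter_map(Value::as_str).collect())
                    .unwrap_or_default();
                if job["status"] == "queued" && labels.contains(&"self-hosted") {
                    // This is where the real module would drive the Scheduler state
                    // machine, e.g. something like scheduler.job_enqueued(job_id, &labels).
                    println!("queued self-hosted job {} with labels {labels:?}", job["id"]);
                }
            }
        }
    }
    // A real implementation would additionally re-query the jobs the Scheduler
    // currently considers active, to detect completion / failure and trigger
    // the corresponding state transitions.
    Ok(())
}

fn main() -> Result<()> {
    let token = std::env::var("GITHUB_TOKEN")?;
    let client = reqwest::blocking::Client::new();
    loop {
        if let Err(e) = poll_once(&client, "OWNER", "REPO", &token) {
            eprintln!("poll failed: {e}");
        }
        // ~3 minutes keeps even the per-workflow fan-out well below the
        // authenticated rate limit for a small repository.
        thread::sleep(Duration::from_secs(180));
    }
}
```

The interesting part, and the reason this is more than a timer loop, is the second half: re-querying the jobs the Scheduler already tracks so that completions and failures are observed as well, and feeding those transitions into the same state machine the webhook path uses.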

pzehner commented 5 days ago

Extending an event-driven design with a polling-based one is indeed not trivial.

I guess your current Scheduler could be reused. As I said, I'd open a PR for this if the project were in Python or C++…