alexhsamuel / apsis

General-purpose scheduler.

Procstar ECS program #333

Open gusostow opened 5 months ago

gusostow commented 5 months ago

The plan we discussed a while ago:

The apsis scheduler launches jobs as ECS tasks, which are basically just containers. The running process in the container is still mediated by procstar: the container's CLI command contains the JSON spec, and procstar and the ECS task terminate once the job process exits.
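
As a rough illustration of passing the spec on the container command line, the sketch below builds keyword arguments in the shape that boto3's ECS `run_task` accepts (`cluster`, `taskDefinition`, `overrides.containerOverrides`). The cluster, task definition, and container names, the `agent_argv` prefix, and the layout of `proc_spec` are all placeholders for illustration, not procstar's actual CLI or spec format.

```python
import json


def build_run_task_kwargs(cluster, task_definition, container_name,
                          agent_argv, proc_spec):
    """Build keyword arguments shaped for boto3's ECS `run_task` call.

    `agent_argv` is whatever command starts the agent in the image
    (an assumption here); the JSON-encoded spec is appended as the
    final command-line argument.
    """
    spec_json = json.dumps(proc_spec, separators=(",", ":"))
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "count": 1,
        "overrides": {
            "containerOverrides": [
                # Override only the command; image, CPU, and memory come
                # from the registered task definition.
                {"name": container_name, "command": [*agent_argv, spec_json]},
            ],
        },
    }


# Hypothetical names and spec layout, for illustration only.
kwargs = build_run_task_kwargs(
    cluster="apsis-jobs",
    task_definition="procstar-agent",
    container_name="procstar",
    agent_argv=["procstar"],
    proc_spec={"argv": ["/usr/bin/echo", "hello"]},
)
command = kwargs["overrides"]["containerOverrides"][0]["command"]
```

With AWS credentials in place, something like `boto3.client("ecs").run_task(**kwargs)` would submit the task; depending on the launch type and network mode, extra arguments such as `launchType` or `networkConfiguration` may also be needed.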

We said the scheduler would poll and signal the job via the ECS API, rather than using procstar itself. But now I'm wondering whether it's reasonable to lean more on procstar, instead of ECS, for tracking and controlling jobs.

Is it possible to launch a one-off procstar process and then connect to it via the apsis scheduler's websocket server? I'm not seeing a straightforward way to retrieve the connection for the new procstar agent as-is.

alexhsamuel commented 5 months ago

I don't remember the plan involving polling via ECS. I think using the procstar agent's websocket communication mechanism is preferable, as long as the ECS instance routes back to Apsis.

Could we start by discussing requirements?

The procstar agent connects to Apsis (not vice versa), which is why routing has to be enabled. There is no need to have multiple procstar agents on one ECS instance (assuming single user) since one agent can handle multiple processes.

One way to associate a particular Apsis run with a particular procstar agent is via the group ID; you can make up as many of these as you want, and they may be ephemeral. But I'm not sure this is a good idea—you probably want some abstraction between your job configs and specific ECS instances. A job probably wants to specify a username (or some other auth signifier) and possibly some resource info like ECS instance type, and leave it to Apsis to arrange for a new or existing ECS instance to run on, right?

gusostow commented 5 months ago

I think I might've misremembered a decision to poll with ECS, especially because polling is so inferior to procstar websocket updates.

To clarify the ECS setup, ECS is an orchestration tool, like Kubernetes. ECS will worry about managing the underlying EC2 instances, so the apsis scheduler only needs to think in terms of launching containerized processes with specific cpu/memory requirements.

For the purposes of my question about launching one-off procstar agents, we don't need to worry about orchestration, EC2, or containers. The relevant challenge could be reduced to the apsis scheduler launching per-job procstar agents as subprocesses on the same host.

> Would you want to run multiple runs on a single ECS + procstar instance? This probably is more economical than one-run-per-ECS.

Multiple runs per EC2 instance, yes, but that will be abstracted away. I want to avoid multiple jobs per procstar agent though because container-based resource/filesystem isolation is valuable to have. Additionally, long-lived procstar agents pose a code delivery challenge because they will only be able to run jobs from the environment in their container image, which would get stale fast. Ideally, jobs would run on a dedicated, one-off procstar agent that disconnects when the job finishes.

> One way to associate a particular Apsis run with a particular procstar agent is via the group ID; you can make up as many of these as you want, and they may be ephemeral. But I'm not sure this is a good idea—you probably want some abstraction between your job configs and specific ECS instances.

Ok cool, I think using a per-job group-id should work, assuming it doesn't hurt apsis to have a lot of groups and that they can be deregistered.

I believe with those clarifications, what I'm proposing should provide the abstraction that you're encouraging.

alexhsamuel commented 5 months ago

Oh yeah sorry, s/ECS/EC2/ in my response, regardless of how you kick them off.

Also, I realize that in this case we don't have to confuse matters with group IDs; each agent also has a connection ID. Normally the agent chooses it randomly (so that if the agent restarts, we notice this), but in the ECS case we can just assign the connection ID. You can also specify the process spec at agent startup, so that it starts the proc immediately.

So we need a program that:

  1. Kicks off an ECS procstar instance with a unique connection ID and the proc spec.
  2. Waits for a connection to be established with this connection ID.
  3. Waits for the connection to return proc results.
  4. Deletes the proc from the instance.
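
The four steps above could be sketched roughly as follows with `asyncio`. Everything here is a stand-in: `ConnectionRegistry`, `FakeConnection`, `run_job`, and the `launch` callback are hypothetical names, not part of the Apsis or procstar APIs, and the fake agent "connects" and finishes immediately so the flow can run without ECS.

```python
import asyncio
import uuid


class ConnectionRegistry:
    """Maps connection IDs to futures resolved when that agent connects."""

    def __init__(self):
        self._waiters = {}

    def expect(self, conn_id):
        # Register interest before launching, so a fast agent can't
        # connect before anyone is waiting for it.
        fut = asyncio.get_running_loop().create_future()
        self._waiters[conn_id] = fut
        return fut

    def on_connect(self, conn_id, conn):
        # Would be called by the websocket server when an agent connects.
        fut = self._waiters.pop(conn_id, None)
        if fut is not None and not fut.done():
            fut.set_result(conn)


class FakeConnection:
    """Stand-in for an agent connection; the proc succeeds immediately."""

    def __init__(self):
        self.deleted = False

    async def wait_result(self):
        await asyncio.sleep(0)
        return {"exit_code": 0}

    async def delete_proc(self):
        self.deleted = True


async def run_job(registry, launch, proc_spec, connect_timeout=5.0):
    conn_id = str(uuid.uuid4())                 # unique per job
    waiter = registry.expect(conn_id)
    await launch(conn_id, proc_spec)            # 1. kick off the instance
    conn = await asyncio.wait_for(waiter, connect_timeout)  # 2. wait for connection
    result = await conn.wait_result()           # 3. wait for proc results
    await conn.delete_proc()                    # 4. delete the proc
    return conn, result


async def main():
    registry = ConnectionRegistry()

    async def fake_launch(conn_id, proc_spec):
        # Pretend the container started and its agent dialed back in.
        loop = asyncio.get_running_loop()
        loop.call_soon(registry.on_connect, conn_id, FakeConnection())

    return await run_job(registry, fake_launch, {"argv": ["true"]})


conn, result = asyncio.run(main())
```

The connect timeout is one place where a failed container start would surface; what to do in that case (retry, mark the run as errored) is left open here.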

The agent can be configured to shut down automatically once the proc is deleted. If you can configure the container to stop as soon as procstar terminates, we're all set for lifecycle in the correct case. I suppose there still needs to be some monitoring to report and clean up orphaned containers etc.
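
For the orphan monitoring mentioned above, one possible shape is a periodic check that compares running tasks against live agent connections. This is only a sketch: `TaskInfo`, `find_orphans`, and the five-minute grace period are assumptions, and it presumes the scheduler can recover each task's connection ID (for example from a tag it set at launch).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInfo:
    task_arn: str      # ECS task identifier
    conn_id: str       # connection ID assigned to the agent at launch
    started_at: float  # start time, seconds since the epoch


def find_orphans(tasks, live_conn_ids, now, grace_seconds=300.0):
    """Return tasks past their startup grace period with no live connection."""
    orphans = []
    for task in tasks:
        if now - task.started_at < grace_seconds:
            continue  # may still be starting up and connecting
        if task.conn_id not in live_conn_ids:
            orphans.append(task)
    return orphans


tasks = [
    TaskInfo("arn:a", "c1", started_at=0.0),    # old, connected
    TaskInfo("arn:b", "c2", started_at=0.0),    # old, not connected
    TaskInfo("arn:c", "c3", started_at=950.0),  # young, not connected yet
]
orphans = find_orphans(tasks, live_conn_ids={"c1"}, now=1000.0)
```

In a real deployment the task list might come from the ECS `list_tasks` and `describe_tasks` calls, with `stop_task` used for cleanup once an orphan has been reported.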

gusostow commented 5 months ago

Nice, sounds like a plan! How do you assign a custom connection ID?

alexhsamuel commented 5 months ago

https://github.com/alexhsamuel/procstar/blob/main/src/argv.rs#L65-L66

I'll un-hide this option.

gusostow commented 5 months ago

To prevent leaking agents, it would be useful if we could tell them to shut down after their job completes.

I first looked at using procstar --exit SPEC but it's troublesome because the job might finish before the scheduler gets results.

So I think we'd either need to

The scheduler can also clean up the agents when jobs finish, but I have a feeling that self-shutdown will prevent a lot of leaked agents.

gusostow commented 5 months ago

> The agent can be configured to shut down automatically once the proc is deleted. If you can configure the container to stop as soon as procstar terminates, we're all set for lifecycle in the correct case.

Never mind, I missed this. Using --wait should be fine.