broadinstitute / cromshell

CLI for interacting with Cromwell servers
BSD 3-Clause "New" or "Revised" License

Feature request: "preempt" job (as opposed to "abort") #158

Open sjfleming opened 3 years ago

sjfleming commented 3 years ago

@dalessioluca might have input here as well.

I am looking for the ability to programmatically preempt a running job using a cromshell command. I do not want to abort the job. I want to kill the preemptible machine that the job is running on, and then let the workflow continue.

Why?

I would like to use this command as a way to test what happens to my preemptible jobs if they were to get preempted. The only way I know of to test this in the wild is something like this feature request.

What would this involve?

I don't know exactly how to do this, but it might be some variant of

gcloud compute instances stop VM_NAME

https://cloud.google.com/compute/docs/instances/stop-start-instance#gcloud

It would be great to have cromshell handle figuring out the name of the VM running the job.

What do I currently do to achieve the desired effect?

@dalessioluca taught me: you can use the Google Compute web UI to look for the instance running the Cromwell job based on the Cromwell workflow ID. If you can find the instance, you can use the UI to "stop" it, and this will preempt the job. The problem is, it requires you to have a few windows open, good timing, and a nonzero amount of effort.

jonn-smith commented 3 years ago

This is an interesting idea. @sjfleming How would this work for workflows with many tasks? Do you want to preempt one particular subtask or all subtasks?

@lbergelson What do you think?

lbergelson commented 3 years ago

It sounds like a niche but potentially useful idea. I'm not sure how to implement it, since I'm assuming there's no hook in Cromwell to do it. Cromshell doesn't have access to the underlying Google infrastructure without some additional auth.

lbergelson commented 3 years ago

This really seems like an ideal cromshell 2 plugin rather than a base part of cromshell.

jonn-smith commented 3 years ago

I think Auth should be handled by your environmental settings (we're just calling out to gsutil / gcloud).

Seems like we'd have to parse the metadata to find out what's going on in the workflow and try to identify the machine from there. I'm not sure how to do that with the info in the metadata (or even if we can!) - I haven't looked. Assuming all the info is there it shouldn't really be an issue to do this…

👍 on the Cromshell 2.0 plugin idea, though we may be able to sneak this into this version if someone gets/makes time for it.

lbergelson commented 3 years ago

Yeah, I guess you're right that gcloud will just handle it. I was thinking that you would need access to the project that Cromwell was running under, but in our case we have that.

sjfleming commented 2 years ago

@lbergelson yes it would be super niche... probably hardly anyone would ever use it

@jonn-smith yeah, I think just calling gcloud is the only way to go here. But you bring up a great point about "which task"... I have been living in the simple world where my workflows are usually just one task. I guess what I'd really want is to be able to specify the task. The task is what I would actually want to preempt, since you're (probably) trying to target one thing that you want to resume after preemption.

sjfleming commented 2 years ago

Also, I guess this would only work with the google cloud backend at this point? I assume that's fine for now

sjfleming commented 2 years ago

I am watching a workflow's metadata as it's running, and I do NOT see the machine name (instanceName) until after the job completes. Is this expected? While the task is running, I see

"jes": {
  "executionBucket": "gs://broad-methods-cromwell-exec-bucket-instance-8",
  "endpointUrl": "https://genomics.googleapis.com/",
  "googleProject": "broad-dsde-methods"
}

and after the task (or maybe workflow?) is complete, I see

"jes": {
  "endpointUrl": "https://genomics.googleapis.com/",
  "machineType": "custom-4-15360",
  "googleProject": "broad-dsde-methods",
  "executionBucket": "gs://broad-methods-cromwell-exec-bucket-instance-8",
  "zone": "us-west1-b",
  "instanceName": "google-pipelines-worker-3288067dcdcd9fb46a6e21bc7cc00311"
}

(if there's no way to get the instanceName of a running task, this ruins the approach I had in mind...)
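For anyone who wants to check this on their own server, here is a minimal Python sketch that scans a workflow's metadata for running call attempts and reports whether an instanceName is present yet. It assumes the usual metadata layout where "calls" maps each call name to a list of attempts carrying "executionStatus", "shardIndex", and the "jes" block shown above. The function name `running_instances` and the call name `my_workflow.my_task` are made up for illustration.

```python
import json
from typing import Dict, List


def running_instances(metadata: dict) -> List[Dict[str, object]]:
    """List each running call attempt with its VM name, if one has been reported."""
    rows = []
    # Assumption: "calls" maps a call name to a list of attempt dicts.
    for call_name, attempts in metadata.get("calls", {}).items():
        for attempt in attempts:
            if attempt.get("executionStatus") != "Running":
                continue
            jes = attempt.get("jes", {})
            rows.append({
                "call": call_name,
                "shard": attempt.get("shardIndex"),
                # Stays None for as long as the metadata omits the field.
                "instanceName": jes.get("instanceName"),
                "zone": jes.get("zone"),
            })
    return rows


# Sample shaped like the "while running" snippet above (call name is hypothetical).
sample = json.loads("""
{"calls": {"my_workflow.my_task": [{
    "executionStatus": "Running",
    "shardIndex": -1,
    "jes": {
        "executionBucket": "gs://broad-methods-cromwell-exec-bucket-instance-8",
        "endpointUrl": "https://genomics.googleapis.com/",
        "googleProject": "broad-dsde-methods"
    }
}]}}
""")
print(running_instances(sample))
```

With the sample above, the single running attempt comes back with no instanceName and no zone, which matches what I'm seeing on a live job.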

sjfleming commented 2 years ago

Maybe I would have to resort to something like

gcloud compute instances list --format="table(name,status,tags.list())"

and then comb through the tags, since Cromwell does add a tag with the workflow ID... but that seems less than ideal.

lbergelson commented 2 years ago

@sjfleming Are the VMs that PAPI manages actually exposed in any way to us? I hadn't really thought about it, but I don't even know if they're listed in your project.

sjfleming commented 2 years ago

@lbergelson they are! Yeah it's kinda cool actually, if we run something from the methods Cromwell server for example, then if you go to the Google Cloud Compute Engine Console in a browser, you can see that the Cromwell jobs are running on VMs whose names start with google-pipelines-worker-. And, very helpfully, they have labels with the Cromwell workflow_id and task. I don't know if every Cromwell instance applies those helpful labels to machines it spins up, or if that's some nice feature that somebody added to the methods Cromwell server.

If I can always count on those labels being there, then I can run this

gcloud compute instances list --filter='labels.cromwell-workflow-id:cromwell-{WORKFLOW_ID} labels.wdl-task-name:{TASK}' --format 'table(name)'

to get the name of the instance I want to stop, and then I can stop it with

gcloud compute instances stop {INSTANCE_NAME}
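The two commands above could be chained in a small Python wrapper, roughly like the sketch below. This is only an illustration, not existing cromshell code: the function names are hypothetical, the label keys are the ones observed on the methods Cromwell server and may not exist elsewhere, and it assumes gcloud is installed and already authenticated for the right project. It also asks for the zone alongside the name so that the stop command does not have to prompt for it.

```python
import subprocess
from typing import List, Tuple


def list_command(workflow_id: str, task: str) -> List[str]:
    """Build the gcloud call that finds VMs labeled with this workflow and task."""
    # Assumption: these label keys are present on every worker VM.
    label_filter = (
        f"labels.cromwell-workflow-id:cromwell-{workflow_id} "
        f"labels.wdl-task-name:{task}"
    )
    return [
        "gcloud", "compute", "instances", "list",
        f"--filter={label_filter}",
        # value() prints bare fields, which is easier to parse than table().
        "--format=value(name,zone.basename())",
    ]


def parse_instances(output: str) -> List[Tuple[str, str]]:
    """Turn 'name<whitespace>zone' lines into (name, zone) pairs."""
    pairs = []
    for line in output.splitlines():
        fields = line.split()
        if len(fields) == 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def stop_command(instance_name: str, zone: str) -> List[str]:
    """Build the gcloud call that stops one VM."""
    return ["gcloud", "compute", "instances", "stop", instance_name, f"--zone={zone}"]


def preempt_task(workflow_id: str, task: str) -> List[str]:
    """Stop every VM currently running the given task; return the names stopped."""
    listing = subprocess.run(
        list_command(workflow_id, task), check=True, capture_output=True, text=True
    )
    stopped = []
    for name, zone in parse_instances(listing.stdout):
        subprocess.run(stop_command(name, zone), check=True)
        stopped.append(name)
    return stopped
```

Calling `preempt_task(WORKFLOW_ID, TASK)` would stop all matching VMs, so a scattered task would lose every running shard at once. Targeting a single shard would need an extra filter that I haven't worked out.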