galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Workflow intermediate files not cleaned up on execution of workflows with delayed scheduling #15433

Open pcm32 opened 1 year ago

pcm32 commented 1 year ago

Describe the bug

When configuring workflow intermediate files to be cleared once they are no longer needed, the cleanup does not happen when running on the k8s runner. I get:

galaxy.job_execution.actions.post DEBUG 2023-01-27 20:33:59,871 [pN:job_handler_0,p:8,tN:KubernetesRunner.work_thread-3] This job is not part of a workflow invocation, delete intermediates aborted.

Does the runner need to do something with the job so that it gets marked as a workflow job? I would expect this to be completely foreign to the runner.
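For context, the message comes from a guard at the top of `DeleteIntermediatesAction` in `lib/galaxy/job_execution/actions/post.py`. A minimal sketch of its shape (paraphrased, not the verbatim source; the `workflow_invocation_step` attribute is an assumption about the model):

```python
import logging

log = logging.getLogger(__name__)


class DeleteIntermediatesAction:  # in Galaxy this extends DefaultJobAction
    @classmethod
    def execute(cls, app, sa_session, action, job, replacement_dict, final_job_state=None):
        # The finished job must be linked back to a workflow invocation step;
        # if that link is missing, the action bails out with the message
        # quoted above and no intermediates are deleted.
        if not getattr(job, "workflow_invocation_step", None):
            log.debug("This job is not part of a workflow invocation, delete intermediates aborted.")
            return
        # ...otherwise walk the invocation's steps and delete hidden outputs
        # that no remaining step consumes...
```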

Galaxy Version and/or server at which you observed the bug

Galaxy Version: 22.05
Commit: 36f80978e1b9743f413491a51f47c21ba522c6ed

To Reproduce

Steps to reproduce the behavior:

  1. Make sure that the workflow has intermediate result files
  2. Mark some of those as not needed as outputs (so that they are hidden)
  3. Enable the cleanup of the resulting files on those tool steps
  4. Run on a Galaxy setup using the k8s runner

Expected behavior

Intermediate files should be deleted (and not only hidden) once they are no longer needed by the pipeline.

I would also be happy to help with the missing functionality of getting rid of files permanently when the purge option is on, if I can get some direction on how it should be done.

mvdbeek commented 1 year ago

I don't think that's related to k8s. Can you provide a simple workflow to reproduce this?

pcm32 commented 1 year ago

OK, so this workflow works as intended with intermediate deletion on galaxy.eu:

[workflow screenshot]

and it also works on my k8s setup :-( ... so could it be that the other behaviour is down to the multiple scheduling phases of my more complex workflow?

[workflow screenshot]

I guess I could try to reproduce it with some collections and filtering steps.

mvdbeek commented 1 year ago

The intermediate deletion is more of a best-effort attempt at this point and doesn't take care of delayed scheduling ... we know there's a lot of work left to do this reliably.

pcm32 commented 1 year ago

...ok, I was expecting that part to kick in here though:

https://github.com/galaxyproject/galaxy/blob/ffa2fb3e922f4fde5604bb9b404dfdc367ff70d6/lib/galaxy/job_execution/actions/post.py#L372

so maybe, for some reason, some other process is tagging this job (due to the active workflow or something else) before it reaches this point, which might be confusing when it comes to changing this functionality.

Could you please give me some pointers on where I would need to look to make this work by checking potential uses of files before steps are scheduled (this would still be constrained by the original workflow, I reckon)? A rough sketch of the check I have in mind follows below. This would save us a lot of transient disk space. Thanks!
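A hypothetical sketch of that check (not existing Galaxy code; none of the model attribute names here are verified API):

```python
def can_purge_intermediate(invocation, hda):
    """Hypothetical helper: decide whether a hidden intermediate dataset can
    be purged under delayed scheduling. All attribute names (invocation.steps,
    workflow.steps, hda.dependent_jobs) are assumptions for illustration."""
    scheduled_step_ids = {s.workflow_step_id for s in invocation.steps}
    unscheduled = [s for s in invocation.workflow.steps if s.id not in scheduled_step_ids]
    if unscheduled:
        # A later scheduling pass could still wire this dataset into a new
        # job, so the only safe answer for now is "keep it".
        return False
    # Every step is scheduled: the dataset is purgeable once all jobs that
    # consume it have reached a terminal state.
    terminal = {"ok", "error", "deleted"}
    return all(job.state in terminal for job in hda.dependent_jobs)
```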

mvdbeek commented 1 year ago

I don't think this is something that we want to fix in a small hack or project; that's why I said "a lot of work". If you do want to work on this, I would suggest implementing "post workflow actions" or checkpoints, to which you can attach cleanup actions, exports, or other things (like sending an email, which is also borderline non-functional).

This could be implemented as a new step type that has dependencies on all (or some, for checkpoints) leaf datasets, or you could add another state in the invocation scheduler that waits for all jobs to complete. It will be challenging to do this in a performant way.
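As a rough illustration of the second option (the extra invocation state), a hypothetical sketch; none of these names exist in Galaxy today, and this only shows the control flow:

```python
from enum import Enum


class InvocationState(str, Enum):
    NEW = "new"
    SCHEDULED = "scheduled"      # all steps scheduled; jobs may still be running
    FINALIZING = "finalizing"    # hypothetical new state: wait for all jobs
    COMPLETE = "complete"


TERMINAL_JOB_STATES = {"ok", "error", "deleted"}


def run_post_workflow_actions(invocation):
    """Stub: this is where cleanup, exports, notifications, etc. would hang."""


def advance(invocation):
    # Hypothetical scheduler pass: once scheduling is done, hold the
    # invocation in FINALIZING until every job is terminal, then fire the
    # post-workflow actions exactly once.
    if invocation.state == InvocationState.SCHEDULED:
        invocation.state = InvocationState.FINALIZING
    if invocation.state == InvocationState.FINALIZING:
        if all(job.state in TERMINAL_JOB_STATES for job in invocation.jobs):
            run_post_workflow_actions(invocation)
            invocation.state = InvocationState.COMPLETE
```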

pcm32 commented 1 year ago

Are there any existing "post workflow actions" or checkpoints, as you mention, that I could look at for examples of where they live in the codebase, what they extend, and so on? I didn't manage to find any. Or would this be something to architect from scratch? Where would you attach those "post workflow actions" in the job execution timeline if they don't exist yet? At the handler level, after a job is marked as finished?

I would imagine at least the following path:

  1. Successful execution of a workflow job, marked as finished by the handler, "post workflow/job action" grabs it.
  2. Retrieve workflow invocation for that job
  3. Get workflow from workflow invocation (is the workflow an object in memory at this point within the invocation?)
  4. Map current job to a workflow step (perhaps this is already contained in the Job object?)
  5. Map current job outputs to that workflow step output
  6. Check which of those outputs are used as inputs elsewhere or are ticked as workflow outputs; mark the rest for purging (see the sketch below).
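A sketch of how those steps might map onto code (hypothetical; `feeds_later_step` and `is_workflow_output` are made-up helpers, and the model attribute names are assumptions):

```python
def feeds_later_step(workflow, step, output_name):
    """Stub: true if any downstream step's input connection references this
    (step, output_name) pair. Would be a query over workflow connections."""
    raise NotImplementedError


def is_workflow_output(step, output_name):
    """Stub: true if the output is ticked as a workflow output."""
    raise NotImplementedError


def on_job_finished(job):
    # Step 1: called by the handler once the job is marked as finished.
    wis = job.workflow_invocation_step          # step 2: job -> invocation step
    if wis is None:
        return
    invocation = wis.workflow_invocation
    workflow = invocation.workflow              # step 3: workflow via the invocation
    step = wis.workflow_step                    # step 4: invocation step -> workflow step
    for assoc in job.output_datasets:           # step 5: this job's outputs
        hda = assoc.dataset
        # Step 6: keep anything still consumed downstream or ticked as a
        # workflow output; everything else becomes a purge candidate.
        if not feeds_later_step(workflow, step, assoc.name) and not is_workflow_output(step, assoc.name):
            hda.mark_for_purge()                # assumed method, for illustration
```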

I suspect there is more complexity in each step. I'm guessing everything would start from the job handler, after the job is marked as finished, or is there another object that takes care of jobs once they are marked as finished?

Thanks.

mvdbeek commented 1 year ago

Are there any existing "post workflow actions" or checkpoints, as you mention, that I could look at for examples of where they live in the codebase, what they extend, and so on?

no

Or would this be something to architect from scratch?

yes

I would imagine at least the following path:

  1. Successful execution of a workflow job, marked as finished by the handler, "post workflow/job action" grabs it.
  2. Retrieve workflow invocation for that job
  3. Get workflow from workflow invocation (is the workflow an object in memory at this point within the invocation?)
  4. Map current job to a workflow step (perhaps this is already contained in the Job object?)
  5. Map current job outputs to that workflow step output
  6. Check which of those outputs are used as inputs elsewhere or are ticked as workflow outputs; mark the rest for purging.

That doesn't work when concurrent jobs finish, and it is too narrow for other, similar use cases.
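The concurrency problem is a classic check-then-act race. A toy, standalone illustration (no Galaxy code involved):

```python
import threading

# Toy illustration of the race in a per-job cleanup hook: two jobs finish at
# roughly the same time, each checks "am I the last running job?" before the
# other has recorded that it finished, so neither believes it is last and the
# shared intermediates are never purged.
running = {"job_a", "job_b"}
lock = threading.Lock()
purged = []


def finish(job):
    # Deliberately racy: the check and the state update are not atomic.
    i_am_last = running - {job} == set()
    with lock:
        running.discard(job)
    if i_am_last:
        purged.append(f"{job} purged the intermediates")


threads = [threading.Thread(target=finish, args=(name,)) for name in ("job_a", "job_b")]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(purged or "nobody purged: both jobs saw the other still running")
```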

mvdbeek commented 1 year ago

I guess what I'm saying is that we do need this, but it's a larger project that needs a week or more of full-time attention from someone familiar with the invocation lifecycle and the parallelism we get from collection jobs. We do have some related items on the roadmap, so I think we'll work on this in the coming months. That's not to say you couldn't give it a shot, and we'll see if there's anything to be learned.