common-workflow-language / cwltool

Common Workflow Language reference implementation
https://cwltool.readthedocs.io/
Apache License 2.0

intermediate outputs directories persist too long #951

Open mr-c opened 5 years ago

mr-c commented 5 years ago

From https://github.com/common-workflow-language/cwltool/pull/898#issuecomment-431154202 by @WESClarke

For me this still leaves intermediate directories using the example in #892.

First issue: output_dirs is always empty at L104 in my tests

https://github.com/common-workflow-language/cwltool/blob/e4bc1021ef406aff98fe68b1d29edb78eaf72c95/cwltool/executors.py#L98-L104

Second issue: job_dir is created using tmp_outdir_prefix but also does not get deleted. I tried setting L729 to shutil.rmtree(job_dir, True) without success.

https://github.com/common-workflow-language/cwltool/blob/e4bc1021ef406aff98fe68b1d29edb78eaf72c95/cwltool/job.py#L728-L729

My solutions are pretty naive but I am including them anyways.

For the second issue (job_dir): I have moved the creation of the job_dir outside of the call to _job_popen and then called for its removal with the other temporary directories.

https://github.com/WEClarke/cwltool/blob/042902254882d96109f93b5291452626ee6f83f8/cwltool/job.py#L294-L306

https://github.com/WEClarke/cwltool/blob/042902254882d96109f93b5291452626ee6f83f8/cwltool/job.py#L385-L389

For the first issue (output_dirs): I have just used output_dirs = self.output_dirs for both cases, knowing that I must be missing something. However, I have tested this with and without --cachedir without issue.

https://github.com/WEClarke/cwltool/blob/042902254882d96109f93b5291452626ee6f83f8/cwltool/executors.py#L98-L105
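As a rough illustration of the job_dir change described above (create the directory before the job runs, then remove it along with the other temporary directories), here is a minimal standalone sketch. It is not cwltool's code: `run_job`, `work`, `tmp_dirs`, and the `sketch_job_` prefix are hypothetical names chosen for the example, and only `tempfile.mkdtemp` and `shutil.rmtree` are real standard-library calls.

```python
import os
import shutil
import tempfile


def run_job(prefix, work):
    """Hypothetical sketch: create the job directory up front and remove it
    together with the other temporary directories once the job is done."""
    tmp_dirs = []  # every temporary directory this job owns
    job_dir = tempfile.mkdtemp(prefix=prefix)
    tmp_dirs.append(job_dir)
    try:
        return work(job_dir)
    finally:
        for path in tmp_dirs:
            # Same effect as shutil.rmtree(path, True): errors are ignored.
            shutil.rmtree(path, ignore_errors=True)


def demo():
    seen = []

    def work(job_dir):
        seen.append(job_dir)
        with open(os.path.join(job_dir, "scratch.txt"), "w") as handle:
            handle.write("intermediate data")
        return os.path.isdir(job_dir)

    existed_during_run = run_job("sketch_job_", work)
    # The directory exists while the job runs and is gone afterwards.
    return existed_during_run, os.path.exists(seen[0])
```

Because the cleanup sits in a `finally` block, the directory is removed even when `work` raises, which is the behaviour one would presumably want for failed jobs as well.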

mark-sp commented 5 years ago

Has anybody found a solution for this? This code has been merged into the master branch, but it is still not working as expected.

jprmachado commented 1 year ago

Has anyone found a solution to this? I am running a workflow that produces a huge amount of data; the final results are a small portion of the intermediate files, and some of them could be deleted while the workflow is running.

We could implement a routine to delete them at the end, but running multiple instances at once can easily become a storage problem.

Is anyone working on this issue?

tetron commented 1 year ago

I think this issue was in reference to a specific bug where things that were supposed to be deleted at the end of the run were not.

Deleting intermediates that are no longer needed while the workflow is running is a little bit harder. You need logic that reasons about how intermediate results are going to be used by downstream steps and when they are no longer needed, and handles cases where an input is passed through and returned in the output (keeping it "live"). This is all feasible, and CWL workflows definitely contain enough information to figure this out, but it is not completely trivial.
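One way to picture the liveness reasoning described above is simple reference counting over a linearized workflow. The sketch below is only an illustration under strong assumptions: steps run in the listed order, each file is identified by a plain string, and `plan_cleanup` and the step and file names are hypothetical. It does not use any cwltool API.

```python
from collections import Counter


def plan_cleanup(steps, final_outputs):
    """steps: list of (step_name, inputs, outputs) in execution order.
    Returns {step_name: intermediates that can be deleted once it finishes}."""
    # Count how many steps still need to read each item.
    remaining = Counter()
    for _name, inputs, _outputs in steps:
        remaining.update(set(inputs))

    keep = set(final_outputs)  # workflow outputs stay "live" forever
    produced = set()           # only items made by a step are intermediates
    plan = {}
    for name, inputs, outputs in steps:
        produced.update(outputs)
        for item in set(inputs):
            remaining[item] -= 1
        candidates = set(inputs) | set(outputs)
        plan[name] = sorted(
            item for item in candidates
            if item in produced and item not in keep and remaining[item] == 0
        )
    return plan


# Hypothetical three-step workflow; step3 passes "b" through to its output.
steps = [
    ("step1", ["input"], ["a"]),
    ("step2", ["a"], ["b"]),
    ("step3", ["b"], ["c", "b"]),
]
kept = plan_cleanup(steps, final_outputs=["c", "b"])
dropped = plan_cleanup(steps, final_outputs=["c"])
```

With `b` listed as a final output it is never scheduled for deletion, while `a` becomes deletable as soon as `step2` finishes. A real implementation would also have to handle scatter, conditionals, subworkflows, and parallel execution, none of which this sketch attempts.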

jprmachado commented 1 year ago

@tetron Got it, thanks for the reply. I guess I misinterpreted the issue title. But even at the end of the run those files persist in my case, and I am trying to debug it. I had assumed it was caused by my running conditions, but I guess the problem still exists.