Closed nservant closed 3 weeks ago
Thanks Nicolas, this is exactly the kind of feedback I've been hoping for.
The basic strategy goes like this:
a file can be deleted when (1) all consumer tasks of the file are done and (2) the file has been published (if it needs to be published)
a task directory can be deleted when (1) all consumer tasks of any of its output files are done and (2) all of its outputs that need to be published have been published
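A toy sketch of these two rules as predicates (illustrative only, not the plugin's actual code; the `OutputFile` model and its field names are hypothetical):

```python
from dataclasses import dataclass, field

@dataclass
class OutputFile:
    """Hypothetical model of a task output tracked by the cleanup observer."""
    name: str
    needs_publish: bool = False
    published: bool = False
    consumers: set = field(default_factory=set)  # names of downstream tasks

def file_deletable(f: OutputFile, done_tasks: set) -> bool:
    # (1) every consumer task of the file is done, and
    # (2) the file has been published, if it needs to be published
    return f.consumers <= done_tasks and (f.published or not f.needs_publish)

def task_dir_deletable(outputs: list, done_tasks: set) -> bool:
    # a task directory is deletable when both rules hold for all of its outputs
    return all(f.consumers <= done_tasks for f in outputs) and \
           all(f.published for f in outputs if f.needs_publish)
```

For example, a to-be-published BAM whose consumers have all finished is still kept until the publish actually completes.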
Some caveats:
Files created by operators like `splitCsv` and `collectFile` cannot be tracked and will not be deleted. For optimal performance, use only processes to create/modify files (I have added several utilities to this plugin to make this as easy as possible).
The effectiveness of the cleanup is greatly impacted by how you write your pipeline. The more dependencies you add between processes, the harder it gets for the cleanup to delete files early.
If you can share your pipeline code, or some minimal example that reproduces the cleanup behavior, I can get a sense for how soon the files can be deleted. Similarly with the missing file error, I'd like to look at the pipeline code to see what's going on.
I can think of at least one way in which the cleanup is playing it safe and could likely be more aggressive...
Also I really like the disk usage plot you made, I recall we did a similar thing when we published our GEMmaker cleanup hack. Would you be willing to share that script so that I can add a generic version to this repo? It's exactly what we need to track the cleanup performance.
Hi @bentsherman
Arff, I did not choose the best pipeline to share.
The related GitHub project is broken... however, I can share it as a tar.gz file if you wish?
Otherwise, I can run the same test on the `nf-core-hic` pipeline if that's easier?
Regarding the plot, I do not have any script... I just run a simple `for` loop in parallel with the pipeline, which executes `du -h ./work` every 300 seconds. Then I put the results in a table and make a simple plot :)
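That loop could also be scripted, e.g. (a minimal sketch; the work directory path, sampling interval, and output file are placeholders):

```python
import csv
import time
from pathlib import Path

def dir_size_bytes(root: str) -> int:
    # rough equivalent of `du` (apparent size, not block usage)
    return sum(p.stat().st_size for p in Path(root).rglob("*") if p.is_file())

def sample(root: str, out_csv: str, interval_s: int = 300, n_samples: int = None):
    """Append (elapsed_seconds, total_bytes) rows until stopped."""
    start = time.time()
    with open(out_csv, "a", newline="") as fh:
        writer = csv.writer(fh)
        i = 0
        while n_samples is None or i < n_samples:
            writer.writerow([round(time.time() - start), dir_size_bytes(root)])
            fh.flush()
            i += 1
            if n_samples is None or i < n_samples:
                time.sleep(interval_s)
```

Run `sample("./work", "usage.csv")` alongside the pipeline, then plot the resulting CSV.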
I can work with whatever pipeline code you give me, though simpler is better. I'd really like to see the pipeline that produced that graph since that's the most concrete performance evaluation we have so far
I just sent it to you by email (gmail address).
Just one comment: I realized that the only BAMs which were deleted are the ones which were published (the markdup BAMs). All the intermediate BAMs that have not been removed so far are the unpublished ones.
Are those intermediate BAMs still specified as process outputs? I'm still reviewing the pipeline code, but it looks like the answer is yes.
If they aren't process outputs then the cleanup observer won't know about them and they won't be deleted until the task directory is deleted which could be much later. Of course you can just manually delete such files in the task script.
I think I have confirmed one of my suspicions though. Your pipeline follows the typical nf-core pattern of collecting tool versions from every process using multiqc, which makes multiqc a consumer of every process, which means the cleanup observer won't delete anything until the multiqc process has started. And every nf-core pipeline will have the same problem 😅
The cleanup observer just needs to be more fine-grained, it needs to consider dependencies between separate output channels separately. This way, the intermediate BAM files can be considered independently of the tool version metadata.
If my theory is correct, you should be able to remove the multiqc process and see much better cleanup results, assuming there aren't any other "high fan-in" processes which could cause the same effect.
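Here's my mental model of the fan-in effect, as a toy example (hypothetical task names; `deletable_coarse` roughly mirrors the current behavior, `deletable_fine` the planned per-channel behavior):

```python
# consumers recorded per output channel: task -> {channel: set of consumer tasks}
outputs = {
    "BAM1": {"bam": {"BAM2"}, "versions": {"MULTIQC"}},
    "BAM2": {"bam": {"BAM3"}, "versions": {"MULTIQC"}},
    "BAM3": {"bam": set(),    "versions": {"MULTIQC"}},
}

def deletable_coarse(task: str, done: set) -> bool:
    # coarse-grained: nothing from the task can go until ALL consumers of
    # ALL of its outputs are done -- so MULTIQC blocks every BAM
    return all(consumers <= done for consumers in outputs[task].values())

def deletable_fine(task: str, channel: str, done: set) -> bool:
    # fine-grained: each output channel is considered independently
    return outputs[task][channel] <= done
```

With `done = {"BAM2"}`, the coarse rule keeps everything from BAM1 because MULTIQC hasn't run, while the fine rule already allows BAM1's `bam` output to be deleted.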
Yes, all BAM files are defined as process outputs. I'll run a test skipping the MultiQC process. I understand that the MultiQC process will be an issue for the task directory cleanup, but shouldn't the deletion of individual files be unaffected? Or did I miss something?
The issue is not really multiqc, it's with the cleanup itself. I need to improve it so that the bam files can be handled independently of the multiqc logs.
I suggest disabling multiqc only as a way to test my theory, if you want to. My planned improvements should make the cleanup happen sooner regardless of multiqc
I ran two tests without MultiQC, but both times I got the same error in one of the processes:
```
[ac/06d2f9] process > identitoFlow:identitoPolym (BC2-0362-PXD_C) [100%] 2 of 2 ✔
[a0/6f072d] process > identitoFlow:identitoCombine [100%] 2 of 2, failed: 2, retries: 1 ✘
...
Command exit status:
  1
Command output:
  (empty)
Command error:
  tail: cannot open ‘BC2-0362-PXD_C_matrix.tsv’ for reading: No such file or directory
Work dir:
  /pipelines/vegan/cleanup/work_mqc/a0/6f072d281d3c22c75b7a39a015981e
```
The work folder of the identitoCombine process does not exist... I think it was removed before the job finished. Of note, this process is very fast... it is just a `head | tail` on a text file.
Likely the cleanup observer is deleting the task directory too soon. I will investigate that as well
Hi @bentsherman, a short update. I ran two additional tests without the multiQC process. Almost no change in the size of the work folder... (31 GB remaining at the end). So my feeling is that something is missing in the current version (or in my pipeline) to remove unpublished files. Thanks
Finally got some time to work on this today. I wrote a minimal pipeline to test my theory about multiqc. The pipeline looks like this:
```mermaid
flowchart TB
    subgraph " "
    v0["Channel.of"]
    end
    v1([BAM1])
    v2([BAM2])
    v3([BAM3])
    subgraph " "
    v4[" "]
    v8[" "]
    end
    v7([SUMMARY])
    v5(( ))
    v0 --> v1
    v1 --> v2
    v1 --> v5
    v2 --> v3
    v2 --> v5
    v3 --> v4
    v3 --> v5
    v5 --> v7
    v7 --> v8
```
And then I wrote some scripts to run the pipeline, track the disk usage, and produce a time series plot similar to yours above:
The sudden drop at the end is when the summary task starts, after which all of the intermediate bam files are deleted. This confirms my theory about needing to treat the output channels separately. It will require some refactoring but hopefully shouldn't take too long to implement.
I haven't figured out yet why your unpublished BAMs aren't being deleted. It doesn't happen with my test pipeline -- when I disable the summary task, the BAMs get deleted as soon as I would expect. So there might be some aspect of your pipeline that I haven't accounted for. I'll come back to this after I address the first issue.
@nservant I just released nf-boost 0.3.1 which includes an improved cleanup algorithm. It should delete the BAM files much sooner now. But let me know if you still see the issue with only published files being deleted.
Great. I'll run new tests next week and let you know.
@bentsherman, I was not able to run the 0.3.1 version. I get a Groovy error. Any idea?
```
Apr-16 16:04:23.637 [Actor Thread 13] DEBUG nextflow.file.SortFileCollector - FileCollector temp dir not removed: null
Apr-16 16:04:23.638 [main] ERROR nextflow.cli.Launcher - @unknown
java.lang.NullPointerException: null
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3686)
        at nextflow.boost.cleanup.CleanupObserver.onFlowBegin(CleanupObserver.groovy:141)
        at nextflow.Session$_notifyFlowBegin_closure22.doCall(Session.groovy:1083)
        at nextflow.Session$_notifyFlowBegin_closure22.call(Session.groovy)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2357)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2342)
        at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2383)
        at nextflow.Session.notifyFlowBegin(Session.groovy:1083)
        at nextflow.Session.fireDataflowNetwork(Session.groovy:490)
        at nextflow.script.ScriptRunner.run(ScriptRunner.groovy:246)
        at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:137)
        at nextflow.cli.CmdRun.run(CmdRun.groovy:372)
        at nextflow.cli.Launcher.run(Launcher.groovy:499)
        at nextflow.cli.Launcher.main(Launcher.groovy:670)
```
Which Nextflow version are you using? I tested with 23.10.1
I think there are some cases in your pipeline that I have not accounted for. So as to not waste your time, I will try to run your pipeline myself to make sure it handles everything correctly. I see you have a small test profile so I should have everything I need
Same error with 23.10.1
The pipeline requires annotation files that are expected to be locally available... so you will not be able to run it to the end.
However, if I just run `nextflow run main.nf -profile test`, I get the same error...
Thanks, that helped me track down the bug. I can now run it successfully, at least up to the point of needing the input data.
So that I don't spam the main plugin registry with patch releases, please use this environment variable for now to test the latest version of the plugin:
```shell
export NXF_PLUGINS_TEST_REPOSITORY="https://github.com/bentsherman/nf-boost/releases/download/0.3.2/nf-boost-0.3.2-meta.json"
```
Once we can run your pipeline with good cleanup behavior, I'll publish the final release.
To follow up, I reran my test on the pair of WES data. The new version works very well. The intermediate BAMs are correctly removed, regardless of their publication status. No issue with MultiQC.
Here is a quick comparison of the performance (dashed = results folder, solid = work folder):
Awesome! I will go ahead and publish 0.3.2
Hi @bentsherman
Thank you so much for sharing `nf-boost`, the `cleanup` functionality is eagerly awaited by many users, including us :)
I made a first test case, and I would like to share my results with you.
I ran `cleanup` on the test profile of my variant calling pipeline (sarek-like) with a very small dataset, and most of the time it runs without any issue. I only got one error, one time; I think the intermediate VCF file was deleted before (or while) the next process started to use it. This error happens on an NFS-based system on which the I/O is a bit limited. I'm just wondering if an additional parameter to specify after how much time an intermediate file should be deleted would be nice, to avoid such issues.
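For what it's worth, the suggested delay could be modeled as a simple grace-period queue (a sketch of the idea only; nf-boost has no such option as far as I know):

```python
import heapq
import os
import time

class DelayedDeleter:
    """Delete files only after they have been deletion-eligible for grace_s seconds."""

    def __init__(self, grace_s: float):
        self.grace_s = grace_s
        self._heap = []  # (deadline, path) pairs, ordered by deadline

    def mark_eligible(self, path: str, now: float = None):
        # called when the cleanup decides the file *could* be deleted
        now = time.time() if now is None else now
        heapq.heappush(self._heap, (now + self.grace_s, path))

    def poll(self, now: float = None) -> list:
        # delete and return every path whose grace period has elapsed
        now = time.time() if now is None else now
        deleted = []
        while self._heap and self._heap[0][0] <= now:
            _, path = heapq.heappop(self._heap)
            if os.path.exists(path):
                os.remove(path)
            deleted.append(path)
        return deleted
```

The cleanup would call `mark_eligible` instead of deleting immediately, and `poll` periodically, so a slow NFS client still has a window to open the file.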
Here are the main steps of the pipeline: mapping / BAM cleaning / GATK Mutect2 / CNV calling.
Here is a summary of the work space over time:
The work folder reaches 100 GB after all the mapping post-processing steps. However, I'm a bit surprised that it was not almost completely cleaned up at the end of the pipeline.
Looking more carefully at the BAM files... the pipeline generates:
In practice, only the BAM files after MarkDup have been removed from the work directory.
Could you tell me more about when a given file should be deleted by the system? In the coming days, I'll try to run additional tests on another server with less I/O latency to see if it has an impact. But please let me know if you have any idea or additional tests I can perform to help. Thanks, Nicolas