General feedbacks - Githubissues

nservant commented 8 months ago

Hi @bentsherman

Thank you so much for sharing ǹf-boost, the cleanup fonctionality is eagerly awaited by many users, including us :)

I made a first test case, and I would like to share my results with you ;

First, I had no issue to set up the plug-in.
I tested the cleanup on the test profile of my variant calling pipeline (sarek-like) with a very small dataset, and most of the time it runs without any issue. I only get one error, one time ;

##Command error:
##  Exception in thread "main" java.lang.RuntimeException: File not found 'D262E02_T_vs_D262E01_N_Mutect2_calls_norm_GnomAD_filtered_ICGC_CancerHotspots_COSMIC_dbNSFP.vcf.gz'
##          at org.snpeff.util.Gpr.reader(Gpr.java:501)
##          at org.snpeff.util.Gpr.reader(Gpr.java:484)
##          at org.snpeff.fileIterator.MarkerFileIterator.init(MarkerFileIterator.java:64)
##          at org.snpeff.fileIterator.FileIterator.<init>(FileIterator.java:39)
##          at org.snpeff.fileIterator.MarkerFileIterator.<init>(MarkerFileIterator.java:37)
##          at org.snpeff.fileIterator.VcfFileIterator.<init>(VcfFileIterator.java:82)
##          at org.snpsift.SnpSiftCmdExtractFields.run(SnpSiftCmdExtractFields.java:145)
##          at org.snpsift.SnpSiftCmdExtractFields.run(SnpSiftCmdExtractFields.java:122)
##          at org.snpsift.SnpSift.run(SnpSift.java:580)
##          at org.snpsift.SnpSift.main(SnpSift.java:76)

I think the intermediate vcf file has been deleted, before (or during) the next process starts to use it. This error happens on a NFS-based system on which the IO are a bit limited. I'm just wondering if an additional parameter to specify "after how many time, the intermediate file should be deleted" could be nice to avoid such issue.

Then, I did a real test using a pair of WES data (13.7Go each R1+R2 fastq file), and here, I should say that I was a bit disappointed because the gain in work space is not that high.

Here are the main steps of the pipeline ; mapping/BAM cleaning/GATK Mutect2/CNVs calling.

Here is a summary of the work space over time ;

The work reaches 100Go after all mapping post-processing steps. However, I'm a bit surprised that it was not almost completly clean up at the end of the pipeline.

Looking more carefully at the BAM files ... the pipeline generates ;

BAM files after bwa-mem
BAM files after MarkDup
BAM files after cleaning (duplicates, mapQ, etc.)
BAM files after BQSR

In practice, only the BAM files after MarkDup have been removed from the work.

>>find . -type f -name "*.bam"
./87/0315166cfca97935de47dbbcc990c4/BC2-0362-PXD_C_part1_hg38.bam
./45/619d03eaea4631cd6e3f6e05535376/BC2-0362-FIXT_T_part1_hg38.bam
./be/44f3273e49aa2c7fa999ae191325c7/BC2-0362-FIXT_T.filtered.bam
./09/6a3fe26f56ded56f5b1802a508b6b2/BC2-0362-PXD_C.filtered.bam
./c6/83094b4423e45d35729ed5518ad6d4/BC2-0362-FIXT_T.recal.bam
./df/e153ef0f160b440361f5576fdb449f/BC2-0362-PXD_C.recal.bam

Could you tell me more about when a given file should be deleted by the system ? In the coming days, I'll try to make additional test on another server with less IO latency to see if it has an impact. But please, let me know if you have any idea or additional test I can perform to help. Thanks Nicolas

bentsherman commented 8 months ago

Thanks Nicolas, this is exactly the kind of feedback I've been hoping for.

The basic strategy goes like this:

a file can be deleted when (1) all consumer tasks of the file are done and (2) the file has been published (if it needs to be published)
a task directory can be deleted when (1) all consumer tasks of any of its output files are done and (2) all of its outputs that need to be published have been published

Some caveats:

Files created by operators like splictCsv and collectFile cannot be tracked and will not be deleted. For optimal performance, use only processes to create/modify files (I have added several utilities to this plugin to make this as easy as possible)
The effectiveness of the cleanup is greatly impacted by how you write your pipeline. The more dependencies you add between processes, the harder it gets for the cleanup to delete files early.

If you can share your pipeline code, or some minimal example that reproduces the cleanup behavior, I can get a sense for how soon the files can be deleted. Similarly with the missing file error, I'd like to look at the pipeline code to see what's going on.

I can think of at least one way in which the cleanup is "playing" it safe and could likely be more aggressive...

Also I really like the disk usage plot you made, I recall we did a similar thing when we published our GEMmaker cleanup hack. Would you be willing to share that script so that I can add a generic version to this repo? It's exactly what we need to track the cleanup performance.

nservant commented 8 months ago

Hi @bentsherman Arff, I did not choose the best pipeline to share it. The related github project is broken ... however, I can share it as a tar.gz file if you wish ? Otherwise, I can run the same test on the ǹf-core-hicpipeline if it's easier ? Regarding the plot, I do not have any script ... I just run a simple for loop in parallel of the pipeline, which exec a du -h ./workevery 300sec. Then, put the results in a table and make a simple plot :)

bentsherman commented 8 months ago

I can work with whatever pipeline code you give me, though simpler is better. I'd really like to see the pipeline that produced that graph since that's the most concrete performance evaluation we have so far

nservant commented 8 months ago

I just send it to you by email (gmail address)

nservant commented 8 months ago

Just one comment. I realized that the only bams which were deleted are the ones which were published (the markdup bams). All intermediate bams which were not removed so far are those which are not published

bentsherman commented 7 months ago

Are those intermediate BAMs still specified as process outputs? I'm still reviewing the pipeline code, but it looks like the answer is yes.

If they aren't process outputs then the cleanup observer won't know about them and they won't be deleted until the task directory is deleted which could be much later. Of course you can just manually delete such files in the task script.

I think I have confirmed one of my suspicions though. Your pipeline follows the typical nf-core pattern of collecting tool versions from every process using multiqc, which makes multiqc a consumer of every process, which means the cleanup observer won't delete anything until the multiqc process has started. And every nf-core pipeline will have the same problem 😅

The cleanup observer just needs to be more fine-grained, it needs to consider dependencies between separate output channels separately. This way, the intermediate BAM files can be considered independently of the tool version metadata.

If my theory is correct, you should be able to remove the multiqc process and see much better cleanup results, assuming there aren't any other "high fan-in" processes which could cause the same effect.

nservant commented 7 months ago

Yes, all bams files are defined as process outputs. I'll make a test skipping the MultiQC process. I understand that the MultiQC process will be an issue for the task directory cleanup, but it should not affect the individual files deletion ? or did I miss something ?

bentsherman commented 7 months ago

The issue is not really multiqc, it's with the cleanup itself. I need to improve it so that the bam files can be handled independently of the multiqc logs.

I suggest disabling multiqc only as a way to test my theory, if you want to. My planned improvements should make the cleanup happen sooner regardless of multiqc

nservant commented 7 months ago

I run two tests without MultiQC, but the two times, I got the same error in one of the process

ac/06d2f9] process > identitoFlow:identitoPolym (BC2-0362-PXD_C)           [100%] 2 of 2 ✔
[a0/6f072d] process > identitoFlow:identitoCombine                          [100%] 2 of 2, failed: 2, retries: 1 ✘
...

Command exit status:
  1

Command output:
  (empty)

Command error:
  tail: cannot open ‘BC2-0362-PXD_C_matrix.tsv’ for reading: No such file or directory

Work dir:
/pipelines/vegan/cleanup/work_mqc/a0/6f072d281d3c22c75b7a39a015981e

The work folder of the process identitoCombine does not exist ... I think it has been removed before the job has finished. Of note, this process is very fast ... this is just a head|tail on a text file ...

bentsherman commented 7 months ago

Likely the cleanup observer is deleting the task directory too soon. I will investigate that as well

nservant commented 7 months ago

Hi @bentsherman A short update. I run two additional tests.

Removing the multiQCprocess. Almost no changes on the size of the work folder ... (31Go remaining at the end)
Publishing all BAM files ... which are now all removed from the work folder, leading to a 212Mo work space.

So my feeling is that there is something missing in the current version (or in my pipeline) to remove unpublished files. Thanks

bentsherman commented 7 months ago

Finally got some time to work on this today. I wrote a minimal pipeline to test my theory about multiqc. The pipeline looks like this:

flowchart TB
    subgraph " "
    v0["Channel.of"]
    end
    v1([BAM1])
    v2([BAM2])
    v3([BAM3])
    subgraph " "
    v4[" "]
    v8[" "]
    end
    v7([SUMMARY])
    v5(( ))
    v0 --> v1
    v1 --> v2
    v1 --> v5
    v2 --> v3
    v2 --> v5
    v3 --> v4
    v3 --> v5
    v5 --> v7
    v7 --> v8

And then I wrote some scripts to run the pipeline, track the disk usage, and produce a time series plot similar to yours above:

The sudden drop at the end is when the summary task starts, after which all of the intermediate bam files are deleted. This confirms my theory about needing to treat the output channels separately. It will require some refactoring but hopefully shouldn't take too long to implement.

I haven't figured out yet why your unpublished BAMs aren't being deleted. It doesn't happen with my test pipeline -- when I disable the summary task, the BAMs get deleted as soon as I would expect. So there might be some aspect of your pipeline that I haven't accounted for. I'll come back to this after I address the first issue.

bentsherman commented 7 months ago

@nservant I just released nf-boost 0.3.1 which includes an improved cleanup algorithm. It should delete the BAM files much sooner now. But let me know if you still see the issue with only published files being deleted.

nservant commented 7 months ago

great. I'll run new tests next week and let you know

nservant commented 7 months ago

@bentsherman, I was not able to run the 0.3.1 version. I do have a groovy error. Any idea ?

Apr-16 16:04:23.637 [Actor Thread 13] DEBUG nextflow.file.SortFileCollector - FileCollector temp dir not removed: null
Apr-16 16:04:23.638 [main] ERROR nextflow.cli.Launcher - @unknown
java.lang.NullPointerException: null
    at org.codehaus.groovy.runtime.DefaultGroovyMethods.collect(DefaultGroovyMethods.java:3686)
    at nextflow.boost.cleanup.CleanupObserver.onFlowBegin(CleanupObserver.groovy:141)
    at nextflow.Session$_notifyFlowBegin_closure22.doCall(Session.groovy:1083)
    at nextflow.Session$_notifyFlowBegin_closure22.call(Session.groovy)
    at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2357)
    at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2342)
    at org.codehaus.groovy.runtime.DefaultGroovyMethods.each(DefaultGroovyMethods.java:2383)
    at nextflow.Session.notifyFlowBegin(Session.groovy:1083)
    at nextflow.Session.fireDataflowNetwork(Session.groovy:490)
    at nextflow.script.ScriptRunner.run(ScriptRunner.groovy:246)
    at nextflow.script.ScriptRunner.execute(ScriptRunner.groovy:137)
    at nextflow.cli.CmdRun.run(CmdRun.groovy:372)
    at nextflow.cli.Launcher.run(Launcher.groovy:499)
    at nextflow.cli.Launcher.main(Launcher.groovy:670)

bentsherman commented 7 months ago

Which Nextflow version are you using? I tested with 23.10.1

bentsherman commented 7 months ago

I think there are some cases in your pipeline that I have not accounted for. So as to not waste your time, I will try to run your pipeline myself to make sure it handles everything correctly. I see you have a small test profile so I should have everything I need

nservant commented 7 months ago

Same error with 23.10.1 The pipeline requires annotation files that are expected to be locally available ... so you will not be able to run it until the end.

However, if I just run ;

nextflow run main.nf -profile test

I also have the same error ...

bentsherman commented 7 months ago

Thanks, that helped me track down the bug. I can now run it successfully, at least up to the point of needing the input data.

So that I don't spam the main plugin registry with patch releases, please use this environment variable for now to test the latest version of the plugin:

export NXF_PLUGINS_TEST_REPOSITORY="https://github.com/bentsherman/nf-boost/releases/download/0.3.2/nf-boost-0.3.2-meta.json"

Once we can run your pipeline with good cleanup behavior, I'll publish the final release.

nservant commented 7 months ago

To follow up, I rerun my test on the pair of WES data. The new version works very well. The intermediates BAMs are well removed, regardless their publication status. No issue with MultiQC.

Here is a quick comparaison of the performance (dash = results folder, plain= work folder)

nfboost_compare

bentsherman commented 7 months ago

Awesome! I will go ahead and publish 0.3.2

bentsherman / nf-boost

General feedbacks #1