TheJacksonLaboratory / splicing-pipelines-nf

Repository for the Anczukow-Lab splicing pipeline

Adds workdir cleanup option #238

Closed Vlad-Dembrovskyi closed 3 years ago

Vlad-Dembrovskyi commented 3 years ago

Description

This PR addresses issue #217, which asks for an option to clean up the work folder at the end of pipeline execution. For a large-scale run this folder often holds many gigabytes, sometimes terabytes, of disk space.

Note: Before merging current PR to dev branch, merge #239 to current branch. [done]

Solution

Luckily, Nextflow has a hidden, undocumented feature for cleaning up all temporary files, which Paolo Di Tommaso revealed in one of his nf-hacks. There was one problem with it: when our pipeline ran with the Docker profile, the cleanup failed with a non-descriptive Groovy error message, `Failed to cleanup work dir: ...` (the pipeline itself still worked fine). The reason was that Groovy couldn't delete files owned by the root user in the work folder. By default, every process that runs through a Docker container creates such files.

I solved the issue by adding a Docker config option that sets the user and group to those of the current user: `docker.runOptions = '-u $(id -u):$(id -g)'`. After that, the new cleanup option started working like a charm. It is exposed as the parameter `--cleanup` and can be used as `nextflow run ... --cleanup true`, or simply as a flag, `nextflow run ... --cleanup`. The Singularity profile has no user ownership issues, so it worked out of the box (tested in #239, which has to be merged into this PR before this one is merged to dev [done]).
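
As a rough illustration of what that run option evaluates to, the shell sketch below builds the same `-u` argument on the host. It assumes the command substitutions are expanded by the host shell when the container is launched, which is why files written inside the container end up owned by the launching user rather than root. The variable name `run_opts` is made up for this example.

```shell
# Build the same user:group argument that the config line passes to Docker.
# Assumption: the substitutions are expanded on the host, so the container
# process runs with the numeric IDs of whoever launched the pipeline.
run_opts="-u $(id -u):$(id -g)"
echo "docker run options: $run_opts"
```

The output is the literal flag Docker receives, with your own numeric user and group IDs filled in.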

Note: this option cleans the workdirs of all processes, but doesn't clean the staged files in the stage folder. Even so, this is already a tremendous saving in disk space.

To test

To test that the new feature actually works, run the pipeline in ultra quick test mode with and without the --cleanup option, and compare the work folder sizes in both cases.

  1. Clone the repo:
    
    git clone https://github.com/TheJacksonLaboratory/splicing-pipelines-nf
    cd splicing-pipelines-nf
    git checkout adds-workdir-cleanup-option

2. Run without `--cleanup` option, default:

    nextflow run . -profile ultra_quick_test,docker
    du -sh work
    rm -r work

![image](https://user-images.githubusercontent.com/64809705/125580793-05ede512-25ab-450a-b1dc-c197e88ad65e.png)

3. Run with `--cleanup` option:

    nextflow run . -profile ultra_quick_test,docker --cleanup
    du -sh work
    du -sh work/*

![image](https://user-images.githubusercontent.com/64809705/125581503-b85eb123-e682-4b3c-9673-fe7e4e0fb625.png)

4. Test same with Singularity:
(requires #239 to be merged to current PR first [done]):

    nextflow run . -profile ultra_quick_test,singularity_local --cleanup
    du -sh work
    du -sh work/*
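
To put numbers on the comparison instead of eyeballing `du -sh` output, a small helper like the one below could be used. This is only a sketch: the function name `work_size_kb` is hypothetical and not part of the pipeline, and the usage lines in the comments assume the same test commands as in the steps above.

```shell
# Report the size of a directory in kilobytes (defaults to ./work).
work_size_kb() {
    du -sk "${1:-work}" | cut -f1
}

# Hypothetical usage around the two test runs:
#   nextflow run . -profile ultra_quick_test,docker
#   before=$(work_size_kb work); rm -r work
#   nextflow run . -profile ultra_quick_test,docker --cleanup
#   after=$(work_size_kb work)
#   echo "without cleanup: ${before} KB, with cleanup: ${after} KB"
```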


## Limitations
1. The `cleanup` feature does not clear the staged files, only the processes' workdirs.
![image](https://user-images.githubusercontent.com/64809705/125583801-fcf0284e-91f5-4b2e-8be7-3e66b1e27d5d.png)

2. The cleanup is only executed when the pipeline completes successfully. If the pipeline fails or is interrupted midway, no intermediate files are deleted. On the one hand, this allows safe simultaneous use of the `-resume` and `--cleanup` options; on the other hand, the workdir keeps growing until the pipeline finally completes successfully. Once it does, only the workdir folders created during the latest successful run are cleared. Any cached folders from previous failed runs that were resumed are not cleared by this cleanup option.
Failed run does not clear the workdirs:
![image](https://user-images.githubusercontent.com/64809705/125583898-43a5c97f-e04c-49d7-b96e-53067f98935d.png)
Next successful run resumed from the previous one only clears the newly created folders:
![image](https://user-images.githubusercontent.com/64809705/125583997-ed337846-4cdf-4371-97ae-b6b9854ec2fa.png)
![image](https://user-images.githubusercontent.com/64809705/125584041-9aee4e31-ebee-4add-9bb4-114483d36393.png)

3. So far the feature has only been tested with the `docker` and `singularity` profiles (see #239). It has not been tested in an HPC environment yet.
In theory, the `cleanup` option should also work in all other cases that don't change intermediate file ownership (as Docker did with `root`), but until we test it we can't guarantee that. In the worst case Nextflow will not be able to clean the workdir and will print a WARN message, `Failed to cleanup work dir: ...`; the pipeline won't fail, and results will be saved as if the cleanup option wasn't enabled at all.
![image](https://user-images.githubusercontent.com/64809705/125585319-8c2ffc32-9fa5-4618-b1d7-c6fcac074964.png)
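
Since the failure described here comes down to file ownership, one way to check a new environment before relying on cleanup is to count entries in the work folder that belong to someone else. The sketch below assumes a POSIX `find`; the function name `count_foreign_files` is made up for illustration.

```shell
# Count entries under a directory that are NOT owned by the current user.
# A non-zero count hints that cleanup may hit permission errors there.
count_foreign_files() {
    find "${1:-work}" ! -user "$(id -u)" | wc -l | tr -d ' '
}

# Hypothetical usage after a run without --cleanup:
#   count_foreign_files work
```

A result of 0 suggests the current user owns everything under the folder, which is the situation the Docker run option was added to guarantee.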

So, current PR requires a test in an HPC environment with profiles `sumner` and `ultra_quick_test` before merging to `dev`:

    git clone https://github.com/TheJacksonLaboratory/splicing-pipelines-nf
    cd splicing-pipelines-nf
    git checkout adds-workdir-cleanup-option
    nextflow run . -profile ultra_quick_test,sumner --cleanup
    du -sh work/*

Vlad-Dembrovskyi commented 3 years ago

This is a really interesting PR and a good learning experience, as it uses a hidden part of Nextflow. Thank you for digging deep to make it work 🦾

Just a suggestion: can we have a message in the workflow.onComplete scope to let the user know that the work directory was cleaned after workflow completion? When running on HPC (where this functionality will mostly be used), stdout is written to a file that is checked at a later stage. Having this message in that stdout file will make it clearer to the user what happened after the job ran.

Also, as a bonus, it would be great if the message could point to the work directory path ($workflow.workDir).

@sk-sahu Added with latest commit.

To test the success notification, just run the example command with the ultra_quick_test and docker profiles. To test the failed-pipeline message, introduce an error into the last step (for example, touch a file in a non-existing folder: touch ghtjf/vsuyr) and run the same command.

angarb commented 3 years ago

@Vlad-Dembrovskyi one suggestion - we rarely run the pipeline from the command line. We generally submit a standard main.pbs and a config file. Is there a way to add the cleanup param to NF_splicing_pipeline.config?

Vlad-Dembrovskyi commented 3 years ago

Hi @angarb. For sure you can add this cleanup option to any config you use to run the pipeline. For that you just need to add this line to the very end of the config that you use:

cleanup = true

I can't find NF_splicing_pipeline.config in this repository, so I assume it's a config you use in your working environment, and you have to add the line yourself. But as I said, it is as easy as copy-pasting the line above to the end of the config. That should work :) Having this line in a config enables the cleanup. Let me know if that works for you.

adeslatt commented 3 years ago

@angarb it would be good to add this configuration file NF_splicing_pipeline.config to the GitHub repository - as the sumner configuration requirements.

angarb commented 3 years ago

Hi @adeslatt we do have this config in the repository - the example is here (splicing-pipelines-nf/conf/examples/MYC_MCF10A_0h_vs_MYC_MCF10A_8h.config) and parameter descriptions here (https://github.com/TheJacksonLaboratory/splicing-pipelines-nf/blob/master/docs/usage.md#all-available-parameters)

@PhilPalmer had outlined these steps when updating a parameter:

  1. Add them to the nextflow.config file with a default value
  2. Add them to main.nf:
     - Add them to the helpMessage
     - Add them to the log
     - Add them to the required process script
  3. Add them to docs/usage

@Vlad-Dembrovskyi - should I just add cleanup to these param lists?

Vlad-Dembrovskyi commented 3 years ago

@angarb I didn't know the MYC_MCF10A_0h_vs_MYC_MCF10A_8h.config file is the same as NF_splicing_pipeline.config which @adeslatt referred to, sorry.

Anyhow, yes: you can add the --cleanup parameter, with the description from here, to the all-available-parameters section, under the --debug option.

And you can also add it as cleanup = true to your config file, but important: not inside the params scope, but outside of it. Just add it as the last line of your config file as if there was nothing else in your config file. It is an isolated standalone nextflow option when you provide it with a config file.
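
To make the placement concrete, here is a minimal shell sketch that writes a hypothetical config file with `cleanup = true` as a standalone last line, outside the `params` scope. The file name `cleanup_example.config` and the empty `params` block are placeholders, not the real config from the repository; only the `cleanup = true` line comes from this thread.

```shell
# Write a hypothetical config file; only the last line comes from the thread.
config_file="cleanup_example.config"
cat > "$config_file" <<'EOF'
params {
    // existing pipeline parameters stay here
}

// Top-level Nextflow option, placed outside the params scope
cleanup = true
EOF

# Show the resulting file
cat "$config_file"
```

Assuming your setup passes the config with Nextflow's `-c` option, the run command itself would not change.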