ctmrbio / BACTpipe

BACTpipe: An assembly and annotation pipeline for bacterial genomics
https://bactpipe.readthedocs.org
MIT License

Add cleanup directive to config file to automatically remove work dir #105

boulund closed this issue 3 years ago

boulund commented 5 years ago

According to https://github.com/nextflow-io/nf-hack18/issues/3 there is an undocumented feature to remove the work dir.
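For reference, a minimal sketch of what such a directive could look like in nextflow.config. It assumes the feature discussed in the linked issue is Nextflow's top-level cleanup setting; the thread itself does not show the syntax.

```groovy
// nextflow.config (sketch; assumes the top-level `cleanup` option
// is the feature the linked issue refers to)

// Delete the contents of the work directory automatically
// once a run has completed successfully.
cleanup = true
```

Two consequences worth keeping in mind: once the work directory is gone, the run can no longer be resumed with -resume, and any outputs published as symlinks into the work dir would be left dangling.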

thorellk commented 3 years ago

@abhi18av, can you confirm this? If this is the case, it would be great to implement this to save space if the pipeline has completed without errors.

boulund commented 3 years ago

In another Nextflow workflow that we use, we hardlinked all output files to reduce disk space usage. Once everything is done, one can simply remove the work directory without any risk of deleting files in the published directories.
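A hedged sketch of the hardlink approach as a config fragment. The 'results' path is a placeholder and not BACTpipe's actual setting, and applying it to every process at once is only for illustration.

```groovy
// nextflow.config (sketch; 'results' is a placeholder path)
process {
    // Publish outputs as hard links rather than symlinks or copies.
    // The published file and the file in the work dir share the same
    // data on disk, so no extra space is used.
    publishDir = [path: 'results', mode: 'link']
}
```

Because a hard link is just a second name for the same data, deleting the work dir afterwards leaves the published files intact. Hard links only work when the work dir and the publish dir are on the same file system.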

abhi18av commented 3 years ago

In general, relying on undocumented features might tie us to a specific version and leave us with technical debt overall. Nextflow development moves quite fast.

I think a simpler alternative could be to switch the publish mode to move rather than copy or link. In this case, as soon as a process exits successfully, it'll move the output of that process to the specified publishDir.
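As a rough sketch, switching the publish mode for a single process could look like the fragment below. FINAL_REPORT is a hypothetical process name and 'results' a placeholder path; neither comes from BACTpipe.

```groovy
// nextflow.config (sketch; process name and path are hypothetical)
process {
    withName: 'FINAL_REPORT' {
        // Move outputs out of the work dir into the publish dir
        // instead of linking or copying them.
        publishDir = [path: 'results', mode: 'move']
    }
}
```

The Nextflow documentation describes move as intended for terminating processes only, meaning processes whose outputs are not consumed by any downstream process, since the moved files no longer exist in the work dir.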

What do you guys think?

boulund commented 3 years ago

Are there any interaction effects with the scratch directive that we need to consider then? I'm thinking of cluster environments where the scratch dir could be on a node-local disk and the publishDir on a shared network file system. (Not that BACTpipe produces very large output files, so it's not likely to become a huge problem; people misbehave on these systems all the time anyway :) )
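To make the scenario concrete, here is a sketch of the combination being asked about. The process name and the shared path are placeholders, and the description of the file flow is my understanding of the documented behaviour rather than something tested against BACTpipe.

```groovy
// nextflow.config (sketch; process name and path are placeholders)
process {
    // Run each task in a node-local temporary directory
    // ($TMPDIR if it is set on the node).
    scratch = true

    withName: 'FINAL_REPORT' {
        // Publish to a directory on the shared network file system.
        publishDir = [path: '/shared/project/results', mode: 'move']
    }
}
```

With scratch enabled, only the files declared as process outputs are copied from the node-local directory back to the task's work dir when the task finishes, and publishing happens from there. If the work dir and the publish dir sit on different file systems, a move is physically a copy followed by a delete, so it saves space but not I/O.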

abhi18av commented 3 years ago

I've not worked extensively with HPC systems, but conceptually I believe they are similar to AWS Batch or Azure Batch environments (multiple synced machines), in which case publishDir "xyz", mode: "move" should work :)

In any case, as you mentioned, this isn't a huge pipeline, so we can iterate a couple of times and finalize.

thorellk commented 3 years ago

I guess this is still not solved/agreed on, right?

boulund commented 3 years ago

I'm not sure what I think about this anymore. Moving output files to publishDirs is a convenient solution that makes it easy for users to just delete the work dir when they're happy with a run. However, I think it can make troubleshooting a bit more difficult, as the output files are no longer present in their work dirs... 🤔

I rarely run BACTpipe in environments where I'm super constrained disk space wise, so the disk space arguments don't really apply to me. Also, it's nice sometimes to find a large old work dir to delete (wow, free disk space!) ;)

thorellk commented 3 years ago

I recognise that feeling ;) I am fine with leaving it as it is for now.

boulund commented 3 years ago

Shall we close this for now then, and reopen or create a new issue if we want to raise this in the future?

thorellk commented 3 years ago

Let's do so!