Add more explanation to caching behaviour with and without using '-resume'

carpentries-incubator / workflows-nextflow

Workflow management with Nextflow and nf-core

https://carpentries-incubator.github.io/workflows-nextflow/


Add more explanation to caching behaviour with and without using '-resume' #56

Closed kerimoff closed 2 months ago

kerimoff commented 2 years ago

In the training we had recently, there was a question:

If the hashed path under the work directory is calculated from the following:

Input values

Input files

Command line string

Container ID

Conda environment

Environment modules

Any executed scripts in the bin directory

then how come the calculated hashes are different when we run the same pipeline twice without -resume?

I can speculate that, when run without -resume, Nextflow also takes some other value into account (e.g. the current time), but I do not know for sure and could not easily find the answer. It may be a good idea to add a section on this to avoid confusion.

mahesh-panchal commented 2 years ago

Without -resume Nextflow doesn't check anything. It just runs the workflow start to finish. Checkpointing is only used when using -resume.

kerimoff commented 2 years ago

Yes, but it still generates a uniquely named folder under the work directory. How is that folder name generated?

mahesh-panchal commented 2 years ago

This:

How is the hash calculated on input files?

The hash provides a convenient way for Nextflow to determine if a task requires recomputation. For each input file, the hash code is computed with:

The complete file path

The file size

The last modified timestamp

Therefore, even just performing a touch on a file will invalidate the task execution.

https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html
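To illustrate the quoted point, here is a minimal Python sketch of a metadata-based file fingerprint. It is not Nextflow's actual code: the function names `file_fingerprint` and `demo`, the `|` separator, and the use of MD5 are all assumptions made for illustration. It only shows why a hash built from path, size, and modification time changes after a touch even though the file contents are identical.

```python
import hashlib
import os
import tempfile
from typing import Tuple


def file_fingerprint(path: str) -> str:
    """Hash a file's metadata (absolute path, size, mtime), not its contents.

    Hypothetical helper: MD5 and the '|' separator are illustrative choices.
    """
    st = os.stat(path)
    key = f"{os.path.abspath(path)}|{st.st_size}|{st.st_mtime_ns}"
    return hashlib.md5(key.encode("utf-8")).hexdigest()


def demo() -> Tuple[str, str, str]:
    """Return fingerprints: initial, repeated on the unchanged file, after a touch."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "reads.fastq")
        with open(path, "w") as handle:
            handle.write("ACGT\n")

        before = file_fingerprint(path)
        repeat = file_fingerprint(path)

        # Emulate `touch`: move only the modification time forward by 10 seconds.
        st = os.stat(path)
        os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 10_000_000_000))
        after = file_fingerprint(path)
    return before, repeat, after


if __name__ == "__main__":
    before, repeat, after = demo()
    print("unchanged file gives the same fingerprint:", before == repeat)
    print("touched file gives a new fingerprint:", before != after)
```

Because the contents are never read, this kind of check is cheap for large files, at the cost of treating a touched but unmodified file as changed.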

kerimoff commented 2 years ago

Thanks, @mahesh-panchal,

I think I did not communicate the question very well. I did check both the material we have and the blog posts about it. But imagine a scenario where all three of these values are the same:

The complete file path

The file size

The last modified timestamp

If -resume is not used and these three values are the only ones used to generate the name/hash of the task directory under the work directory, then running the pipeline twice (again without -resume) should generate the same hash. But it does not.

Am I missing anything here?

mahesh-panchal commented 2 years ago

Ah, I misunderstood (a little sleep deprived). The hash should also be derived from the session ID: https://github.com/nextflow-io/nextflow/blob/62ded342197ba7497d18b3f9d2d6d876547ce619/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy#L1977-L2023
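A minimal Python sketch of that idea follows. It is not the code from the linked TaskProcessor.groovy: `task_hash`, `work_dir`, the choice of MD5, and the example process are hypothetical stand-ins. It assumes only that a per-run session ID is mixed into the task hash alongside the other inputs, and it mimics the `work/xx/yyyy...` directory layout.

```python
import hashlib
import uuid
from typing import Iterable


def task_hash(session_id: str, process_name: str, script: str, inputs: Iterable[str]) -> str:
    """Combine the session ID with the task's own properties into one digest.

    Hypothetical stand-in for the real hashing; MD5 is used for illustration only.
    """
    hasher = hashlib.md5()
    for part in (session_id, process_name, script, *inputs):
        hasher.update(part.encode("utf-8"))
        hasher.update(b"\x00")  # separator so adjacent parts cannot run together
    return hasher.hexdigest()


def work_dir(digest: str) -> str:
    """Split a digest into a two-character prefix directory plus the remainder."""
    return f"work/{digest[:2]}/{digest[2:]}"


# An identical task definition, used for every run below.
TASK = {
    "process_name": "FASTQC",
    "script": "fastqc sample.fq",
    "inputs": ["sample.fq"],
}


def fresh_run() -> str:
    """A run without -resume: assumed to start a brand-new session ID."""
    return task_hash(str(uuid.uuid4()), **TASK)


def run_in_session(session_id: str) -> str:
    """A run that reuses an existing session ID."""
    return task_hash(session_id, **TASK)


if __name__ == "__main__":
    print(work_dir(fresh_run()))
    print(work_dir(fresh_run()))  # different directory, same task definition

    session = str(uuid.uuid4())
    print(run_in_session(session) == run_in_session(session))  # same session, same hash
```

Under these assumptions, two runs of an unchanged pipeline without -resume land in different work directories simply because each run has its own session ID. If -resume reuses the earlier run's session ID, as the Nextflow documentation appears to describe, the same task would hash to the same value again.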

kerimoff commented 2 years ago

Thanks Mahesh.

However, if the unique session ID is used together with the information listed below when the hash is calculated, then we cannot get the same hash when running with -resume unless the session ID is also the same. I hope I am the only one who is confused.

Input values
Input files
Command line string
Container ID
Conda environment
Environment modules
Any executed scripts in the bin directory

I will check the code you have linked as soon as I can. Until then, if anyone has a clear understanding of how it works, please try to explain it step by step here, or maybe add it to the material.

mahesh-panchal commented 2 years ago

My guess at the moment is that -resume triggers a check of the previous log, and a new task ID is only created when a new task needs to be run. I think the directory hash and the one actually used for resuming are two separate things.

mahesh-panchal commented 2 years ago

Ask in Slack. Perhaps ping Evan.