Without -resume, Nextflow doesn't check anything. It just runs the workflow start to finish. Checkpointing is only used when using -resume.
Yes, but it is still generating a uniquely named folder under the work directory. How is that folder name generated?
This:
How is the hash calculated on input files?
The hash provides a convenient way for Nextflow to determine if a task requires recomputation. For each input file, the hash code is computed with:
- The complete file path
- The file size
- The last modified timestamp
Therefore, even just performing a touch on a file will invalidate the task execution.
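To make the three values listed above concrete, here is a minimal Python sketch of a metadata-only fingerprint. It is an illustration, not Nextflow's code: the function name `file_fingerprint` is made up, and SHA-256 is only a stand-in for whatever hash function Nextflow uses internally. It shows why a touch alone is enough to change the hash even though the file content is untouched.

```python
import hashlib
import os
import tempfile


def file_fingerprint(path: str) -> str:
    """Hash a file's absolute path, size, and mtime (never its content).

    Hypothetical helper: SHA-256 is a stand-in, not Nextflow's real hash.
    """
    info = os.stat(path)
    key = f"{os.path.abspath(path)}|{info.st_size}|{info.st_mtime_ns}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "w") as handle:
        handle.write("ACGT\n")

    before = file_fingerprint(sample)

    # Emulate `touch`: push the modification time forward by one second
    # while leaving the path, the size, and the content unchanged.
    info = os.stat(sample)
    os.utime(sample, ns=(info.st_atime_ns, info.st_mtime_ns + 1_000_000_000))

    after = file_fingerprint(sample)
    print("fingerprint changed after touch:", before != after)
```

Because only metadata goes into the key, two files with identical content but different paths also get different fingerprints in this sketch.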
https://www.nextflow.io/blog/2019/demystifying-nextflow-resume.html
Thanks, @mahesh-panchal,
I think I did not communicate the question very well. I did check both the material we have and the blog posts about it. But imagine a scenario where all three of these values are the same:
- The complete file path
- The file size
- The last modified timestamp
If -resume is not used, and these three values are the only values used to generate the name/hash of the task under the work directory, then running the pipeline twice (again without -resume) should generate the same hash. But it does not.
Am I missing anything here?
Ah, I misunderstood (a little sleep deprived). The hash should also come from the session id: https://github.com/nextflow-io/nextflow/blob/62ded342197ba7497d18b3f9d2d6d876547ce619/modules/nextflow/src/main/groovy/nextflow/processor/TaskProcessor.groovy#L1977-L2023
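A rough Python sketch of what that implies, assuming the session id is simply one more key mixed into the task hash alongside the per-file values. The names `task_hash` and `work_dir_for`, the use of SHA-256, and the placeholder fingerprint strings are all illustrative and not taken from the linked Groovy code. As far as I understand, a run without -resume gets a fresh session id, while -resume reuses the session id of the previous run, which is what lets the hashes line up again.

```python
import hashlib
import uuid
from typing import Sequence


def task_hash(session_id: str, process_name: str,
              input_fingerprints: Sequence[str]) -> str:
    """Combine the session id with the task's other keys (illustrative only)."""
    hasher = hashlib.sha256()
    for part in [session_id, process_name, *input_fingerprints]:
        hasher.update(part.encode("utf-8"))
        hasher.update(b"\0")  # separator so adjacent keys cannot run together
    return hasher.hexdigest()


def work_dir_for(hash_hex: str) -> str:
    """Mimic the work/<2 chars>/<rest> layout of task directories."""
    return f"work/{hash_hex[:2]}/{hash_hex[2:32]}"


# Same process and same (placeholder) input fingerprints in every run.
inputs = ["fingerprint-of-sample_1.fq", "fingerprint-of-sample_2.fq"]

first_session = str(uuid.uuid4())   # run 1, no -resume
second_session = str(uuid.uuid4())  # run 2, no -resume: new session id

run1 = task_hash(first_session, "ALIGN", inputs)
run2 = task_hash(second_session, "ALIGN", inputs)
resumed = task_hash(first_session, "ALIGN", inputs)  # session id reused

print(work_dir_for(run1))
print(work_dir_for(run2))
print("fresh run matches run 1:  ", run2 == run1)
print("resumed run matches run 1:", resumed == run1)
```

In this sketch the two fresh runs land in different directories despite identical inputs, while the run that reuses the first session id reproduces the first hash exactly.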
Thanks Mahesh.
However, if the unique sessionID is used together with the file information already listed when the hash is calculated, then we cannot get the same hash when running with -resume unless the sessionID is also the same. I hope I am the only one who is confused
I will check the code you have linked as soon as I can, but until then, if anyone has a clear understanding of how it works, please try to explain it step by step here, or maybe add it to the material.
My guess at the moment is that -resume triggers checking the previous log, and the new task ID is only created when a new task needs to be run. I think the directory hash and the one actually used for resuming are two separate things.
Ask in slack. Perhaps ping Evan.
In the training we had recently, there was a question:
I can speculate that if we run it without -resume, then Nextflow probably also takes some other value into account (e.g. the current time), but I do not know for sure and could not easily find the answer. It may be a good idea to add a section to avoid confusion.