carpentries-incubator / snakemake-novice-bioinformatics

Introduction to Snakemake for Bioinformatics
https://carpentries-incubator.github.io/snakemake-novice-bioinformatics

Reconsider Putting output before input #46

Open tbooth opened 1 year ago

tbooth commented 1 year ago

From @jdblischak

You have the learners write the output field before the input field. And your motivation is that it is natural to work backwards when writing a Snakefile, eg:

Rather than listing steps in order of execution, you are always working backwards from the final desired result. The order of operations is determined by applying the pattern matching rules to the filenames, not by the order of the rules in the Snakefile.

This logic of working backwards from the desired output is why we’re putting the output lines first in all our rules - to remind us that these are what Snakemake looks at first!

I am not a fan of this approach for two main reasons:

1. Pretty much any other Snakefile they encounter or tutorial they read will list input before output. As a concrete example, see the official Snakemake tutorial. Having them write their Snakefiles differently from everyone else adds unnecessary cognitive load.

2. While it's true that Snakemake works backwards just like Make does, and it's important for learners to understand this mental model, I don't think it is necessary for a Snakemake user to design their pipeline backwards. I always develop my Snakemake pipelines one rule at a time, in the forward order. While I have a vague sense of my final result, there are too many unknowns along the way. Inevitably I'll run into something frustrating like mismatched chromosomes between my sequencing files and the reference files, and have to add a rule to fix this. In other words, I've never been able to follow your first step to "Define rules for all the processing steps". And even your lesson goes in the forward order, starting with trimming and counting before then adding rules for indexing and mapping.

So like I said above, I don't think you need to change your lesson. But I would recommend adding some boxes, eg:

box: We recommend listing output before input to remind yourself how Snakemake processes the rules, but note that this is our personal preference. Most other Snakefiles you see will list input first.

box: You can also build your pipeline one step at a time in the forward direction. Just make sure to always keep in mind that Snakemake processes the rules backwards.
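To make the "processes the rules backwards" point concrete, here is a toy Python sketch of the idea. It is not Snakemake's actual implementation: the rule names, file names, and the `match` and `plan` helpers are all hypothetical, and only a single `{sample}` wildcard is supported. The rules are deliberately listed last step first, to show that the execution order comes from matching filenames and not from the order in the file.

```python
import re

# Hypothetical rules, deliberately listed with the final step first.
RULES = [
    {"name": "count_reads", "output": "{sample}.count", "input": "{sample}.trimmed.fq"},
    {"name": "trim_reads", "output": "{sample}.trimmed.fq", "input": "{sample}.fq"},
]

# Files assumed to exist already (hypothetical).
SOURCE_FILES = {"ref1.fq"}


def match(pattern, filename):
    """Return the wildcard values if filename fits pattern, else None."""
    regex = re.escape(pattern).replace(r"\{sample\}", r"(?P<sample>.+)")
    found = re.fullmatch(regex, filename)
    return found.groupdict() if found else None


def plan(target):
    """Work backwards from target and return rule names in execution order."""
    if target in SOURCE_FILES:
        return []  # nothing to do, the file is already there
    for rule in RULES:
        wildcards = match(rule["output"], target)
        if wildcards is not None:
            needed = rule["input"].format(**wildcards)
            # Whatever produces the input must run before this rule.
            return plan(needed) + [rule["name"]]
    raise FileNotFoundError(f"no rule produces {target}")


print(plan("ref1.count"))  # ['trim_reads', 'count_reads']
```

Swapping the two entries in `RULES` gives the same plan, which is the sense in which a pipeline can be written in the forward direction while still being resolved backwards.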

tbooth commented 4 months ago

It seems nobody likes my output/input/shell ordering. Per @cmeesters:

putting output before input is syntactically correct, but violates the de-facto standard we use for workflows. It should not be introduced as good practice.

Lots of things that are common practice are bad practice, but it seems that I'm overruled here. I guess I'll look to switch the code around.

Going back to the comments from @jdblischak, there's a conflation here between the order in which rules are added in the workflow design and the order of output/input/shell within a single rule. I completely agree that Snakemake users should not be expected to "design their pipeline backwards", but this has nothing to do with how a rule is written. Snakemake DOES evaluate rules in the order of output, then input, then shell. This is not a design choice; it's a simple fact about how Snakemake works.

I've recently answered a bunch of Snakemake questions on Stack Overflow, and almost half of them are from people who are struggling because they have not grasped this fundamental idea!
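As a rough illustration of the output, then input, then shell order within one rule, here is a toy Python sketch. It is a simplified model under stated assumptions, not Snakemake's real code: the rule, the file names, and the `resolve` helper are hypothetical, and only one `{sample}` wildcard is handled.

```python
import re

# Hypothetical rule, written in the output/input/shell order discussed above.
RULE = {
    "output": "{sample}.count",
    "input": "{sample}.fq",
    "shell": "wc -l {input} > {output}",
}


def resolve(rule, target):
    """Turn a requested filename into a concrete job, in three steps."""
    # 1. output: match the target against the output pattern to get wildcards.
    regex = re.escape(rule["output"]).replace(r"\{sample\}", r"(?P<sample>.+)")
    found = re.fullmatch(regex, target)
    if found is None:
        return None  # this rule cannot make the target
    wildcards = found.groupdict()
    # 2. input: fill the wildcard values into the input pattern.
    input_file = rule["input"].format(**wildcards)
    # 3. shell: only now can the command be filled in.
    command = rule["shell"].format(input=input_file, output=target)
    return {"wildcards": wildcards, "input": input_file, "command": command}


job = resolve(RULE, "ref1.count")
print(job["command"])  # wc -l ref1.fq > ref1.count
```

In this model the input cannot be worked out before the output has been matched, because the wildcard values come from the output pattern. That dependency holds whichever way round the fields are written in the rule.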