carpentries-incubator / snakemake-novice-bioinformatics

Introduction to Snakemake for Bioinformatics
https://carpentries-incubator.github.io/snakemake-novice-bioinformatics
Other
18 stars 9 forks source link

Motivation for countreads rule #59

Open tbooth opened 4 months ago

tbooth commented 4 months ago

From @cmeesters

Major Issue: here it does not pay out to start with a counting reads rule: there is no motivation to do so. It is not necessary and there is no scientific connection to the DAG. So, I consider this a - non-severe - breach of didactic 101.

tbooth commented 4 months ago

A plausible motivation for counting the reads both pre- and post- trimming is to see how many reads get discarded by the trim. But as it stands, we get close to this but in the middle of ep03, having successfully chained the trimreads and countreads rules, we then pivot and start adding Kallisto rules. In ep04, the read counts are then presented as an output of the workflow and we talk about the DAG concepts. Later, we present FastQC as taking the place of the countreads rule since it counts the reads and a lot more besides, and the old rule is discarded from the final workflow.

I don't think it's unreasonable to assume that a bioinformatician cares how many reads they are working with, but the story as it stands is pretty disjointed. How can we fix this?

Idea 1: Forget about counting reads and incorporate FastQC right away. I don't like this idea since FastQC produces two output files and has other issues dealt with in Ep06. Using the tool wrapper makes the rule easier to write, but then brings in the whole concept of wrappers which we are not yet ready for.

Idea 2: Finish the story by adding a count_discarded rule. Rather than introducing Kallisto in ep04, we could finish the episode by adding a rule to subtract the numbers and tell us explicitly how many reads were discarded. This introduces the concept/syntax of a rule having two inputs, shortens the too-long ep03, and also gives a reasonably complex DAG which can maybe then be used to cover all the points in ep04. Then we'd only add Kallisto after ep05 (probably inserting a new ep06 to do so and moving everything else back).

Idea 2 has some appeal, but it's a big change to this part of the course. Also, there are some downsides - it delays having any "real" bioinformatics tools that will supposedly motivate our bioinformaticians, and also the process of subtracting the numbers is so quick and trivial it seems a bit silly to talk about the advantages of Snakemake's lazy evaluation based on this example - making the Kallisto index is appreciably slower.

I'll park this for now and come back to it.

tbooth commented 1 month ago

I implemented something like "Idea 2" with a calculate_difference rule to give some closure/justification to the read counting rule. This allows me to keep ep03 more manageable by splitting it in half while still covering the syntax for multiple inputs. The new rule provides a simpler example of a rule with multiple inputs, without muddying the waters and introducing bioinformatics tools.

The new ep04 is basically the back half of the old ep03 but I now properly introduce log outputs. Possibly the use of log: should now be reiterated later in the course but this should be a small change.

I decided to keep the old ep05 (which was ep04) "The DAG" as-is, using the Kallisto examples to illustrate the theory. The downside is this makes things a little disjointed, as we introduce the calculate_difference rule in ep03 then never use it again. Re-writing ep05 as suggested above, and then switching it around with ep04, could be nice, but would be a major undertaking and may well introduce more ugliness in practise. I believe this approach here deals with the major problems.