carpentries-incubator / snakemake-novice-bioinformatics

Introduction to Snakemake for Bioinformatics
https://carpentries-incubator.github.io/snakemake-novice-bioinformatics

Motivation for countreads rule #59

Open tbooth opened 1 month ago

tbooth commented 1 month ago

From @cmeesters

Major Issue: starting with a read-counting rule does not pay off here, because nothing motivates it. The rule is not necessary and has no scientific connection to the DAG, so I consider this a (non-severe) breach of didactics 101.

tbooth commented 1 month ago

A plausible motivation for counting the reads both before and after trimming is to see how many reads the trim discards. As it stands, we get close to this, but in the middle of ep03, having successfully chained the trimreads and countreads rules, we pivot and start adding Kallisto rules. In ep04, the read counts are presented as an output of the workflow and we talk about the DAG concepts. Later, we present FastQC as a replacement for the countreads rule, since it counts the reads and does a lot more besides, and the old rule is dropped from the final workflow.

I don't think it's unreasonable to assume that a bioinformatician cares how many reads they are working with, but the story as it stands is pretty disjointed. How can we fix this?

Idea 1: Forget about counting reads and incorporate FastQC right away. I don't like this idea, since FastQC produces two output files and raises other issues that are dealt with in ep06. Using the tool wrapper would make the rule easier to write, but it brings in the whole concept of wrappers, which we are not yet ready for.

Idea 2: Finish the story by adding a count_discarded rule. Rather than introducing Kallisto in ep04, we could finish the episode by adding a rule to subtract the numbers and tell us explicitly how many reads were discarded. This introduces the concept/syntax of a rule having two inputs, shortens the too-long ep03, and also gives a reasonably complex DAG which can maybe then be used to cover all the points in ep04. Then we'd only add Kallisto after ep05 (probably inserting a new ep06 to do so and moving everything else back).
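
For illustration, a count_discarded rule with two inputs might look roughly like the sketch below. The file name patterns and the `sample` wildcard are hypothetical placeholders rather than the names used in the lesson, and the shell command assumes each `.count` file holds a single integer.

```snakemake
# Sketch only: file names and the {sample} wildcard are hypothetical.
rule count_discarded:
    input:
        # Two named inputs, referred to below as {input.untrimmed} and {input.trimmed}
        untrimmed = "{sample}.fq.count",
        trimmed = "trimmed.{sample}.fq.count"
    output:
        "{sample}.discarded.count"
    shell:
        # Assumes each .count file contains one integer; subtract trimmed from untrimmed
        "echo $(( $(cat {input.untrimmed}) - $(cat {input.trimmed}) )) > {output}"
```

Naming the inputs keeps the subtraction readable and avoids relying on input order, which seems like a gentle way to introduce multi-input rules.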

Idea 2 has some appeal, but it's a big change to this part of the course. There are also some downsides. It delays the introduction of any "real" bioinformatics tools that will supposedly motivate our bioinformaticians. And subtracting the numbers is so quick and trivial that it seems a bit silly to use this example to talk about the advantages of Snakemake's lazy evaluation; making the Kallisto index is appreciably slower.

I'll park this for now and come back to it.