Open tbooth opened 4 months ago
The choice of fastx and Velvet is deliberate. These tools are simple and stable, and they serve the purpose of the tutorial, which is to show how to orchestrate commands with Snakemake.
I will, as suggested, add notes that these are not the recommended tools for real analysis work. I don't propose to comment on what the state of the art is, as this is beyond the scope of the lesson and introduces a further maintenance burden on the lesson maintainer (i.e., me).
For the genome assembly, the only thing we need to know is that Velvet is a program that will take a bunch of short reads (in paired FASTQ files) and try to build them into long contigs (output as a FASTA file), and we are aiming to make the longest possible contig by tuning a parameter called "K". Everything else is a distraction from this defined task!
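To make that defined task concrete, here is a minimal sketch of the "pick the K that gives the longest contig" step in plain Python. It is not code from the lesson: the function names `longest_contig` and `best_k`, and the idea of passing in a mapping from K to a FASTA path, are assumptions made for illustration only.

```python
def longest_contig(fasta_path):
    """Return the length of the longest sequence in a FASTA file (0 if empty)."""
    longest = 0
    current = 0
    with open(fasta_path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                # A header line closes the previous record.
                longest = max(longest, current)
                current = 0
            else:
                # Sequence lines may be wrapped, so add up their lengths.
                current += len(line)
    # Account for the final record, which has no header after it.
    return max(longest, current)


def best_k(contigs_by_k):
    """Given a hypothetical {k: fasta_path} mapping, return (k, length)
    for the assembly with the longest contig. On a tie, the K that
    appears first in the mapping is returned."""
    lengths = {k: longest_contig(path) for k, path in contigs_by_k.items()}
    k = max(lengths, key=lengths.get)
    return k, lengths[k]
```

In a workflow, the mapping would be built from whatever FASTA files the assembly rule produces for each value of K; the exact paths depend on how the Snakefile is laid out and are deliberately left out here.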
When actually teaching the course, several learners have made the same points given above and asked to go into more detail about the assembly process, or other bioinformatics topics. But this is not the place to learn the "intricate challenge" of actual genome assembly, and if any learner starts thinking it is, then we are in trouble. I will add instructor notes that this should be clearly emphasised.
Also:
Kallisto performs (wording according to docs) a "pseudoalignment" - it is not a classical aligner and should not be mentioned as such.
Indeed - I'll correct this.
Comments from @cmeesters on this topic:
fastx is totally outdated; tools like cutadapt are now good replacements. It is OK to use fastx for didactics, though, as participants can view all steps of quality processing in detail. A note on the state of the art should be given regardless.
The assembly part comes out of the blue and is unrelated to everything before it. If you want to keep it, you need additional material describing the background. Best to put it into a separate chapter (or several), then.
Genome assembly is an intricate challenge. Recommending a relatively outdated tool like Velvet is dangerous, as there are numerous follow-up implementations tailored to various genome types.