biostars / biostar-handbook

Issue tracker for the Biostar Handbook

RNA-Seq by Example, error in "5. Feature counting in RNA-Seq" #316

Closed BioinfGuru closed 10 months ago

BioinfGuru commented 10 months ago

Hi,

First, apologies if this issue has already been raised, I haven't found it.

In the section "How to count features", the commands should include --countReadPairs.

When I ran the commands as shown in the book, all counts were approximately double those shown in the example results image.

So the following command:

cat ids.txt | parallel -j 1 echo "bam/{}.bam" | xargs featureCounts -p -a refs/features.gff -o counts.txt

Should be:

cat ids.txt | parallel -j 1 echo "bam/{}.bam" | xargs featureCounts -p --countReadPairs -a refs/features.gff -o counts.txt
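
One way to check the doubling is to total the count columns of the two result tables and compare them. The sketch below assumes the usual featureCounts table layout (a "#" comment line, a header line starting with "Geneid", and sample counts from column 7 onward). The helper name sum_counts and the file names in the usage comment are made up for illustration.

```shell
# sum_counts is a hypothetical helper: it totals every sample column of a
# featureCounts table (columns 7 onward), skipping the "#" comment line
# and the header line whose first field is "Geneid".
sum_counts() {
    awk -F'\t' '!/^#/ && $1 != "Geneid" { for (i = 7; i <= NF; i++) total += $i }
                END { printf "%.0f\n", total }' "$1"
}

# Usage, assuming the two runs were saved under these illustrative names:
# sum_counts counts_p_only.txt
# sum_counts counts_read_pairs.txt
```

If the first total is roughly twice the second, the run with -p alone counted individual reads rather than read pairs.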

I don't understand enough about featureCounts. Could anyone explain why --countReadPairs has this effect?

Thanks for the great book

Regards, Bioinfguru

ialbert commented 10 months ago

Correct.

Unfortunately, what is going on here is that the featureCounts program changed its behavior through an ill-informed decision.

Starting with a specific version, both the -p and --countReadPairs flags are required. Before that version, -p alone produced the same behavior, and passing the second flag raised an error.
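
Since the right flags depend on the installed version, a script can choose them at run time. This is only a sketch. It assumes the usage text printed by featureCounts mentions --countReadPairs when the flag is supported, which is worth checking on your own installation. The helper name pe_flags is hypothetical.

```shell
# pe_flags is a hypothetical helper, not part of featureCounts.
# $1 is the usage/help text of the installed featureCounts.
# Assumption: versions that support --countReadPairs list it in that text.
pe_flags() {
    if printf '%s\n' "$1" | grep -q -- '--countReadPairs'; then
        echo "-p --countReadPairs"   # newer behavior: both flags needed
    else
        echo "-p"                    # older behavior: -p alone counts pairs
    fi
}

# Usage (commented out so the sketch runs without featureCounts installed).
# $FLAGS is left unquoted on purpose so it splits into separate arguments.
# FLAGS=$(pe_flags "$(featureCounts 2>&1)")
# cat ids.txt | parallel -j 1 echo "bam/{}.bam" | xargs featureCounts $FLAGS -a refs/features.gff -o counts.txt
```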

Even as of last year, the installation process installed the older version of featureCounts, so I chose the first form. Now it seems the updated version gets installed, so the second, two-flag form will be required.

I will test the various installation methods and will make the change soon.

ialbert commented 10 months ago

I made the changes in the book, thanks for reporting and reminding me to make this change.

Nice job noticing it.

You are well on your way to bioinfoguru-ness; the bioinformatics world is full of inconsistencies!

ialbert commented 10 months ago

I just realized that I did not explain the effect itself.

When we run single-end sequencing, each read corresponds to one transcript fragment.

When we run paired-end sequencing, each transcript fragment produces two reads.

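Note that in the second case two measurements come from a single fragment. Hence, in paired-end sequencing at the same coverage, only half as many independent transcript fragments are sampled. With -p alone, newer versions of featureCounts count each read separately, so every fragment is counted twice; adding --countReadPairs counts each pair once. This is why the counts are half as much.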
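
The difference between counting reads and counting fragments can be shown with a toy example. The data below are invented and the two helper functions are hypothetical. They only mimic the idea and are not how featureCounts works internally.

```shell
# Toy data: one aligned read per line, written as "fragment_id mate_number".
reads='frag1 1
frag1 2
frag2 1
frag2 2
frag3 1
frag3 2'

# Count every read on its own (what happens when pairs are not collapsed).
count_reads() { printf '%s\n' "$1" | wc -l | tr -d ' '; }

# Count each fragment once, no matter how many reads it produced.
count_fragments() { printf '%s\n' "$1" | cut -d' ' -f1 | sort -u | wc -l | tr -d ' '; }

echo "reads:     $(count_reads "$reads")"
echo "fragments: $(count_fragments "$reads")"
```

For this toy input the read count is 6 and the fragment count is 3, the same factor of two seen in the counts table.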

This is to say that paired-end sequencing is disadvantageous in any situation where we are counting reads, since we lose half the data. We might gain some mapping accuracy, though that is debatable, but the net effect is losing half the coverage and, with it, a lot of statistical power. So in general, paired-end RNA-Seq is not advisable.

The only time paired-end RNA-Seq makes sense is when we are assembling transcripts; in all other cases it leads to coverage loss.

BioinfGuru commented 9 months ago

Thanks for that, cheers.

So for well-annotated genomes, there's really no need for paired-end, especially if the goal is differential expression. Makes perfect sense. Thanks.