cgat-developers / cgat-flow

cgat-flow repository
MIT License

Strandedness is the wrong way around in bamstats #58

Closed by jscaber 5 years ago

jscaber commented 5 years ago

Dear All,

Pipeline rnaseqqc produces the correct strand output (ISF/ISR); I believe this comes from salmon's automatic library type detection.

Pipeline bamstats gets it wrong: the strand metrics come from PicardRNASeqMetrics, which expects the strandedness to be provided. This fails because of the confusion between firststrand and secondstrand introduced by the Tuxedo suite (they were the first to introduce the firststrand notation for the reverse strand).

So to recap: the most common Illumina stranded prep kits, and most other stranded prep kits available, have the first read on the reverse strand.

Hence:

- RF for hisat etc.
- ISR for salmon/sailfish
- fr-firststrand for Tophat
- 2 for Featurecounts
- SECOND_READ_TRANSCRIPTION_STRAND for Picard RNASeq stats
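For reference, the mapping above could be kept in one lookup table so that a single strandedness setting is translated consistently for every tool. This is only a minimal sketch, not code from cgat-flow: the names `STRAND_NOTATION` and `translate` are hypothetical, paired-end libraries are assumed, and the "forward" and "unstranded" rows are filled in from each tool's own documented notation rather than from this issue.

```python
# Hypothetical lookup table (not part of cgat-flow); assumes paired-end libraries.
STRAND_NOTATION = {
    # First read on the reverse strand: the common stranded-kit case described above.
    "reverse": {
        "hisat": "RF",
        "salmon": "ISR",
        "tophat": "fr-firststrand",
        "featurecounts": "2",
        "picard": "SECOND_READ_TRANSCRIPTION_STRAND",
    },
    # First read on the forward strand (added for completeness, not from this issue).
    "forward": {
        "hisat": "FR",
        "salmon": "ISF",
        "tophat": "fr-secondstrand",
        "featurecounts": "1",
        "picard": "FIRST_READ_TRANSCRIPTION_STRAND",
    },
    # Unstranded libraries (added for completeness, not from this issue).
    "unstranded": {
        "hisat": None,  # hisat2 is unstranded when no strandness option is given
        "salmon": "IU",
        "tophat": "fr-unstranded",
        "featurecounts": "0",
        "picard": "NONE",
    },
}


def translate(value, source_tool, target_tool):
    """Convert one tool's strandedness notation into another tool's notation."""
    for notation in STRAND_NOTATION.values():
        if notation[source_tool] == value:
            return notation[target_tool]
    raise ValueError(f"Unknown {source_tool} strandedness value: {value!r}")


if __name__ == "__main__":
    # Example: turn a salmon library type into the value Picard expects.
    print(translate("ISR", "salmon", "picard"))
```

With a table like this, the library type reported by salmon in rnaseqqc could in principle be converted into the value Picard needs, instead of being entered by hand in a second notation.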

I will fix this issue in the pipeline.

Jakub

Acribbs commented 5 years ago

Ah thanks Jakub!

jscaber commented 5 years ago

I have now done some testing and removed the reference to Tophat-style libraries above the code (the code itself is the correct way around but unused). I also added back the strandedness option in pipeline.yml, which is needed for Picard.

Bamstats still gets the libraries wrong though, and that's got to do with the tool bam2libtype...

jscaber commented 5 years ago

Since the remaining issue is with bam2libtype, should we close this here and move to cgat-apps? I tested bam2libtype on 3 datasets, both stranded and unstranded, and it did not work on any of them.

jscaber commented 5 years ago

The bam2libtype testing was validated against 2 tools that were run on the same libraries and gave concordant results: infer_librarytype.py for RSEQC and salmon autodetect.

AndreasHeger commented 5 years ago

Thanks, yes, please open a new issue on cgat-apps and reference this one.