Open chrisamiller opened 4 years ago
For (1), an example of using the strandedness tool, with related discussion, is here: https://rnabio.org/module-01-inputs/0001/05/01/RNAseq_Data/ (see the "Determining the strandedness of RNA-seq data" section)
For (2) an example of how this correlation might look: http://genomedata.org/rnaseq-tutorial/results/cbw2020/workspace/rnaseq/expression/Kallisto-StringTie-HTSeqCount_Comparisons.pdf
For gene-level expression we also compare to htseq-count raw read counts, which should also be highly correlated if things are working properly. Many people want raw counts, so it would be nice to have this in the pipeline as well.
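As a rough illustration of the comparison in (2), here is a minimal Python sketch that correlates gene-level estimates from two tools on a log2 scale. The function names, the pseudocount of 1, and the toy gene values are all assumptions made for illustration; they are not taken from the tutorial code or from the pipeline.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def log2_correlation(expr_a, expr_b, pseudocount=1.0):
    """Correlate log2(value + pseudocount) over genes present in both inputs.

    expr_a and expr_b map gene ID -> expression value (e.g. TPM or raw count).
    Returns (correlation, number_of_shared_genes).
    """
    shared = sorted(set(expr_a) & set(expr_b))
    xs = [math.log2(expr_a[g] + pseudocount) for g in shared]
    ys = [math.log2(expr_b[g] + pseudocount) for g in shared]
    return pearson(xs, ys), len(shared)


# Hypothetical toy values, NOT real StringTie or Kallisto output.
stringtie_tpm = {"geneA": 10.0, "geneB": 200.0, "geneC": 0.0, "geneD": 55.0}
kallisto_tpm = {"geneA": 12.0, "geneB": 180.0, "geneC": 0.5, "geneE": 3.0}

r, n_shared = log2_correlation(stringtie_tpm, kallisto_tpm)
print(f"shared genes: {n_shared}, Pearson r on log2 scale: {r:.3f}")
```

The same function could be pointed at htseq-count raw counts versus TPMs. A rank-based (Spearman) correlation may be a more robust choice there, since counts and TPMs are on different scales.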
For (3), a reference for the ERCC spike-in data: https://rnabio.org/assets/module_1/ERCC.pdf
Reference ERCC concentrations: http://genomedata.org/rnaseq-tutorial/results/cbw2020/workspace/rnaseq/expression/htseq_counts/ERCC_Controls_Analysis.txt
Example code to create a comparison table and visualization: http://genomedata.org/rnaseq-tutorial/results/cbw2020/workspace/rnaseq/expression/htseq_counts/Tutorial_ERCC_expression.pl and http://genomedata.org/rnaseq-tutorial/results/cbw2020/workspace/rnaseq/expression/htseq_counts/Tutorial_ERCC_expression.R
An example of what this comparison might look like: http://genomedata.org/rnaseq-tutorial/results/cbw2020/workspace/rnaseq/expression/htseq_counts/Tutorial_ERCC_expression.pdf
Note that in this example we compare to raw htseq-count counts. We could compare to StringTie and Kallisto TPMs instead.
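To make the ERCC comparison concrete, here is a minimal Python sketch that fits observed expression against expected spike-in concentration on a log2 scale and reports slope, intercept, and R². It is not the linked Perl/R tutorial code. The function name, the pseudocount, and the spike-in IDs and values are hypothetical; real expected concentrations would come from the ERCC reference table linked above.

```python
import math


def fit_log2_observed_vs_expected(expected, observed, pseudocount=1.0):
    """Least-squares fit of log2(observed + pseudocount) on log2(expected).

    expected: spike-in ID -> known concentration (must be > 0)
    observed: spike-in ID -> measured value (raw count or TPM)
    Returns (slope, intercept, r_squared) over IDs present in both inputs.
    """
    ids = sorted(set(expected) & set(observed))
    if len(ids) < 2:
        raise ValueError("need at least two shared spike-ins")
    xs = [math.log2(expected[i]) for i in ids]
    ys = [math.log2(observed[i] + pseudocount) for i in ids]
    n = len(ids)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("fit is undefined for constant input")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1.0 - ss_res / syy
    return slope, intercept, r_squared


# Hypothetical spike-in IDs and values, NOT real ERCC concentrations or counts.
expected_conc = {"spike_1": 1.0, "spike_2": 4.0, "spike_3": 16.0, "spike_4": 64.0}
observed_count = {"spike_1": 1.0, "spike_2": 7.0, "spike_3": 31.0, "spike_4": 127.0}

slope, intercept, r2 = fit_log2_observed_vs_expected(expected_conc, observed_count)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r2:.3f}")
```

A slope near 1 and a high R² would suggest the measured values track the known concentrations. The `observed` argument can hold htseq-count counts or StringTie/Kallisto TPMs without any change to the code.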
Another approach for (3) is to align to a reference genome that has the ERCC sequences added and then use a tool like fgbio: http://fulcrumgenomics.github.io/fgbio/tools/latest/CollectErccMetrics.html
Since the ERCC sequences are quite distinct, we could perhaps make this analysis independent of the reference genome: just run kallisto against those sequences alone and summarize based on that?
1) Add a tool to verify strandedness for RNAseq: https://github.com/betsig/how_are_we_stranded_here
2) Add gene/transcript expression comparison/correlation between StringTie and Kallisto
3) Utilize ERCC spike-in data