genome / analysis-workflows

Open workflow definitions for genomic analysis from MGI at WUSM.
MIT License

Add RNAseq sanity/QC checks #904

Open chrisamiller opened 4 years ago

chrisamiller commented 4 years ago

1) Add a tool to verify strandedness for RNAseq: https://github.com/betsig/how_are_we_stranded_here

2) Add gene/transcript expression comparison/correlation between StringTie and Kallisto

3) Utilize ERCC spike-in data

malachig commented 4 years ago

For (1), an example of using the strandedness tool, with related discussion, is here: https://rnabio.org/module-01-inputs/0001/05/01/RNAseq_Data/ (see the "Determining the strandedness of RNA-seq data" section)
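To make the kind of check in (1) concrete, here is a minimal sketch of how a strandedness call could be derived from read orientation counts. The function name, its inputs, and both cutoffs are assumptions for illustration only; they are not taken from how_are_we_stranded_here or from the pipeline, and real cutoffs would need to be chosen deliberately.

```python
def classify_strandedness(forward_reads, reverse_reads,
                          stranded_cutoff=0.9, unstranded_cutoff=0.6):
    """Classify a library from counts of reads matching each orientation.

    forward_reads / reverse_reads: reads whose orientation agrees with the
    forward or reverse strand of the annotated transcript (hypothetical
    inputs; how they are counted is outside this sketch).
    Both cutoffs are illustrative assumptions.
    """
    total = forward_reads + reverse_reads
    if total <= 0:
        raise ValueError("no informative reads")
    fwd = forward_reads / total
    rev = reverse_reads / total
    if fwd >= stranded_cutoff:
        return "forward"
    if rev >= stranded_cutoff:
        return "reverse"
    if max(fwd, rev) < unstranded_cutoff:
        return "unstranded"
    # Neither clearly stranded nor clearly unstranded: flag for review.
    return "undetermined"
```

A QC step could fail or warn when the result disagrees with the strandedness declared in the workflow inputs, or when it comes back "undetermined".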

For (2), an example of how this correlation might look: http://genomedata.org/rnaseq-tutorial/results/cbw2020/workspace/rnaseq/expression/Kallisto-StringTie-HTSeqCount_Comparisons.pdf

For gene-level expression we also compare to htseq-count raw read counts, which should also be highly correlated if things are working properly. Many people want raw counts, so it would be nice to have this in the pipeline as well.
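A rough sketch of the correlation check, assuming each tool's output has already been reduced to a mapping from gene ID to expression value (TPM or raw count). The function names are hypothetical, and using Pearson correlation on log2(x + 1) values is one reasonable choice here, not necessarily what the linked comparison uses.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / math.sqrt(sxx * syy)


def expression_correlation(expr_a, expr_b):
    """Correlate log2(x + 1) expression over genes present in both inputs.

    expr_a / expr_b: dicts of gene ID -> expression value (hypothetical
    pre-parsed inputs, e.g. one from StringTie and one from Kallisto).
    Returns (correlation, number_of_shared_genes).
    """
    shared = sorted(set(expr_a) & set(expr_b))
    xs = [math.log2(expr_a[g] + 1.0) for g in shared]
    ys = [math.log2(expr_b[g] + 1.0) for g in shared]
    return pearson(xs, ys), len(shared)
```

A QC check could then warn when the correlation falls below some agreed threshold; what that threshold should be is left open here.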

For (3), a reference for the ERCC spike-in data: https://rnabio.org/assets/module_1/ERCC.pdf

Reference ERCC concentrations: http://genomedata.org/rnaseq-tutorial/results/cbw2020/workspace/rnaseq/expression/htseq_counts/ERCC_Controls_Analysis.txt

Example code to create a comparison table and visualization:
http://genomedata.org/rnaseq-tutorial/results/cbw2020/workspace/rnaseq/expression/htseq_counts/Tutorial_ERCC_expression.pl
http://genomedata.org/rnaseq-tutorial/results/cbw2020/workspace/rnaseq/expression/htseq_counts/Tutorial_ERCC_expression.R

An example of what this comparison might look like: http://genomedata.org/rnaseq-tutorial/results/cbw2020/workspace/rnaseq/expression/htseq_counts/Tutorial_ERCC_expression.pdf

Note that in this example we compare to raw htseq counts. We could compare to StringTie and Kallisto TPMs instead.
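As a sketch of what the ERCC comparison boils down to: fit observed expression against the known spike-in concentration on a log2 scale and report the slope and R². This is not a port of the linked Perl/R scripts; the function name and input shapes are assumptions, and skipping undetected spike-ins (rather than adding a pseudocount) is a simplification.

```python
import math


def ercc_dose_response(expected_conc, observed):
    """Least-squares fit of log2(observed) on log2(expected concentration).

    expected_conc: dict of spike-in ID -> known concentration.
    observed: dict of spike-in ID -> measured value (raw count or TPM).
    Spike-ins with a zero or missing observed value are skipped.
    """
    ids = sorted(t for t in expected_conc
                 if expected_conc[t] > 0 and observed.get(t, 0) > 0)
    if len(ids) < 2:
        raise ValueError("need at least two detected spike-ins")
    xs = [math.log2(expected_conc[t]) for t in ids]
    ys = [math.log2(observed[t]) for t in ids]
    n = len(ids)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("expected concentrations are all identical")
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return {"n": n, "slope": slope, "intercept": intercept,
            "r_squared": r_squared}
```

The same function works whether `observed` holds htseq-count raw counts or StringTie/Kallisto TPMs, so the choice of input can be made later.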

jasonwalker80 commented 4 years ago

Another approach for (3) is to align to a reference genome that has the ERCC sequences added and then use a tool like fgbio's CollectErccMetrics: http://fulcrumgenomics.github.io/fgbio/tools/latest/CollectErccMetrics.html

malachig commented 4 years ago

Since the ERCC sequences are quite distinct, we could perhaps make this analysis independent of the reference genome: just run kallisto against those sequences alone and summarize based on that?
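A minimal sketch of the "summarize" half of that idea, assuming kallisto has already been run against an ERCC-only index and produced its usual abundance.tsv (columns target_id, length, eff_length, est_counts, tpm). The function name and the "ERCC-" ID prefix are assumptions; the prefix depends on the FASTA headers used to build the index.

```python
import csv
import io


def summarize_ercc_abundance(handle, prefix="ERCC-"):
    """Summarize spike-in rows from a kallisto abundance.tsv file handle."""
    reader = csv.DictReader(handle, delimiter="\t")
    est_counts = {}
    for row in reader:
        target = row["target_id"]
        if target.startswith(prefix):
            est_counts[target] = float(row["est_counts"])
    detected = [t for t, c in est_counts.items() if c > 0]
    return {
        "n_targets": len(est_counts),
        "n_detected": len(detected),
        "total_est_counts": sum(est_counts.values()),
        "est_counts": est_counts,
    }


# Toy input with made-up IDs and values, only to show the expected layout.
rows = [
    ["target_id", "length", "eff_length", "est_counts", "tpm"],
    ["ERCC-X1", "1000", "800", "50", "600000"],
    ["ERCC-X2", "500", "300", "0", "0"],
    ["OTHER-1", "2000", "1800", "10", "400000"],
]
toy_tsv = "\n".join("\t".join(r) for r in rows) + "\n"
summary = summarize_ercc_abundance(io.StringIO(toy_tsv))
```

The per-target `est_counts` could then be compared against the reference ERCC concentrations to get a dose-response check without involving the genome alignment at all.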