Starting from FASTQ or BAM generates different results

smshuai commented 5 years ago

Hi Kipp,

I have met another issue when I analyzed my data (stranded total RNA-Seq) from FASTQ files. Only 9 circRNAs were detected but previously using BAMs as the starting point I found more than 1000 circRNAs. It seems that most circRNAs are in circRNA.5reads.3ind.countmatrix.NotStrandImbalanceReduced but not circRNA.5reads.3ind.countmatrix:

# 4 steps procedure
$ wc -l circRNA.5reads.3ind.*
    9 circRNA.5reads.3ind.annotated
    9 circRNA.5reads.3ind.countmatrix
 1358 circRNA.5reads.3ind.countmatrix.NotStrandImbalanceReduced
    9 circRNA.5reads.3ind.genes

Whilst with STAR-aligned BAMs, I got:

# 2 steps procedure
$ wc -l circRNA.5reads.2ind.*
    105 circRNA.5reads.2ind.0cpm_0samples_variance_PCA.pdf
   1661 circRNA.5reads.2ind.annotated
   1661 circRNA.5reads.2ind.countmatrix
   1661 circRNA.5reads.2ind.genes
    182 circRNA.5reads.2ind.heatmap.pdf

My questions are:

Are the problem associated with my RNA-Seq data type which is stranded?
I know using FASTQ files and realignments are better, but In practice, what's the correlation in terms of quantification?
Slightly unrelated, can you perform differential expression analysis using these counts?

Best, Shimin

kippakers commented 5 years ago

Hi Shimin,

YES. STARChip currently assumes your data isn't stranded, but circRNA.5reads.3ind.countmatrix.NotStrandImbalanceReduced has all the count matrix data you should use.
Great question, it really depends a lot on your length of reads. For shorter reads (~50 bp), it makes a huge difference. By 100bp reads it's not nearly as critical, but I think it's still good practice. Paired end reads also strongly empower quantitation without realignment. This is all theory + anecdotal evidence, I haven't had the chance to thoroughly study the effects.
Yes! that's the idea behind STARChip. If you're using something like CPM, make sure you're using the correct denominator: that is, your PerMillionMapped value should be for LINEAR features mapped from the same fastq file(s). In practice, circRNA read counts can be very low (depending on your experiment), so either a) filter circRNA the same way you would genes for expression (e.g. > 1cpm in 10% of samples) or b) be aware that low end circRNA reads are noisy and may be misbehaving in your statistical models.

smshuai commented 5 years ago

Hi Kipp,

Thank you for your reply. I also found a simple solution by disabling the strand imbalance filter in starchip/scripts/circles/cpm1.R:

### Modified for stranded data
output_all$Kept = TRUE
# output_all$Kept = output_all$samplesWithStrandImbalanceOver2 <= (ncol(gcounts_ratio_adjusted)/2)

Maybe you can make that an option in the future.

Best, Shimin

LosicLab / starchip

Starting from FASTQ or BAM generates different results #24