alexdobin / STAR

RNA-seq aligner
MIT License
1.85k stars 505 forks source link

How to calculate RPM from .tab output of --quantMode GeneCounts #1223

Open carlotta-93 opened 3 years ago

carlotta-93 commented 3 years ago

Hi Alex, Firstly thank you for the great development of this tool.

I have a question concerning the gene counts from the .tab output when I set --quantMode GeneCounts and more specifically how to get from those counts to RPM (reads per million).

I have smallRNA-Seq data, unstranded, so when I retrieve the gene counts, I am only focusing on the first column of the .tab files.

Now, if I compute the column sum of a matrix (4 samples) built by joining all the .tab files (only 1st column), my understanding is that I should get the number of mapped reads in each sample. Is this assumption "even" correct ?

This is the output of the column sums:

  S1              S2          S3      S4 

1169953 821332 755315 1050780

If my assumption is correct, then I should see the same number of "uniquely mapped reads" in the output of .Log.final.out files, right ?

This is a screenshot of the .Log.final.out for each of my samples: _log_out_s1s4_smallrnaseq.pdf

As you can see the colsums that I get from the .tab files are actually closer to the Number of input reads from the log files than to the number of mapped reads.

My plan was to use these colsums to turn the read counts into RPM but now I am only more confused.

Can you help ?

Thanks a lot

alexdobin commented 3 years ago

Hi Carlotta,

if you sum all rows in the ReadsPerGene.out.tab file, except first two (i.e. N_noFeature+N_ambiguous+all genes) you should get the "Uniquely mapped reads number" from Log.final.out file. N_multimapping should be equal "Number of reads mapped to multiple loci"

Indeed the sum of all columns should be equal to the number of input reads, but I think there is a bug with the N_unmapped, it should be equal to Number of reads unmapped: too many mismatches + too short + other + Number of reads mapped to too many loci. I checked on my test examples, and it does not agree.

For RPM, calculation, I think the best thing is to divide the gene counts by the sum of gene counts, i.e. you will exclude reads that are not assigned to genes (4 top rows).

Cheers Alex

carlotta-93 commented 3 years ago

Hi Alex, Thanks for taking the time to look into this.

In the end I chose a different approach because the read counts from the standard htseq command do not allow me to include multimappers (*) and since these are smallRNA-Seq libraries, I have to take that into account.

Nonetheless, I would give this approach another try on my next mRNA-Seq project :D

(*) In this regard, it would be actually useful to be able to set some more parameters in the --quantMode, for htseq to use, if that is not already implemented.

Thanks for getting back to me anyway, really appreciated.

Best regards, Carlotta