deweylab / RSEM

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data
http://deweylab.biostat.wisc.edu/rsem/
GNU General Public License v3.0

assignment statistics #88

Closed jamesdalg closed 6 years ago

jamesdalg commented 6 years ago

Question: is there a way to get statistics on how RSEM performed on a given dataset in terms of assignment percentage? I was having trouble with a particular dataset in featureCounts when assigning reads to genes. Is there a way to get basic stats about RSEM's performance (rather than the aligner's performance) for individual runs or experiments? Below is an example of the kind of stats I mean, produced by featureCounts after alignment:

Assigned 696763
Unassigned_Ambiguity 11448
Unassigned_MultiMapping 13953741
Unassigned_NoFeatures 17772725
Unassigned_Unmapped 0
Unassigned_MappingQuality 0
Unassigned_FragementLength 13813566
Unassigned_Chimera 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_Duplicate 0

I really like RSEM and what it can do (very powerful!), but I'd like to know how it performed (for example, whether there were reads that just couldn't be assigned).
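For context, the percentages implied by the summary above can be worked out with a few lines of Python. This is only a sketch: the function names `parse_summary` and `percentages` are made up for illustration, and the category names are copied exactly as they appear in the post. With these numbers, the assigned share comes out at roughly 1.5% of the total.

```python
from typing import Dict

# Counts copied verbatim from the featureCounts summary quoted in the post.
FEATURECOUNTS_SUMMARY = """\
Assigned 696763
Unassigned_Ambiguity 11448
Unassigned_MultiMapping 13953741
Unassigned_NoFeatures 17772725
Unassigned_Unmapped 0
Unassigned_MappingQuality 0
Unassigned_FragementLength 13813566
Unassigned_Chimera 0
Unassigned_Secondary 0
Unassigned_Nonjunction 0
Unassigned_Duplicate 0
"""


def parse_summary(text: str) -> Dict[str, int]:
    """Read 'category count' pairs, skipping anything that is not a count row."""
    counts: Dict[str, int] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2 and fields[1].isdigit():
            counts[fields[0]] = int(fields[1])
    return counts


def percentages(counts: Dict[str, int]) -> Dict[str, float]:
    """Express each category as a percentage of the sum over all categories."""
    total = sum(counts.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: 100.0 * value / total for name, value in counts.items()}


counts = parse_summary(FEATURECOUNTS_SUMMARY)
for name, pct in percentages(counts).items():
    print(f"{name}\t{counts[name]}\t{pct:.2f}%")
```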

bli25wisc commented 6 years ago

@jamesdalg, thanks for liking RSEM!

Yes, you can find some of these statistics in the 'sample_name.stat/sample_name.cnt' file. Here is the description of that file (you can also find it in the RSEM folder as cnt_file_description.txt):

# '#' marks the start of comments (till the end of the line)
# *.cnt file contains alignment statistics based purely on the alignment results obtained from aligners

N0 N1 N2 N_tot   # N0, number of unalignable reads; N1, number of alignable reads; N2, number of filtered reads due to too many alignments; N_tot = N0 + N1 + N2
nUnique nMulti nUncertain   # nUnique, number of reads aligned uniquely to a gene; nMulti, number of reads aligned to multiple genes; nUnique + nMulti = N1;
                            # nUncertain, number of reads aligned to multiple locations in the given reference sequences, which include isoform-level multi-mapping reads
nHits read_type   # nHits, number of total alignments.
                  # read_type: 0, single-end read, no quality score; 1, single-end read, with quality score; 2, paired-end read, no quality score; 3, paired-end read, with quality score

# The next section counts reads by the number of alignments they have. Each line contains two values separated by a TAB character. The first value is number of alignments. 'Inf' refers to reads filtered due to too many alignments. The second value is the number of reads that contain such many alignments

0	N0
...
number_of_alignments	number_of_reads_with_that_many_alignments
...
Inf	N2
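To turn the layout described above into percentages, something like the following Python sketch could be used. It is not part of RSEM: `CntStats`, `parse_cnt`, and `summarize` are hypothetical names, and the numbers in `EXAMPLE_CNT` are invented purely to illustrate the format. The parser assumes the file follows the description exactly (three header lines, then tab-separated histogram rows).

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CntStats:
    n0: int            # unalignable reads
    n1: int            # alignable reads
    n2: int            # reads filtered due to too many alignments
    n_tot: int         # N0 + N1 + N2
    n_unique: int      # reads aligned uniquely to a gene
    n_multi: int       # reads aligned to multiple genes
    n_uncertain: int   # reads aligned to multiple locations in the reference
    n_hits: int        # total number of alignments
    read_type: int     # 0-3, see the description above
    histogram: Dict[str, int]  # number of alignments ("0", "1", ..., "Inf") -> reads


def parse_cnt(text: str) -> CntStats:
    """Parse the contents of a *.cnt file laid out as described above."""
    rows: List[List[str]] = []
    for raw in text.splitlines():
        fields = raw.split("#", 1)[0].split()  # tolerate trailing comments
        if fields:
            rows.append(fields)
    if len(rows) < 3:
        raise ValueError("expected at least three non-empty lines")
    n0, n1, n2, n_tot = (int(x) for x in rows[0])
    n_unique, n_multi, n_uncertain = (int(x) for x in rows[1])
    n_hits, read_type = (int(x) for x in rows[2])
    histogram = {fields[0]: int(fields[1]) for fields in rows[3:]}
    return CntStats(n0, n1, n2, n_tot, n_unique, n_multi, n_uncertain,
                    n_hits, read_type, histogram)


def percent(part: int, whole: int) -> float:
    return 100.0 * part / whole if whole else 0.0


def summarize(stats: CntStats) -> Dict[str, float]:
    """Percentages of all reads, plus the gene-level split of alignable reads."""
    return {
        "unalignable_pct": percent(stats.n0, stats.n_tot),
        "alignable_pct": percent(stats.n1, stats.n_tot),
        "filtered_pct": percent(stats.n2, stats.n_tot),
        "unique_gene_pct_of_alignable": percent(stats.n_unique, stats.n1),
        "multi_gene_pct_of_alignable": percent(stats.n_multi, stats.n1),
    }


# Invented numbers, for illustration only (not from a real run).
EXAMPLE_CNT = (
    "100 850 50 1000\n"
    "600 250 300\n"
    "1250 3\n"
    "0\t100\n"
    "1\t550\n"
    "2\t200\n"
    "3\t100\n"
    "Inf\t50\n"
)

for name, value in summarize(parse_cnt(EXAMPLE_CNT)).items():
    print(f"{name}\t{value:.1f}")
```

For a real run, the text would come from the file mentioned above, e.g. `parse_cnt(open("sample_name.stat/sample_name.cnt").read())`. Note that, per the description, these numbers reflect the alignment results only.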

jyjiey commented 4 years ago

Hello,

Thanks for this thread. May I ask whether the program takes mapping percentages or multi-mapping counts into account when calculating the count matrix or the normalized matrix? I couldn't find any clues about how RSEM infers and uses the number of genes each read maps to. To be more specific, I used bowtie2 for alignment and then cleaned the SAM file because we used UMI barcodes. If RSEM calculates that value and uses it, then I should probably keep the unmapped reads, etc. Thanks so much!

Best, Jie

jyjiey commented 4 years ago

Just to add a comment: the reason I'm struggling with this is that the expression levels of a large number of genes appear to be related to the mapping percentages. I wonder whether that is because of the normalization method, or whether the values for some genes are lower than the real values because that file has more multi-mapping reads. Looking forward to your reply!

Thanks, Jie