deweylab / RSEM

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data
http://deweylab.biostat.wisc.edu/rsem/
GNU General Public License v3.0
410 stars 118 forks source link

One of the shortest noncoding transcript is the most abundant according to rsem-calculate-expression #151

Open kaltinel opened 3 years ago

kaltinel commented 3 years ago

Hi,

I started to use RSEM, and I am happy with what it can do! I would like to use it further, to be able to pick the most abundant isoform of each gene, therefore wanted to try rsem-calculate-expression.

I am a little bit puzzled with its results unfortunately.

I can give an example in here for gene EEF1A1 : image where the highlighted transcript is having the highest isoform abundance percentage, which is also one of the shortest one..

In the table format:

Pasted Graphic

We can see that rsem-calculate-expression picks one of the shortest isoform, which is a non coding + retained intron type transcript as the most abundant one. However I would expect a much longer protein coding transcript to be the most abundant (which are coloured in blue for easy visualization).

To be able to run rsem-calculate-expression, I first run STAR and used the output files mapped to transcriptome. And then use them as rsem-calculate-expression input. My command line for the rsem-calculate-expression: rsem-prepare-reference --gtf gencode.v31.primary_assembly.annotation.gtf GRCh38.primary_assembly.genome.fa gencodev31 And then, I merged my STAR outputs (Transcriptome.out.bam files) for the same condition and run: rsem-calculate-expression -p 12 --fragment-length-mean 54.9 --fragment-length-sd 1.20921 --calc-pme --calc-ci --estimate-rspd --num-rspd-bins 20 --keep-intermediate-files --time --alignments merged_RNASEQ_WT_Transcriptome.out.bam gencodev31 RNAseq_WT

I would be happy to hear from you regarding where I am making a mistake in the process itself/interpretation of the results.

Thank you!

chaoyanggu commented 3 years ago

How can you run rsem-calculate-expression normally; the bam file I got from star cannot be recognized;

error : Warning: The SAM/BAM file declares less reference sequences (194) than RSEM knows (225618)! Please make sure that you aligned your reads against transcript sequences instead of genome. RSEM can not recognize reference sequence name chr1! "rsem-parse-alignments ./index/hg38 CD4-T.temp/CD4-T CD4-T.stat/CD4-T ./STAR_v33/CD4-T/CD4-TAligned.sortedByCoord.out.bam 1 -tag XM" failed! Plase check if you provide correct parameters/options for the pipeline!

the codes as follows: /software/RSEM-1.3.3/rsem-prepare-reference -p 16 --gtf ref_annot.gtf --star --star-path /software/STAR-2.7.2b/source/ ref_genome.fa ./index/hg38 /software/RSEM-1.3.3/rsem-calculate-expression -p 16 --alignments --paired-end ./STAR_v33/CD4-T/CD4-TAligned.sortedByCoord.out.bam ./index/hg38 CD4-T

thanks very much.