deweylab / RSEM

RSEM: accurate quantification of gene and isoform expression from RNA-Seq data
http://deweylab.biostat.wisc.edu/rsem/
GNU General Public License v3.0

RSEM Handling of GENCODE Genes #32

Closed by DarioS 7 years ago

DarioS commented 7 years ago

The GENCODE gene annotation sometimes assigns the same gene symbol to two or more gene records. For example:

  seqnames    start      end width strand  source type score phase                ID             gene_id      gene_type gene_status gene_name level
1     chr7 63505821 63538927 33107      +  HAVANA gene    NA    NA ENSG00000214652.5 ENSG00000214652.5_2 protein_coding       KNOWN    ZNF727     2
2     chr7 63505821 63538927 33107      + ENSEMBL gene    NA    NA ENSG00000257482.3   ENSG00000257482.3 protein_coding       KNOWN    ZNF727     3

The transcripts associated with these duplicated gene symbols have identically located exons, introns, and UTRs. I notice that even when reads mapped to the genome overlap genes like these, RSEM reports an expected count of 0 for both genes in genes.results, in all samples. Shouldn't rsem-prepare-reference handle this case better?
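To see how many genes in an annotation are affected, something like the following minimal sketch could be used. It assumes the annotation is in GENCODE GFF3 format (nine tab-separated columns, with `key=value` attributes in the last one), and it only compares gene-level coordinates, not the exon structure of each transcript. The helper names `parse_attributes` and `find_duplicate_genes` are hypothetical and not part of RSEM; the two example records are the ones shown above, rewritten as GFF3 lines.

```python
from collections import defaultdict


def parse_attributes(field):
    """Parse a GFF3 attribute column ('key=value;key=value') into a dict."""
    attrs = {}
    for pair in field.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def find_duplicate_genes(lines):
    """Return gene records that share location, strand and gene_name
    but carry more than one distinct gene_id.

    Assumes GFF3 input; this is an illustrative helper, not an RSEM function.
    """
    groups = defaultdict(set)
    for line in lines:
        # Skip blank lines and GFF3 comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        # Only gene-level records with the full nine columns are considered.
        if len(cols) != 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        name = attrs.get("gene_name")
        gene_id = attrs.get("gene_id")
        if name is None or gene_id is None:
            continue
        key = (cols[0], int(cols[3]), int(cols[4]), cols[6], name)
        groups[key].add(gene_id)
    # Keep only the groups that contain more than one gene_id.
    return {k: sorted(v) for k, v in groups.items() if len(v) > 1}


# The two ZNF727 records from this issue, written as GFF3 lines.
example = [
    "chr7\tHAVANA\tgene\t63505821\t63538927\t.\t+\t.\t"
    "ID=ENSG00000214652.5;gene_id=ENSG00000214652.5_2;"
    "gene_type=protein_coding;gene_status=KNOWN;gene_name=ZNF727;level=2",
    "chr7\tENSEMBL\tgene\t63505821\t63538927\t.\t+\t.\t"
    "ID=ENSG00000257482.3;gene_id=ENSG00000257482.3;"
    "gene_type=protein_coding;gene_status=KNOWN;gene_name=ZNF727;level=3",
]

duplicates = find_duplicate_genes(example)
for key, gene_ids in duplicates.items():
    print(key, gene_ids)
```

For a real annotation, the same function could be given an open file handle instead of the `example` list, since it only iterates over lines.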

DarioS commented 7 years ago

According to the GENCODE project support staff, this is caused by a bug in the GENCODE genes lift-over software and will be fixed in GENCODE version 26. Therefore, rsem-prepare-reference doesn't need to accommodate this.

bli25wisc commented 7 years ago

Hi @DarioS, glad to hear that the issue is resolved.