blachlylab / prex-py

prex Extracts Promoter Sequences
GNU General Public License v3.0

APPRIS principal isoform 1 not in all gene records #3

Open jblachly opened 8 years ago

jblachly commented 8 years ago

Objective

Identify the principal isoform and use it as the default record when a gene identifier is passed.

See here: http://useast.ensembl.org/Homo_sapiens/Help/Glossary?id=521

PRINCIPAL:1 Transcript(s) expected to code for the main functional isoform based solely on the core modules in the APPRIS.

PRINCIPAL:2 Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.

PRINCIPAL:3 Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants have distinct CCDS identifiers, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.

PRINCIPAL:4 Where the APPRIS core modules are unable to choose a clear principal CDS and more than one variant has distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.

PRINCIPAL:5 Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.

kwkroll32 commented 8 years ago

There's a new column in the GTF dataframe for appris_principal values. Any level is accepted (1, 2, 3, 4, 5, etc.), but the principal isoforms of a gene must share exactly one start codon. Technically, it will not crash if two different transcripts are both called the principal isoform, but it will crash if they have different start codons.
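To make the constraint concrete, here is a minimal sketch of the check described above. It is not the actual prex code: the rows are plain dicts standing in for the GTF dataframe, and the field names (`gene_id`, `transcript_id`, `start_codon`), the IDs, and the function name `principal_start_codon` are all hypothetical. Only `appris_principal` comes from the comment above. The sketch raises an explicit error instead of crashing when principal isoforms disagree.

```python
# Hypothetical rows standing in for the GTF dataframe; every field name
# except appris_principal is an assumption made for illustration.
RECORDS = [
    {"gene_id": "GENE_A", "transcript_id": "TX_A1", "appris_principal": 1, "start_codon": 1000},
    {"gene_id": "GENE_A", "transcript_id": "TX_A2", "appris_principal": 2, "start_codon": 1000},
    {"gene_id": "GENE_A", "transcript_id": "TX_A3", "appris_principal": None, "start_codon": 1500},
    {"gene_id": "GENE_B", "transcript_id": "TX_B1", "appris_principal": 1, "start_codon": 200},
    {"gene_id": "GENE_B", "transcript_id": "TX_B2", "appris_principal": 3, "start_codon": 260},
]


def principal_start_codon(records, gene_id):
    """Return the one start-codon position shared by a gene's principal isoforms."""
    # Any APPRIS level counts as "principal"; untagged transcripts are ignored.
    starts = {
        rec["start_codon"]
        for rec in records
        if rec["gene_id"] == gene_id and rec["appris_principal"] is not None
    }
    if not starts:
        raise LookupError(f"{gene_id}: no APPRIS principal isoform found")
    if len(starts) > 1:
        # Several principal transcripts are fine only if they agree on the start codon.
        raise ValueError(
            f"{gene_id}: principal isoforms have different start codons: {sorted(starts)}"
        )
    return starts.pop()
```

With these rows, `GENE_A` resolves to a single start codon even though two transcripts are tagged, while `GENE_B` raises `ValueError` because its two tagged transcripts start at different positions.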

TO-DO: isoform trumping? I.e. should a lower APPRIS level take precedence: principal 1 > principal 2 > principal 3 ...
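One possible shape for the trumping rule, as a sketch only and not something implemented in prex: keep just the transcripts with the lowest APPRIS level for the gene. As before, the rows are plain dicts rather than the real dataframe, and `best_principal`, the field names other than `appris_principal`, and the IDs are hypothetical.

```python
# Hypothetical rows; only the appris_principal column name comes from the thread.
RECORDS = [
    {"gene_id": "GENE_B", "transcript_id": "TX_B1", "appris_principal": 1, "start_codon": 200},
    {"gene_id": "GENE_B", "transcript_id": "TX_B2", "appris_principal": 3, "start_codon": 260},
    {"gene_id": "GENE_C", "transcript_id": "TX_C1", "appris_principal": 2, "start_codon": 75},
    {"gene_id": "GENE_C", "transcript_id": "TX_C2", "appris_principal": 2, "start_codon": 90},
    {"gene_id": "GENE_C", "transcript_id": "TX_C3", "appris_principal": None, "start_codon": 40},
]


def best_principal(records, gene_id):
    """Return the gene's transcripts that carry the lowest (strongest) APPRIS level."""
    candidates = [
        rec
        for rec in records
        if rec["gene_id"] == gene_id and rec["appris_principal"] is not None
    ]
    if not candidates:
        return []
    # Lower number wins: principal 1 trumps principal 2, which trumps principal 3.
    best_level = min(rec["appris_principal"] for rec in candidates)
    return [rec for rec in candidates if rec["appris_principal"] == best_level]
```

For `GENE_B` this leaves only `TX_B1`, so the conflicting start codon on `TX_B2` no longer matters. For `GENE_C` both level-2 transcripts survive, so ties at the same level would still need the single-start-codon check or another tie-breaker.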