Open jblachly opened 8 years ago
There's a new column in the GTF dataframe for appris_principal values. Any value is valid (1, 2, 3, 4, 5, etc.), but there must be only one principal isoform start codon per gene. Technically, it will not crash if two different transcripts are both called the principal isoform, but it will crash if they have different start codons.
TO-DO: isoform trumping? i.e. principal 1 > principal 2 > principal 3 ...
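A minimal sketch of how the trumping rule could work, assuming each transcript is a plain dict. The function name `pick_principal` and the keys `gene_id`, `transcript_id`, `appris_principal` and `start_codon` are hypothetical and are not taken from the existing code. The lowest appris_principal level wins, and an error is raised only when the winning transcripts disagree on the start codon:

```python
def pick_principal(records, gene_id):
    """Return the default transcript record for gene_id, or None.

    records: iterable of dicts with the hypothetical keys
    'gene_id', 'transcript_id', 'appris_principal' (int or None)
    and 'start_codon' (genomic position of the start codon).
    """
    # Keep only transcripts of this gene that carry an APPRIS principal level.
    candidates = [
        r for r in records
        if r["gene_id"] == gene_id and r.get("appris_principal") is not None
    ]
    if not candidates:
        return None

    # Trumping: principal 1 > principal 2 > principal 3 ...
    best_level = min(r["appris_principal"] for r in candidates)
    top = [r for r in candidates if r["appris_principal"] == best_level]

    # Several transcripts may share the best level, but they must
    # agree on a single start codon.
    start_codons = {r["start_codon"] for r in top}
    if len(start_codons) > 1:
        raise ValueError(
            f"{gene_id}: principal isoforms at level {best_level} "
            f"have different start codons: {sorted(start_codons)}"
        )
    return top[0]
```

With this approach a principal 2 transcript with a different start codon is simply ignored when a principal 1 transcript exists, so the conflict only surfaces when the tie is at the best level.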
Objective
Identify the principal isoform as the default record when a gene identifier is passed.
See here: http://useast.ensembl.org/Homo_sapiens/Help/Glossary?id=521
PRINCIPAL:1 Transcript(s) expected to code for the main functional isoform based solely on the APPRIS core modules.
PRINCIPAL:2 Where the APPRIS core modules are unable to choose a clear principal variant (approximately 25% of human protein coding genes), the database chooses two or more of the CDS variants as "candidates" to be the principal variant.
PRINCIPAL:3 Where the APPRIS core modules are unable to choose a clear principal variant and more than one of the variants has a distinct CCDS identifier, APPRIS selects the variant with the lowest CCDS identifier as the principal variant. The lower the CCDS identifier, the earlier it was annotated.
PRINCIPAL:4 Where the APPRIS core modules are unable to choose a clear principal CDS and there is more than one variant with distinct (but consecutive) CCDS identifiers, APPRIS selects the longest CCDS isoform as the principal variant.
PRINCIPAL:5 Where the APPRIS core modules are unable to choose a clear principal variant and none of the candidate variants are annotated by CCDS, APPRIS selects the longest of the candidate isoforms as the principal variant.