TGAC / earlham-galaxytools

Galaxy tools and workflows developed at the Earlham Institute
https://tgac.github.io/earlham-galaxytools/
MIT License

Deal with multiple CDS IDs for the same transcript #120

Open nsoranzo opened 4 years ago

nsoranzo commented 4 years ago

in the gstf_preparation tool.

Biologically, a single mRNA can give rise to different CDSs (and therefore different protein translations) due to alternative translational start sites. This is in fact allowed by the GFF3 standard: https://github.com/The-Sequence-Ontology/Specifications/blob/master/gff3.md (look for "alternative translational start sites"). If a CDS is discontinuous, its fragments must use the same ID, so the ID can be used to group the fragments that make up each alternative CDS.
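To make the grouping idea concrete, here is a minimal sketch of collecting CDS fragments per transcript and per CDS ID. It is not the current gstf_preparation code: the function name `group_cds_by_id` and the example records are made up for illustration, and the attribute parsing is deliberately naive (no URL-decoding of escaped characters).

```python
from collections import defaultdict


def group_cds_by_id(gff3_lines):
    """Return {transcript_id: {cds_id: [(start, end), ...]}}.

    Hypothetical helper: groups CDS rows by their Parent and then by ID,
    so each distinct ID under a transcript is one (possibly alternative) CDS.
    """
    grouped = defaultdict(lambda: defaultdict(list))
    for line in gff3_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments and directives
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "CDS":
            continue
        # Naive parsing of the attributes column (key=value pairs separated by ';')
        attrs = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        cds_id = attrs.get("ID")
        if cds_id is None:
            continue  # without an ID the fragments cannot be grouped this way
        # A feature may list several comma-separated parents
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                grouped[parent][cds_id].append((int(cols[3]), int(cols[4])))
    return {transcript: dict(cds) for transcript, cds in grouped.items()}


# Made-up example: one mRNA with two alternative CDSs, each in two fragments
example_lines = [
    "##gff-version 3",
    "ctg1\t.\tCDS\t100\t200\t.\t+\t0\tID=cds1;Parent=mRNA1",
    "ctg1\t.\tCDS\t300\t400\t.\t+\t1\tID=cds1;Parent=mRNA1",
    "ctg1\t.\tCDS\t150\t200\t.\t+\t0\tID=cds2;Parent=mRNA1",
    "ctg1\t.\tCDS\t300\t400\t.\t+\t0\tID=cds2;Parent=mRNA1",
]
grouped = group_cds_by_id(example_lines)
```

With this structure, a transcript whose inner dictionary has more than one key carries more than one CDS ID, which is the case this issue is about.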

Ensembl seems to enforce the "one CDS per transcript" rule in its databases, but we don't have to.

Additional problem: some GFF3 files (e.g. the one in the gstf_preparation tool help!) use different IDs for fragments of the same CDS, which I think is non-standard.
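One possible way to tell the two situations apart, sketched below, rests on an assumption that is not part of the GFF3 standard: fragments of a single CDS never overlap each other, whereas alternative CDSs of the same transcript usually share part of their coding region. The function names and return labels are hypothetical, and the heuristic would need checking against real files before being relied on.

```python
from itertools import combinations


def intervals_overlap(a, b):
    """True if two closed 1-based intervals (start, end) share any base."""
    return a[0] <= b[1] and b[0] <= a[1]


def classify_transcript_cds(cds_by_id):
    """Classify the CDS IDs of one transcript.

    cds_by_id maps each CDS ID to its list of (start, end) fragments.
    Heuristic (an assumption, see above): distinct IDs whose fragments
    never overlap are probably pieces of one CDS with non-standard IDs.
    """
    if len(cds_by_id) <= 1:
        return "single_cds"
    for (_, frags_a), (_, frags_b) in combinations(cds_by_id.items(), 2):
        if any(intervals_overlap(a, b) for a in frags_a for b in frags_b):
            return "alternative_cds"
    return "fragments_with_distinct_ids"


# Made-up examples
split_ids = {"cds1": [(100, 200)], "cds2": [(300, 400)]}
alternative = {
    "cds1": [(100, 200), (300, 400)],
    "cds2": [(150, 200), (300, 400)],
}
split_label = classify_transcript_cds(split_ids)
alternative_label = classify_transcript_cds(alternative)
```

A transcript labelled `fragments_with_distinct_ids` could then have its fragments merged into one CDS, while `alternative_cds` transcripts would keep one CDS per ID.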