PyEED / pyeed

🧬 Toolkit to create, annotate, and analyze specialized sequence databases
https://pyeed.github.io/pyeed/
MIT License

Data model: ProteinSequence and NucleotideSequence #15

Closed: itbjpl closed this issue 1 year ago

itbjpl commented 1 year ago

I suggest generalizing and renaming the objects ProteinSequence and NucleotideSequence in the specifications:

  1. Consistent naming: either we call them ProteinSequence and DNASequence, or we call them AminoAcidSequence and NucleotideSequence. I am aware that GenBank names its databases "protein" and "nucleotide". In contrast, EBI calls them "Protein" and "DNA" (https://www.ebi.ac.uk/services/data-resources-and-tools).
  2. Construct the NucleotideSequence object in the same way as the ProteinSequence object, so that we can collect DNA sequences independently of a protein sequence entry. In a second round, we might link several DNA sequences to a single protein sequence (n:1 relation). Each DNA sequence can (but does not need to) refer to a protein sequence. There is no need to refer from a protein sequence to a DNA sequence.
  3. Both objects, ProteinSequence and DNASequence, should have the attribute "organism". This information is taken from the protein sequence database or the DNA sequence database, respectively. There might be inconsistencies, but this is what the databases report (e.g. a protein is from organism1, but it is encoded by two different genes, where gene1 is from organism1 and gene2 is from organism2).
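
To make the three points concrete, here is a minimal sketch of what such a data model could look like. It uses plain Python dataclasses rather than the project's actual specification format, and the names `Organism`, `protein_id`, and `dna_by_protein` as well as the example IDs and sequences are assumptions for illustration only, not the current pyeed API.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass
class Organism:
    # Hypothetical minimal organism record
    name: str
    taxonomy_id: Optional[str] = None


@dataclass
class ProteinSequence:
    id: str
    sequence: str
    organism: Organism  # as reported by the protein sequence database


@dataclass
class DNASequence:
    id: str
    sequence: str
    organism: Organism  # as reported by the DNA sequence database
    # Optional n:1 link: a DNA sequence may refer to one protein sequence.
    # The protein holds no reference back to its DNA sequences.
    protein_id: Optional[str] = None


def dna_by_protein(dna_sequences: list[DNASequence]) -> dict[str, list[DNASequence]]:
    """Group DNA sequences by the protein they refer to (unlinked ones are skipped)."""
    grouped: dict[str, list[DNASequence]] = defaultdict(list)
    for dna in dna_sequences:
        if dna.protein_id is not None:
            grouped[dna.protein_id].append(dna)
    return dict(grouped)


# Made-up example data mirroring point 3: one protein, two encoding genes
# whose source databases report different organisms.
organism1 = Organism(name="organism1")
organism2 = Organism(name="organism2")

protein = ProteinSequence(id="P1", sequence="MKV", organism=organism1)
gene1 = DNASequence(id="D1", sequence="ATGAAAGTT", organism=organism1, protein_id="P1")
gene2 = DNASequence(id="D2", sequence="ATGAAGGTG", organism=organism2, protein_id="P1")
standalone = DNASequence(id="D3", sequence="ATGCCC", organism=organism2)  # no protein link

links = dna_by_protein([gene1, gene2, standalone])
```

Keeping the reference only on the DNA side means a DNA sequence can be stored on its own, and the n:1 relation can be resolved later by grouping, without changing the ProteinSequence object. The organism mismatch between `protein` and `gene2` is stored as is, since each value simply reflects its source database.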