Specs for loading core genomes

Representing strain or subspecies

As suggested in the chado documentation, append the strain or subspecies to the species value. The canonical dicty entry thus becomes Dictyostelium discoideum AX4.
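
A minimal sketch of this convention, assuming a heavily simplified stand-in for the chado organism table (only the genus and species columns, in an in-memory SQLite database rather than PostgreSQL). The helper `species_with_strain` is a hypothetical name used for illustration only.

```python
import sqlite3

# Simplified stand-in for chado's organism table; the real table has more
# columns and lives in PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE organism (
        organism_id INTEGER PRIMARY KEY,
        genus TEXT NOT NULL,
        species TEXT NOT NULL,
        UNIQUE (genus, species)
    )
    """
)


def species_with_strain(species, strain=None):
    """Append the strain (or subspecies) to the species value, if given."""
    return f"{species} {strain}" if strain else species


conn.execute(
    "INSERT INTO organism (genus, species) VALUES (?, ?)",
    ("Dictyostelium", species_with_strain("discoideum", "AX4")),
)

genus, species = conn.execute("SELECT genus, species FROM organism").fetchone()
full_name = f"{genus} {species}"
print(full_name)
```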

Representing multiple assemblies of the same organism

As discussed on the chado mailing list, there are a few options, each with its own trade-offs.

Each assembly has its own organism entry. This could be done by appending the assembly id to the species value. However, it creates fake organism entries, which is undesirable. It also takes extra work to collect all the information for a particular organism across its different assemblies.
Organism grouping. This would be the optimal solution; however, nothing is available for it in default chado.
Create an assembly feature for grouping. This is modeled on the way GenBank and Ensembl handle assemblies: a chado feature is created to represent the assembly. An example is a genome assembly page from NCBI; the information gathered from such a page could easily be turned into a chado feature for the assembly. The assembled features, such as chromosomes and contigs, will be its biological descendants. Representing the entry: use the assembly cvterm to type the feature and the member_of feature relationship to relate the chromosomes and contigs to it. Notes: the downside is the fake grouping feature, which will always be hacky. The other solution is to plug in the chado group module; however, it is not yet in a final shape to be released.
:heavy_check_mark: Use the analysis and analysis_feature tables to model the assembly. Represent the assembly as an analysis entry and then link the required features through analysis_feature. This is clear, straightforward, less hacky, and chado centric. So, let's stick with this model.
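
The chosen model could be sketched roughly as follows. This is an illustration under stated assumptions, not the actual loader: the three tables are cut-down versions of the chado feature, analysis and analysis_feature tables (in chado the feature type is a type_id pointing at cvterm, and the database is PostgreSQL), and every assembly name, program name and feature name below is a made-up placeholder.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Cut-down versions of the chado tables, for illustration only.
    CREATE TABLE feature (
        feature_id INTEGER PRIMARY KEY,
        uniquename TEXT NOT NULL,
        type TEXT NOT NULL            -- chado uses type_id -> cvterm
    );
    CREATE TABLE analysis (
        analysis_id INTEGER PRIMARY KEY,
        name TEXT,
        program TEXT NOT NULL,
        programversion TEXT NOT NULL
    );
    CREATE TABLE analysis_feature (
        analysis_feature_id INTEGER PRIMARY KEY,
        feature_id INTEGER NOT NULL REFERENCES feature (feature_id),
        analysis_id INTEGER NOT NULL REFERENCES analysis (analysis_id),
        UNIQUE (feature_id, analysis_id)
    );
    """
)


def add_assembly(conn, name, program, version, features):
    """Store one assembly as an analysis row and link its features to it."""
    analysis_id = conn.execute(
        "INSERT INTO analysis (name, program, programversion) VALUES (?, ?, ?)",
        (name, program, version),
    ).lastrowid
    for uniquename, feature_type in features:
        feature_id = conn.execute(
            "INSERT INTO feature (uniquename, type) VALUES (?, ?)",
            (uniquename, feature_type),
        ).lastrowid
        conn.execute(
            "INSERT INTO analysis_feature (feature_id, analysis_id) VALUES (?, ?)",
            (feature_id, analysis_id),
        )
    return analysis_id


def features_in_assembly(conn, name):
    """Return the uniquenames of all features linked to the named assembly."""
    rows = conn.execute(
        """
        SELECT f.uniquename
        FROM feature f
        JOIN analysis_feature af ON af.feature_id = f.feature_id
        JOIN analysis a ON a.analysis_id = af.analysis_id
        WHERE a.name = ?
        ORDER BY f.uniquename
        """,
        (name,),
    )
    return [row[0] for row in rows]


# Placeholder names: two assemblies of the same organism.
add_assembly(conn, "assembly_A", "hypothetical-assembler", "1.0",
             [("chrA_1", "chromosome"), ("chrA_2", "chromosome")])
add_assembly(conn, "assembly_B", "hypothetical-assembler", "2.0",
             [("contigB_1", "contig")])

print(features_in_assembly(conn, "assembly_A"))
```

Both assemblies can point at the same organism entry, so no fake organism or grouping feature is needed; the assembly is recovered with a join through analysis_feature.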

Versioning feature entries with sequence

A versioning model will be applied to the majority of the sequence features. The idea is borrowed primarily from GenBank. Every feature will have a sequence id (internal, akin to the GID or PID in GenBank) and a stable identifier (the accession number in GenBank). The stable identifier always starts at version 1. Any change in a feature's sequence creates a new feature entry with a new sequence id, whereas the stable identifier remains intact and its version number is incremented (1 becomes 2). In other words, all features with identical stable identifiers differ in their version, and the one with the highest version is the canonical one. The history of a feature's sequence changes will also be preserved.
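
The versioning rules can be sketched in memory as below. This is only an illustration of the bookkeeping, not the chado storage layout: `FeatureVersion`, `FeatureStore` and the identifier `GENE0001` are hypothetical names, and treating a save with an unchanged sequence as a no-op is an assumption, not something stated in this spec.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List


@dataclass(frozen=True)
class FeatureVersion:
    sequence_id: int   # internal id, new for every sequence change
    stable_id: str     # stays the same across versions
    version: int       # starts at 1, incremented on every sequence change
    sequence: str


class FeatureStore:
    """Keeps every version of every feature, newest last."""

    def __init__(self) -> None:
        self._next_sequence_id = count(1)
        self._history: Dict[str, List[FeatureVersion]] = {}

    def save(self, stable_id: str, sequence: str) -> FeatureVersion:
        history = self._history.setdefault(stable_id, [])
        # Assumption: an unchanged sequence does not create a new version.
        if history and history[-1].sequence == sequence:
            return history[-1]
        entry = FeatureVersion(
            sequence_id=next(self._next_sequence_id),
            stable_id=stable_id,
            version=len(history) + 1,
            sequence=sequence,
        )
        history.append(entry)
        return entry

    def canonical(self, stable_id: str) -> FeatureVersion:
        """The entry with the highest version is the canonical one."""
        return self._history[stable_id][-1]

    def history(self, stable_id: str) -> List[FeatureVersion]:
        """All versions of a feature, oldest first."""
        return list(self._history[stable_id])


store = FeatureStore()
first = store.save("GENE0001", "ATGC")
unchanged = store.save("GENE0001", "ATGC")
second = store.save("GENE0001", "ATGA")
print(store.canonical("GENE0001").version)
```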
