Representing multiple assemblies of the same organism

cybersiddhu commented 9 years ago

Context

When the same organism is sequenced multiple times, there has to be a way to capture that information. Once it is captured, we can tell which particular build a genome belongs to, whether it is the canonical build, and so on. At dictyBase, this arises when a strain is sequenced more than once. It will be rare but is definitely possible; for example, multiple research groups have sequenced the canonical AX4 strain. It is wise to make provision for it in the data model.

Data model for implementation

As discussed on the chado mailing list, there are a few options, each with its ups and downs.

Each assembly has its own organism entry. This could be done by appending the assembly id to the species value. However, it creates fake organism entries, which is undesirable. It also takes extra work to gather all the information for a particular organism across its different assemblies.
Create an assembly feature for grouping. This is modeled on how GenBank and Ensembl handle assemblies: a chado feature is created to represent the assembly. See an example genome assembly page from NCBI. The information gathered from such a page could easily be turned into a chado feature for the assembly. The assembled features, such as chromosomes and contigs, will be its biological descendants. Representing the entry: Use an assembly cvterm to type the feature and the feature relationship member_of to relate the chromosomes and contigs to it. Notes: The downside is having that fake grouping feature, which will always be hacky. The other solution is to plug in the chado group module; however, it is not yet in a final shape to be released.
Use the analysis and analysis_feature tables to model the assembly. Represent the assembly as an analysis entry and then link the required features through analysis_feature. This is quite clear, straightforward, less hacky and chado-centric. So, let's stick with this model.
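A minimal sketch of the analysis-based model, assuming a heavily simplified subset of the chado tables held in an in-memory SQLite database rather than a real chado PostgreSQL instance. The assembly names, program values and feature names are made up for illustration, the type text column is only a stand-in for chado's type_id reference to cvterm, and the helper functions are hypothetical.

```python
import sqlite3

# Simplified subset of chado's organism, feature, analysis and
# analysis_feature tables. Real chado has many more columns.
SCHEMA = """
CREATE TABLE organism (
    organism_id INTEGER PRIMARY KEY,
    genus       TEXT NOT NULL,
    species     TEXT NOT NULL
);
CREATE TABLE feature (
    feature_id  INTEGER PRIMARY KEY,
    organism_id INTEGER NOT NULL REFERENCES organism(organism_id),
    uniquename  TEXT NOT NULL,
    type        TEXT NOT NULL  -- stand-in for type_id -> cvterm
);
CREATE TABLE analysis (
    analysis_id    INTEGER PRIMARY KEY,
    name           TEXT,
    program        TEXT NOT NULL,
    programversion TEXT NOT NULL,
    sourcename     TEXT
);
CREATE TABLE analysis_feature (
    analysisfeature_id INTEGER PRIMARY KEY,
    feature_id  INTEGER NOT NULL REFERENCES feature(feature_id),
    analysis_id INTEGER NOT NULL REFERENCES analysis(analysis_id),
    UNIQUE (feature_id, analysis_id)
);
"""


def build_demo_db():
    """Create an in-memory database with one organism and two assemblies."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO organism VALUES (1, 'Dictyostelium', 'discoideum')"
    )
    # Each assembly is one analysis row (all values are illustrative).
    conn.executemany(
        "INSERT INTO analysis "
        "(analysis_id, name, program, programversion, sourcename) "
        "VALUES (?, ?, ?, ?, ?)",
        [
            (1, "AX4 assembly A", "assembler", "1.0", "group-A"),
            (2, "AX4 assembly B", "assembler", "2.0", "group-B"),
        ],
    )
    # Chromosomes and contigs all belong to the single real organism.
    conn.executemany(
        "INSERT INTO feature VALUES (?, ?, ?, ?)",
        [
            (1, 1, "chr1_A", "chromosome"),
            (2, 1, "chr2_A", "chromosome"),
            (3, 1, "contig1_B", "contig"),
        ],
    )
    # analysis_feature ties each feature to the assembly it came from.
    conn.executemany(
        "INSERT INTO analysis_feature (feature_id, analysis_id) VALUES (?, ?)",
        [(1, 1), (2, 1), (3, 2)],
    )
    conn.commit()
    return conn


def features_for_assembly(conn, assembly_name):
    """Return the uniquenames of features linked to the named assembly."""
    rows = conn.execute(
        "SELECT f.uniquename FROM feature f "
        "JOIN analysis_feature af ON af.feature_id = f.feature_id "
        "JOIN analysis a ON a.analysis_id = af.analysis_id "
        "WHERE a.name = ? ORDER BY f.uniquename",
        (assembly_name,),
    ).fetchall()
    return [row[0] for row in rows]


def assemblies_for_organism(conn, genus, species):
    """Return the names of all assemblies that have features for an organism."""
    rows = conn.execute(
        "SELECT DISTINCT a.name FROM analysis a "
        "JOIN analysis_feature af ON af.analysis_id = a.analysis_id "
        "JOIN feature f ON f.feature_id = af.feature_id "
        "JOIN organism o ON o.organism_id = f.organism_id "
        "WHERE o.genus = ? AND o.species = ? ORDER BY a.name",
        (genus, species),
    ).fetchall()
    return [row[0] for row in rows]
```

With this layout the organism table keeps a single entry per species, and listing every assembly of that organism is one join through analysis_feature rather than a search across fake organism entries.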
Other avenues to explore
As mentioned before, the chado group module has been discussed and is in use, or planned for use, in a few places.
Karl Pinc's data model for storing genomes from different populations of the same organism.

cybersiddhu commented 9 years ago

Tied to https://github.com/dictyBase/Migration/issues/5

cybersiddhu commented 9 years ago

Software implementation

A GFF3 post-processor script that would extract information from the GenBank assembly page and load it into chado. The script will also create links to the downstream features. It might take the assembly id or the taxon id as input; however, it will need a little trial and error before settling on the one that works.

cybersiddhu commented 9 years ago

Asked the chado schema group for ideas.

cybersiddhu commented 9 years ago

An implementation to look at, as suggested on the mailing list.

dictyBase / Modware-Loader