dictyBase / Modware-Loader

Various data munging and loading scripts for genome database

Representing multiple assemblies of the same organism #149

Open cybersiddhu opened 9 years ago

cybersiddhu commented 9 years ago

Context

When the same organism is sequenced multiple times, there has to be a way to capture that information. Once we have it, we will be able to determine which build a given genome belongs to, whether it is the canonical build, and so on. At dictyBase, this will happen when a strain is sequenced multiple times. This will be rare but is definitely possible; for example, multiple research groups have sequenced the canonical AX4 strain. It is wise to have a provision for this in the data model.

Data model for implementation

As discussed on the chado mailing list, there are a few options, each with its pros and cons.

cybersiddhu commented 9 years ago

Tied to https://github.com/dictyBase/Migration/issues/5

cybersiddhu commented 9 years ago

Software implementation

A GFF3 post-processor script that would extract information from the GenBank assembly page and load it into chado. The script will also create links between downstream features. It might take the assembly id or the taxon id as input; however, it will need a little trial and error before settling on the one that works.
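
As a rough illustration of the extraction step only, here is a minimal Python sketch that collects assembly metadata from `# Key: value` header lines, the style used in NCBI assembly report files. The function name `parse_assembly_header`, the sample field names, and all sample values are assumptions for illustration and are not taken from this issue. Fetching the report by assembly id or taxon id and loading the result into chado are not shown.

```python
from typing import Dict, Iterable


def parse_assembly_header(lines: Iterable[str]) -> Dict[str, str]:
    """Collect '# Key: value' header lines into a dict (hypothetical helper)."""
    meta: Dict[str, str] = {}
    for line in lines:
        if not line.startswith("#"):
            break  # the header ends at the first non-comment line
        body = line.lstrip("#").strip()
        # Split on the first colon only, so values may contain colons.
        key, sep, value = body.partition(":")
        if sep and key.strip() and value.strip():
            meta[key.strip()] = value.strip()
    return meta


# Placeholder sample data; the field names and values are assumptions.
SAMPLE = """\
# Assembly name:  example_assembly_1
# Organism name:  Dictyostelium discoideum AX4
# Taxid:          0000
# GenBank assembly accession: GCA_000000000.0
chr1\tassembled-molecule
"""

meta = parse_assembly_header(SAMPLE.splitlines())
```

A dict like `meta` could then be mapped onto whichever chado tables the data model settles on, keyed by either the assembly accession or the taxon id.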

cybersiddhu commented 9 years ago

Asked chado schema group for ideas.

cybersiddhu commented 9 years ago

An implementation to look at as suggested in the mailing list.