Open cybersiddhu opened 9 years ago
A GFF3 post-processor script that would extract information from the GenBank assembly page and load it into chado. The script will also create links between downstream features. It might take the assembly ID or the taxon ID as input; it will take a little trial and error to settle on the one that works.
Asked chado schema group for ideas.
An implementation to look at, as suggested on the mailing list.
Context
When the same organism is sequenced multiple times, there has to be a way to capture that information. Once we have it, we will be able to figure out which particular build a genome belongs to, whether it is the canonical build, etc. At dictybase, this will happen when a strain is sequenced multiple times. This will be seldom, but it is definitely possible; for example, multiple research groups have sequenced the canonical AX4 strain. It's wise to have the provision in the data model.
Data model for implementation
As discussed on the chado mailing list, there are a few options, each with its ups and downs.
- assembly cvterm to type the feature, and the feature relation member_of to relate the chromosomes and contigs. Notes: The downside is that it needs a fake grouping feature, which will always be hacky. The other solution is to plug in the chado group module; however, it is not yet in final shape to be released.
- analysis and analysis_feature tables to model the assembly. Represent the assembly as an analysis entry and then link the required features through analysis_feature. This is quite clear, straightforward, less hacky and chado centric. So, let's stick with this model.
Other avenues to explore
- chado group module, which has been discussed and is being actively used, or is planned to be used, in a few places.
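To make the analysis / analysis_feature model concrete, here is a minimal sketch in Python using an in-memory SQLite database. It is only an illustration under stated assumptions: the tables are heavily simplified stand-ins for the chado ones (real chado runs on PostgreSQL, has more columns, and types features through cvterm), and the build names, program name and feature names are all hypothetical.

```python
import sqlite3

# Simplified stand-ins for the chado tables (assumption: only the columns
# needed for the illustration; real chado types features via type_id -> cvterm).
SCHEMA = """
CREATE TABLE feature (
    feature_id INTEGER PRIMARY KEY,
    uniquename TEXT NOT NULL,
    type       TEXT NOT NULL
);
CREATE TABLE analysis (
    analysis_id    INTEGER PRIMARY KEY,
    name           TEXT,
    program        TEXT NOT NULL,
    programversion TEXT NOT NULL,
    sourcename     TEXT,
    UNIQUE (program, programversion, sourcename)
);
CREATE TABLE analysis_feature (
    analysis_feature_id INTEGER PRIMARY KEY,
    feature_id  INTEGER NOT NULL REFERENCES feature (feature_id),
    analysis_id INTEGER NOT NULL REFERENCES analysis (analysis_id),
    UNIQUE (feature_id, analysis_id)
);
"""


def load_assembly(conn, name, program, version, source, features):
    """Create one analysis row for the assembly and link its features to it."""
    cur = conn.execute(
        "INSERT INTO analysis (name, program, programversion, sourcename) "
        "VALUES (?, ?, ?, ?)",
        (name, program, version, source),
    )
    analysis_id = cur.lastrowid
    for uniquename, ftype in features:
        cur = conn.execute(
            "INSERT INTO feature (uniquename, type) VALUES (?, ?)",
            (uniquename, ftype),
        )
        conn.execute(
            "INSERT INTO analysis_feature (feature_id, analysis_id) VALUES (?, ?)",
            (cur.lastrowid, analysis_id),
        )
    return analysis_id


def features_in_assembly(conn, name):
    """Return the uniquenames of all features linked to the named assembly."""
    rows = conn.execute(
        "SELECT f.uniquename FROM feature f "
        "JOIN analysis_feature af ON af.feature_id = f.feature_id "
        "JOIN analysis a ON a.analysis_id = af.analysis_id "
        "WHERE a.name = ? ORDER BY f.uniquename",
        (name,),
    ).fetchall()
    return [row[0] for row in rows]


def demo():
    """Load two hypothetical builds of the same strain into a fresh database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    load_assembly(conn, "AX4 build A", "hypothetical-assembler", "1.0", "lab-A",
                  [("chrA-1", "chromosome"), ("ctgA-7", "contig")])
    load_assembly(conn, "AX4 build B", "hypothetical-assembler", "1.0", "lab-B",
                  [("chrB-1", "chromosome")])
    return conn
```

Calling features_in_assembly(demo(), "AX4 build A") lists only the chromosomes and contigs of that build, which is the question the Context section wants answered: which build a given feature belongs to. No fake grouping feature is needed, since each assembly is just one analysis row.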