dcppc / crosscut-metadata

Comments on plan for AGR/MGI in the README #21

Open cmungall opened 6 years ago

cmungall commented 6 years ago

I see the README has details on how Alliance data is to be encoded in DATS, thanks for adding this.

Are there any example JSON files?

The README says:

AGR/MGI encoding The preliminary encoding for the MGI mouse reference genome annotation is quite simple

This is a bit confusing. The AGR (preferred name: Alliance) is more than MGI. Is the plan to get data directly from MGI, to get mouse data from the Alliance (which may temporarily be less complete than what is obtained from MGI), or to get all species data from the Alliance?

I think it should be all species data; I'm not sure why MGI is highlighted specifically.

The HomoloGene ids and HomoloGene-derived human gene ids in relatedIdentifiers...

Human homologs should be obtained from the Alliance; this will be more accurate than HomoloGene.

Overall comments:

The KC7 products Google doc says that expression data will be captured from the Alliance (or at least from MGI), but the example in the README covers only basic gene information. The Alliance is also producing gene-to-phenotype data that is of broad interest. How should this be resolved?

It looks like the data model used is a generic one in which arbitrary Dimensions and CategoryValuePairs can be attached to arbitrary molecular entities. I think there are some advantages to such a generic model, but I question whether this is the best way of representing what is in knowledge bases like the Alliance. It feels like an impedance mismatch. In the diagram:

[image: data model diagram]

This just seems like a slightly awkward way of expressing what can be expressed more accurately in a line of GFF3 or in the Alliance's own native JSON format. It's not clear how well the dimension model will adapt to richer data from the Alliance, e.g. expression or phenotype.
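To make the comparison concrete, here is a minimal Python sketch that parses one GFF3 line and re-expresses it as a generic entity with attached dimensions. The GFF3 column layout is the standard nine-column one; everything else is an assumption for illustration: the example line, the identifier `EXAMPLE:0001`, the name `ExampleGene`, and the keys in `to_generic_entity` are hypothetical and are not taken from the actual DATS schema or from the README.

```python
from urllib.parse import unquote

# The nine tab-separated columns defined by GFF3.
GFF3_COLUMNS = (
    "seqid", "source", "type", "start", "end",
    "score", "strand", "phase", "attributes",
)


def parse_gff3_line(line):
    """Parse a single GFF3 feature line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    record = dict(zip(GFF3_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    # Column 9 holds semicolon-separated, URL-escaped key=value pairs.
    attrs = {}
    for pair in record["attributes"].split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[unquote(key)] = unquote(value)
    record["attributes"] = attrs
    return record


def to_generic_entity(record):
    """Re-express the record as an entity with generic dimensions.

    Hypothetical layout for illustration only; NOT the actual DATS schema.
    """
    return {
        "identifier": record["attributes"].get("ID"),
        "name": record["attributes"].get("Name"),
        "dimensions": [
            {"name": "chromosome", "value": record["seqid"]},
            {"name": "start position", "value": record["start"]},
            {"name": "end position", "value": record["end"]},
            {"name": "strand", "value": record["strand"]},
        ],
        "extraProperties": [
            {"category": "feature type", "value": record["type"]},
        ],
    }


# Made-up feature line, not real MGI or Alliance data.
example = "chr1\texample_source\tgene\t1000\t2000\t.\t+\t.\tID=EXAMPLE:0001;Name=ExampleGene"
record = parse_gff3_line(example)
entity = to_generic_entity(record)
```

In this sketch the typed GFF3 columns become free-form name/value pairs, so a consumer has to know by convention which dimension names to look for. That is one way to read the impedance mismatch described above.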

I propose that we simultaneously evaluate the biolink model for knowledge resources such as the Alliance. This would incur additional cost on the full stacks if they want to support both but it would be interesting to compare.

proccaserra commented 6 years ago

@cmungall this initial "MGI" DATS file is more a range-finding exercise than anything else. We have already discussed several times that, for such molecular/genome annotation information, there may be little value in creating yet another representation. If you and the Alliance can produce a JSON instance and/or the RDF/XML for AGR information, it could be used to complement DATS coverage of datasets. This brings us back to the key question: what are the query cases? How do people want to cast their net? We are all reading the use-case documents.

proccaserra commented 6 years ago

@cmungall also see issue #20 https://github.com/dcppc/crosscut-metadata/issues/20, which discusses similar issues to those you raised. We discussed this with @jonathancrabtree and @agbeltran.