Open aegururaj opened 6 years ago
I looked at AGR_FB_Mapping (I assume the other MOD files are identical since the format is identical across the Alliance files).
I previously expressed some of my concerns here: https://github.com/dcppc/crosscut-metadata/issues/21
I'm not sure the DATS Dimension model is appropriate for representing the Alliance data. Even when representing basic gene information the mapping is lossy. For example, multiple fields with distinct semantics (displayName, prefix, localid) are all mapped to relatedIdentifiers.
On row 22, type is mapped to title. Is this a mistake?
It looks like the ortholog mapping is lossy; it's not clear how a homology query could be performed on the transformed data.
It may be the case that I am misunderstanding the mappings file. Is there an example DATS JSON file? That would really help.
@cmungall We have the Elasticsearch endpoint here that has 5 MGI sample DATS JSON files: MGI/5622662, MGI/5622581, MGI/1346023, MGI/106092, MGI/99205
Thanks! Can you provide URLs to get to the JSON?
Sure, there you go, MGI sample DATS JSON
Thanks again!
It looks like things are not being mapped at the correct level. For example,
"isAbout": [
  {
    "@type": "MolecularEntity",
    "name": "",
    "taxonomy": [
      {
        "@type": "TaxonomicInformation",
        "name": "Zfp58",
        "identifier": {
          "identifier": "NCBITaxon:10090",
          "identifierSource": "NCBITaxon:10090"
        },
        "relatedIdentifiers": [
          {
            "identifier": "RIKEN cDNA A530094I17 gene",
            "identifierSource": "RIKEN cDNA A530094I17 gene",
            "relationType": "RIKEN cDNA A530094I17 gene"
          }
        ]
      }
    ],
Zfp58 is the gene symbol, not the name of the taxon (which should be Mus musculus). Similarly, the RIKEN identifiers are at the level of the gene, not the taxon.
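A quick sanity check over the exported JSON could catch entries like the RIKEN one above. This is only a minimal sketch: the function name `find_suspect_identifiers` is hypothetical, and the heuristic (flag any relatedIdentifiers entry whose three fields all carry the same string) is an assumption drawn from this one sample, not a DATS rule.

```python
def find_suspect_identifiers(taxonomy_entries):
    """Return relatedIdentifiers entries whose three fields hold identical text.

    Heuristic only (an assumption based on the sample above): when identifier,
    identifierSource and relationType are all the same string, the value was
    probably copied into the wrong slots.
    """
    suspects = []
    for taxon in taxonomy_entries:
        for rel in taxon.get("relatedIdentifiers", []):
            values = {
                rel.get("identifier"),
                rel.get("identifierSource"),
                rel.get("relationType"),
            }
            if len(values) == 1 and None not in values:
                suspects.append(rel)
    return suspects


# The taxonomy entry from the sample above.
sample_taxonomy = [
    {
        "@type": "TaxonomicInformation",
        "name": "Zfp58",
        "identifier": {
            "identifier": "NCBITaxon:10090",
            "identifierSource": "NCBITaxon:10090",
        },
        "relatedIdentifiers": [
            {
                "identifier": "RIKEN cDNA A530094I17 gene",
                "identifierSource": "RIKEN cDNA A530094I17 gene",
                "relationType": "RIKEN cDNA A530094I17 gene",
            }
        ],
    }
]

for rel in find_suspect_identifiers(sample_taxonomy):
    print("suspect entry:", rel["identifier"])
```

A check like this says nothing about where the value should go; it only points a reviewer at rows of the mapping file worth a second look.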
For the identifiers, there are things like:
"relatedIdentifiers": [
  {
    "identifierSource": "MGI",
    "identifier": "MGI:99205",
    "relationType": "gene"
  },
  {
    "identifier": "MGI:99205",
    "relationType": "gene",
    "identifierSource": "MGI"
  },
  {
    "identifierSource": "MGI",
    "identifier": "99205",
    "relationType": "gene"
  },
  {
    "identifierSource": "MGI",
    "identifier": "MGI:99205",
    "relationType": "gene"
  }
],
I suggest having a single canonical identifier and using a CURIE such as MGI:99205, which would facilitate JSON-LD to RDF conversion using a canonical context file.
I'm looking for the homology information; it seems to be embedded inside Material objects:
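To illustrate the suggestion, here is a minimal sketch of collapsing the four entries above into one canonical CURIE. The helper names `to_curie` and `canonical_identifiers` are hypothetical, and the rule that a bare local ID takes its identifierSource as prefix is an assumption, not something the mapping files specify.

```python
def to_curie(source, identifier):
    """Prefix a bare local ID with its source; leave existing CURIEs untouched."""
    if ":" in identifier:
        return identifier
    return f"{source}:{identifier}"


def canonical_identifiers(related):
    """Return the distinct CURIEs in a relatedIdentifiers list, in first-seen order."""
    seen = []
    for entry in related:
        curie = to_curie(entry["identifierSource"], entry["identifier"])
        if curie not in seen:
            seen.append(curie)
    return seen


# The four entries from the sample above.
sample_related = [
    {"identifierSource": "MGI", "identifier": "MGI:99205", "relationType": "gene"},
    {"identifier": "MGI:99205", "relationType": "gene", "identifierSource": "MGI"},
    {"identifierSource": "MGI", "identifier": "99205", "relationType": "gene"},
    {"identifierSource": "MGI", "identifier": "MGI:99205", "relationType": "gene"},
]

print(canonical_identifiers(sample_related))  # ['MGI:99205']
```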
"characteristics": [
  {
    "name": "",
    "@type": "Material",
    "identifier": {
      "identifier": "HGNC:28857"
    },
    "values": [
      "low",
      "false",
      "false",
      "13",
      "67500555",
      "67490167"
    ],
    "relatedIdentifiers": [
      {
        "identifier": "ZNF682"
      },
      {
        "identifier": "ZNF675"
      },
      {
        "identifier": "ZNF430"
      },
I don't really know what a Material is here, or what the list of values is intended to represent.
Overall I'm still not quite sure I grok the data model. Each gene is modeled as a DatasetDistribution, the DatasetDistribution conformsTo a SO type such as 'gene', the DatasetDistribution isAbout a MolecularEntity (which doesn't have a type field), the MolecularEntity has characteristics which are Materials, and the Materials also have identifiers, but these seem to be gene symbols. It looks like the Materials actually represent the orthologous genes, but there is nothing to indicate that these are homologs, and it's not clear why a gene is a MolecularEntity if it's in the species of interest and a Material in another species.
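For concreteness, this is roughly the traversal a consumer would need in order to pull the apparent orthologs out of one record. It is a sketch of my reading of the samples, not the intended access path: the function name `material_identifiers` is hypothetical, and the record below is assembled from the fragments quoted above, so the overall nesting is an assumption.

```python
def material_identifiers(distribution):
    """Collect the identifier of every Material under isAbout -> characteristics."""
    found = []
    for entity in distribution.get("isAbout", []):
        for item in entity.get("characteristics", []):
            if item.get("@type") == "Material":
                ident = item.get("identifier", {}).get("identifier")
                if ident:
                    found.append(ident)
    return found


# Record assembled from the quoted fragments; the nesting is assumed.
sample_distribution = {
    "@type": "DatasetDistribution",
    "isAbout": [
        {
            "@type": "MolecularEntity",
            "name": "",
            "characteristics": [
                {
                    "name": "",
                    "@type": "Material",
                    "identifier": {"identifier": "HGNC:28857"},
                    "relatedIdentifiers": [
                        {"identifier": "ZNF682"},
                        {"identifier": "ZNF675"},
                        {"identifier": "ZNF430"},
                    ],
                }
            ],
        }
    ],
}

print(material_identifiers(sample_distribution))  # ['HGNC:28857']
```

Note that nothing in the traversal can tell a homolog from any other kind of Material; the only signal available is the "@type" value.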
I'm trying to map this all onto my own mental map of biology and not having much luck.
Hi,
Would it be possible to use the compact identifiers [1] form for all the identifiers? This means you will also be able to resolve the identifiers using identifiers.org or n2t (KC2, team Sodium work).
Cheers, Sarala
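As a rough illustration of what the compact form buys you, the sketch below builds resolver links from an identifier such as MGI:99205. The function name `resolver_urls` is hypothetical, and the URL patterns (resolver host followed by prefix:accession) are stated here as an assumption about how identifiers.org and n2t resolve compact identifiers; whether a given prefix is registered with either resolver is not checked.

```python
def resolver_urls(compact_id):
    """Build resolver links for a compact identifier of the form prefix:accession."""
    prefix, sep, accession = compact_id.partition(":")
    if not sep or not prefix or not accession:
        raise ValueError(f"not a compact identifier: {compact_id!r}")
    return {
        # Assumed URL patterns: resolver host + "/" + prefix:accession.
        "identifiers.org": f"https://identifiers.org/{compact_id}",
        "n2t": f"https://n2t.net/{compact_id}",
    }


for name, url in resolver_urls("MGI:99205").items():
    print(name, url)
```

A bare local ID such as 99205 is rejected here, which is one practical argument for storing the prefixed form everywhere.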
@cmungall thanks for looking into this. We will review and get back to you soon.
I'm sorry, I'm just seeing this issue. What's the best way for me/TOPMed to get more context about this? I'm not sure what we're reviewing for, or who the best person would be.
@bheavner Please let the Oxygen team (Anu) know if you would like to get additional information about the mapping process.
The Oxygen team has generated mapping files for mapping the metadata to the crosscut metadata model aka DATS model. We would like to request feedback and engage in a discussion on improving the mappings. The PR for the mapping files is https://github.com/dcppc/crosscut-metadata/pull/22