Requesting feedback on mapping files

aegururaj commented 6 years ago

The Oxygen team has generated mapping files that map the metadata to the crosscut metadata model (the DATS model). We would like to request feedback and engage in a discussion on improving the mappings. The PR for the mapping files is https://github.com/dcppc/crosscut-metadata/pull/22

cmungall commented 6 years ago

I looked at AGR_FB_Mapping (I assume the other MOD files are identical since the format is identical across the Alliance files).

I previously expressed some of my concerns here: https://github.com/dcppc/crosscut-metadata/issues/21

I'm not sure the DATS Dimension model is appropriate for representing the Alliance data. Even when representing basic gene information, the mapping is lossy. For example, multiple fields with distinct semantics (displayName, prefix, localid) are all mapped to relatedIdentifiers.

On row 22, type is mapped to title. Is this a mistake?

It looks like the ortholog mapping is lossy; it's not clear how a homology query could be performed on the transformed data.

It may be that I am misunderstanding the mappings file. Is there an example DATS JSON file? That would really help.

aegururaj commented 6 years ago

@cmungall We have the Elasticsearch endpoint here that has 5 MGI sample DATS JSON files: MGI/5622662, MGI/5622581, MGI/1346023, MGI/106092, MGI/99205

cmungall commented 6 years ago

Thanks! Can you provide URLs to get to the JSON?

aegururaj commented 6 years ago

Sure, there you go, MGI sample DATS JSON

cmungall commented 6 years ago

Thanks again!

It looks like things are not being mapped at the correct level. For example,

"isAbout": [
    {
        "@type": "MolecularEntity",
        "name": "",
        "taxonomy": [
            {
                "@type": "TaxonomicInformation",
                "name": "Zfp58",
                "identifier": {
                    "identifier": "NCBITaxon:10090",
                    "identifierSource": "NCBITaxon:10090"
                },
                "relatedIdentifiers": [
                    {
                        "identifier": "RIKEN cDNA A530094I17 gene",
                        "identifierSource": "RIKEN cDNA A530094I17 gene",
                        "relationType": "RIKEN cDNA A530094I17 gene"
                    }
                ]
            }
        ],

Zfp58 is the gene symbol, not the name of the taxon (which should be Mus musculus). Similarly, the RIKEN identifiers are at the level of the gene, not the taxon.
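
To make the point concrete, here is a rough Python sketch of what "the correct level" could look like: it lifts the gene symbol and the gene-level relatedIdentifiers out of the taxonomy entry and onto the MolecularEntity, then labels the taxon with its own name. The helper lift_gene_fields and the TAXON_NAMES lookup are hypothetical names made up for illustration, and whether relatedIdentifiers is the right slot for the RIKEN name at all is a separate question this does not settle.

```python
import copy

# Illustrative lookup only; a real mapping would consult the NCBI taxonomy.
TAXON_NAMES = {"NCBITaxon:10090": "Mus musculus"}


def lift_gene_fields(entity):
    """Return a copy of a MolecularEntity with gene-level fields moved
    off its taxonomy entries (hypothetical helper, not part of DATS)."""
    fixed = copy.deepcopy(entity)
    for taxon in fixed.get("taxonomy", []):
        taxon_id = taxon.get("identifier", {}).get("identifier")
        # The gene symbol was stored as the taxon name; move it up if the
        # entity itself has no name yet.
        if not fixed.get("name"):
            fixed["name"] = taxon.get("name", "")
        # Gene-level identifiers belong on the entity, not on the taxon.
        fixed.setdefault("relatedIdentifiers", []).extend(
            taxon.pop("relatedIdentifiers", [])
        )
        # Only relabel the taxon when we actually know its name.
        if taxon_id in TAXON_NAMES:
            taxon["name"] = TAXON_NAMES[taxon_id]
    return fixed


example = {
    "@type": "MolecularEntity",
    "name": "",
    "taxonomy": [
        {
            "@type": "TaxonomicInformation",
            "name": "Zfp58",
            "identifier": {
                "identifier": "NCBITaxon:10090",
                "identifierSource": "NCBITaxon:10090",
            },
            "relatedIdentifiers": [
                {
                    "identifier": "RIKEN cDNA A530094I17 gene",
                    "identifierSource": "RIKEN cDNA A530094I17 gene",
                    "relationType": "RIKEN cDNA A530094I17 gene",
                }
            ],
        }
    ],
}

fixed = lift_gene_fields(example)
```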

For the identifiers, there are things like:

"relatedIdentifiers": [
    {
        "identifierSource": "MGI",
        "identifier": "MGI:99205",
        "relationType": "gene"
    },
    {
        "identifier": "MGI:99205",
        "relationType": "gene",
        "identifierSource": "MGI"
    },
    {
        "identifierSource": "MGI",
        "identifier": "99205",
        "relationType": "gene"
    },
    {
        "identifierSource": "MGI",
        "identifier": "MGI:99205",
        "relationType": "gene"
    }
],

I suggest having a single canonical identifier, expressed as a CURIE such as MGI:99205. This would facilitate JSON-LD to RDF conversion using a canonical context file.
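
As a minimal sketch of that suggestion, the four near-duplicate entries above can be collapsed into one CURIE. The function names canonical_curie and dedupe_identifiers are hypothetical, and the sketch assumes that identifierSource holds the CURIE prefix, which is true of the MGI entries shown here but may not hold for other sources.

```python
def canonical_curie(entry):
    """Build a prefix:local_id CURIE from one relatedIdentifiers entry.

    Assumes identifierSource is the CURIE prefix (e.g. "MGI").
    """
    prefix = entry["identifierSource"] + ":"
    local_id = entry["identifier"]
    # Some entries already carry the prefix, others only the bare number.
    if not local_id.startswith(prefix):
        local_id = prefix + local_id
    return local_id


def dedupe_identifiers(entries):
    """Return the distinct CURIEs in first-seen order."""
    seen = []
    for entry in entries:
        curie = canonical_curie(entry)
        if curie not in seen:
            seen.append(curie)
    return seen


related = [
    {"identifierSource": "MGI", "identifier": "MGI:99205", "relationType": "gene"},
    {"identifier": "MGI:99205", "relationType": "gene", "identifierSource": "MGI"},
    {"identifierSource": "MGI", "identifier": "99205", "relationType": "gene"},
    {"identifierSource": "MGI", "identifier": "MGI:99205", "relationType": "gene"},
]

curies = dedupe_identifiers(related)
```

With the entries from the sample file, curies ends up holding the single value MGI:99205.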

I'm looking for the homology information; it seems to be embedded inside Material objects:

"characteristics": [
    {
        "name": "",
        "@type": "Material",
        "identifier": {
            "identifier": "HGNC:28857"
        },
        "values": [
            "low",
            "false",
            "false",
            "13",
            "67500555",
            "67490167"
        ],
        "relatedIdentifiers": [
            {
                "identifier": "ZNF682"
            },
            {
                "identifier": "ZNF675"
            },
            {
                "identifier": "ZNF430"
            },

I don't really know what a Material is here, or what the list of values is intended to represent.

Overall I'm still not quite sure I grok the data model. Each gene is modeled as a DatasetDistribution. The DatasetDistribution conformsTo a SO type such as 'gene' and isAbout a MolecularEntity (which doesn't have a type field). The MolecularEntity has characteristics which are Materials. Each Material also has identifiers, but these seem to be gene symbols. It looks like the Materials actually represent the orthologous genes, but there is nothing to indicate that these are homologs, and it's not clear why a gene is a MolecularEntity if it's in the species of interest and a Material if it's in another species.

I'm trying to map this all onto my own mental map of biology and not having much luck.

sarala commented 6 years ago

Hi,

Would it be possible to use the compact identifier form [1] for all the identifiers? You would then also be able to resolve the identifiers using identifiers.org or n2t (KC2, team Sodium work).
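
For illustration, a compact identifier of the form prefix:accession can be turned into a resolvable URL by appending it to the resolver's base URL. The Python sketch below only builds the strings and makes no network call; the function name resolver_urls and the loose shape check are assumptions for this example, and whether a given prefix is actually registered with either resolver has to be checked separately.

```python
import re

# Loose shape check for prefix:accession; not the resolvers' own validation.
COMPACT_ID = re.compile(r"^[A-Za-z0-9_.]+:\S+$")

RESOLVERS = {
    "identifiers.org": "https://identifiers.org/",
    "n2t": "https://n2t.net/",
}


def resolver_urls(compact_id):
    """Return candidate resolver URLs for one compact identifier."""
    if not COMPACT_ID.match(compact_id):
        raise ValueError("not a compact identifier: " + repr(compact_id))
    return {name: base + compact_id for name, base in RESOLVERS.items()}


urls = resolver_urls("MGI:99205")
```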

Cheers, Sarala

[1] Wimalaratne, S.M., et al., Uniform resolution of compact identifiers for biomedical data. Sci Data, 2018. 5: p. 180029.

aegururaj commented 6 years ago

@cmungall Thanks for looking into this. We will review and get back to you soon.

bheavner commented 6 years ago

I'm sorry, I'm just seeing this issue. What's the best way for me/TOPMed to get more context about this? I'm not sure what we're reviewing for, or who the best person would be.

aegururaj commented 6 years ago

@bheavner Please let the Oxygen team (Anu) know if you would like to get additional information about the mapping process.

dcppc / data-stewards

Requesting feedback on mapping files #24