BioSchemas / specifications

Issue tracker, technical wiki, and example markup
https://bioschemas.org

Protein Profile for sites that mix species data #345

Open AlasdairGray opened 5 years ago

AlasdairGray commented 5 years ago

Protein profile assumes a single protein for a single species. Sites such as Guide to Pharmacology focus more on the interaction rather than the protein. Thus their page on A1 Receptor includes data about 3 species. While the data can be separated into the different species, there would only be one page identifier.

How should we properly model this with the Protein Profile?

AlasdairGray commented 5 years ago

A first approach could be just to model a single species. Another approach could be to create sub-identifiers of the form https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18#homosapien

AlasdairGray commented 4 years ago

Another option is just to model the protein and not define a species.

ljgarcia commented 4 years ago

taxonomicRange is recommended and takes MANY. Why is that MANY not enough to cover this case? Do you have an example?
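A minimal sketch of what this suggestion might look like: a single Protein node for the page, with all three species listed under taxonomicRange. The taxon IDs are the ones used elsewhere in this thread (the rat ID is assumed to be 10116, as in the later GtoPdb markup). Building the markup in Python is purely for illustration.

```python
import json

# Species covered by the GtoPdb A1 receptor page, per this thread.
TAXA = [("9606", "Human"), ("10090", "Mouse"), ("10116", "Rat")]

protein = {
    "@context": "http://schema.org",
    "@type": "Protein",
    "@id": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18",
    "name": "A1 receptor",
    # taxonomicRange takes MANY, so a list of Taxon nodes is used here.
    "taxonomicRange": [
        {
            "@id": f"https://identifiers.org/taxonomy:{tax_id}",
            "@type": "Taxon",
            "name": name,
        }
        for tax_id, name in TAXA
    ],
}

print(json.dumps(protein, indent=2))
```

A flat list like this records which species apply, but it has no obvious place for per-species gene or UniProt links.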

AlasdairGray commented 4 years ago

@simondharding I have created a first version of the GtP markup for a Protein. This should conform to the 0.9-DRAFT of the Protein Profile. There are more properties in the profile that can be used to mark up the other data that you have on the page. Feel free to extend the example with these properties. You probably need to do it in the species-specific sub-parts. For some of the other receptor families, e.g. Enzymes, we can use other profiles.

@ljgarcia it would be good to get your opinion on the modelling. I have created a species-less main entity and then used the hasBioChemEntityPart property to add in the three species that they have in the database. Also, what is the property to use to link a protein to its protein family, and what type should the protein family page have? Is it just a protein?

simondharding commented 4 years ago

@AlasdairGray thanks, this looks promising. We should include the UniProt ID for the protein (for each species) - which property should I use to do this? sameAs ? https://www.uniprot.org/uniprot/P30542

ljgarcia commented 4 years ago

@AlasdairGray Is this "part" a BioChemEntity?

    {
      "isEncodedByBioChemEntity": {
        "@type": "Gene",
        "name": "adenosine A1 receptor",
        "identifier": "ADORA1",
        "hasRepresentation": "1q32.1"
      },
      "taxonomicRange": {
        "@id": "https://identifiers.org/taxonomy:9606",
        "@type": "Taxon",
        "name": "Human"
      }
    },

Given the range of hasBioChemEntityPart, I am guessing yes. If so, why is the type not included? I am also guessing these parts are Protein; again, the type should be used. One disadvantage here is the lack of "@id" for the protein parts, so there is no link to any actual entity.

@simondharding If those parts indeed correspond to a UniProt entry, you could use it directly in your markup. This would solve the "@id" problem for the protein parts:

    "hasBioChemEntityPart": [
      { "@id": "http://purl.uniprot.org/uniprot/P30542" }
    ]

simondharding commented 4 years ago

@AlasdairGray @ljgarcia so something like this:

    {
      "@context": "http://schema.org",
      "@type": "DataRecord",
      "@id": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18#",
      "includedInDataset": "https://www.guidetopharmacology.org/index.jsp#dataset",
      "citation": {
        "@id": "https://doi.org/10.2218/gtopdb/F3/2019.4",
        "@type": "ScholarlyPublication"
      },
      "mainEntity": {
        "@id": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18",
        "@type": "Protein",
        "http://purl.org/dc/terms/conformsTo": "https://bioschemas.org/specifications/Protein/0.9-DRAFT",
        "identifier": "18",
        "name": "A1 receptor",
        "description": "class A G protein-coupled receptor",
        "alternateName": ["RDC7", "adenosine receptor A1", "A1-AR", "A1R"],
        "url": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=18",
        "hasBioChemEntityPart": [
          { "@id": "http://purl.uniprot.org/uniprot/P30542" },
          { "@id": "http://purl.uniprot.org/uniprot/Q60612" },
          { "@id": "http://purl.uniprot.org/uniprot/P25099" }
        ]
      }
    }

AlasdairGray commented 4 years ago

Good point @ljgarcia about the lack of type and identifier for the subparts. The two options would be:

  1. Directly using UniProt
    ...
    "hasBioChemEntityPart": [
      {
        "@id": "http://purl.uniprot.org/uniprot/P30542",
        "@type": "Protein"
      },
      {
        "@id": "http://purl.uniprot.org/uniprot/Q60612",
        "@type": "Protein"
      },
      {
        "@id": "http://purl.uniprot.org/uniprot/P25099",
        "@type": "Protein"
      }
    ],
    ...
  2. Using sameAs link
    ...
    "hasBioChemEntityPart": [
      {
        "@type": "Protein",
        "sameAs": "http://purl.uniprot.org/uniprot/P30542",
        "isEncodedByBioChemEntity": {
          "@type": "Gene",
          "name": "adenosine A1 receptor",
          "identifier": "ADORA1",
          "hasRepresentation": "1q32.1"
        },
        "taxonomicRange": {
          "@id": "https://identifiers.org/taxonomy:9606",
          "@type": "Taxon",
          "name": "Human"
        }
      },
      {
        "@type": "Protein",
        "sameAs": "http://purl.uniprot.org/uniprot/Q60612",
        "isEncodedByBioChemEntity": {
          "@type": "Gene",
          "name": "adenosine A1 receptor",
          "identifier": "Adora1",
          "hasRepresentation": "1 E4"
        },
        "taxonomicRange": {
          "@id": "https://identifiers.org/taxonomy:10090",
          "@type": "Taxon",
          "name": "Mouse"
        }
      },
      {
        "@type": "Protein",
        "sameAs": "http://purl.uniprot.org/uniprot/P25099",
        "isEncodedByBioChemEntity": {
          "@type": "Gene",
          "name": "adenosine A1 receptor",
          "identifier": "Adora1",
          "hasRepresentation": "13q13"
        },
        "taxonomicRange": {
          "@id": "https://identifiers.org/taxonomy:10116",
          "@type": "Taxon",
          "name": "Rat"
        }
      }
    ]
    ...

At this point, UniProt does not have Bioschemas markup, so the second approach means that there will be data available for constructing the knowledge graph. The first approach gives a more direct link to UniProt, but means that GtP is not making any assertions about the data.
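As a rough sketch of the concern raised above about missing types and identifiers, the following hypothetical helper (not part of any Bioschemas tooling) flags parts that lack an "@type", or that have neither an "@id" nor a "sameAs" link. Both options above pass, while the untyped, unlinked parts of the first draft do not.

```python
def parts_missing_type_or_link(entity):
    """Return the indexes of hasBioChemEntityPart entries that lack "@type",
    or that have neither "@id" nor "sameAs"."""
    problems = []
    for index, part in enumerate(entity.get("hasBioChemEntityPart", [])):
        has_type = "@type" in part
        has_link = "@id" in part or "sameAs" in part
        if not (has_type and has_link):
            problems.append(index)
    return problems


# Option 1: the part is the UniProt entry itself.
option_1 = {
    "hasBioChemEntityPart": [
        {"@id": "http://purl.uniprot.org/uniprot/P30542", "@type": "Protein"}
    ]
}

# Option 2: a local node that points to UniProt through sameAs.
option_2 = {
    "hasBioChemEntityPart": [
        {"@type": "Protein", "sameAs": "http://purl.uniprot.org/uniprot/P30542"}
    ]
}

# First draft: neither a type nor a link on the part.
first_draft = {
    "hasBioChemEntityPart": [
        {"taxonomicRange": {"@id": "https://identifiers.org/taxonomy:9606"}}
    ]
}

for label, entity in [("option 1", option_1), ("option 2", option_2), ("first draft", first_draft)]:
    print(label, parts_missing_type_or_link(entity))
```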

simondharding commented 4 years ago

@AlasdairGray does HGNC have Bioschemas markup? I wonder if the Gene @type should include the HGNC ID, and likewise the MGI and RGD IDs for mouse and rat, rather than the gene symbol as the identifier.

ljgarcia commented 4 years ago

As @AlasdairGray suggests, having the sameAs would link to UniProt and would also provide the data. Once UniProt supports bioschemas markup, it could be removed to avoid duplication.

@simondharding What is done with UniProt proteins can also be done with genes. I do not see HGNC at https://bioschemas.org/liveDeploys/, so an approach similar to the one suggested for UniProt would be the way to go for now. The identifier could still be the gene symbol, if that is what you actually use as the identifier or what HGNC uses (as this seems to be your reference database for genes). An Ensembl ID is another possibility for gene IDs.

simondharding commented 4 years ago

Hi @AlasdairGray @ljgarcia I've got the following prepared for the target page on GtoPdb. I've included the proteins and genes all under "hasBioChemEntityPart" . Ideally, I'd use the isEncodedByBioChemEntity as a subclause for each protein. But there are cases where more than one gene and protein per species are included on our target pages. Happy to discuss. But useful to know how this looks.

<!-- BioSchemas Mark-Up For Targets -->
        <script type="application/ld+json">  
            {
                "@context": "http://schema.org", 
                "@id": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=19#",
                "@type": "DataRecord",
                "includedInDataset": {
                    "@type": "Dataset",
                    "@id": "https://www.guidetopharmacology.org/index.jsp#dataset"
                },
                "citation": {
                    "@id": "",
                    "@type": "ScholarlyPublication"
                },
                "mainEntity": {
                    "@id": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=19",
                    "@type": "Protein",
                    "http://purl.org/dc/terms/conformsTo": "https://bioschemas.org/specifications/Protein/0.9-DRAFT",
                    "identifier": "19",
                    "name": "A<sub>2A</sub> receptor",
                    "description": "A<sub>2A</sub> receptor",
                    "alternateName": ["RDC8","A2-AR","adenosine receptor A2a"],
                    "url": "https://www.guidetopharmacology.org/GRAC/ObjectDisplayForward?objectId=19",

                    "hasBioChemEntityPart": [
                        {
                            "@type": "Protein",
                            "sameAs": "https://www.uniprot.org/uniprot/P29274",
                            "taxonomicRange": {
                                "@id": "https://identifiers.org/taxonomy:9606",
                                "@type": "Taxon",
                                "name": "Human"
                            }
                        },
                        {
                            "@type": "Protein",
                            "sameAs": "https://www.uniprot.org/uniprot/Q60613",
                            "taxonomicRange": {
                                "@id": "https://identifiers.org/taxonomy:10090",
                                "@type": "Taxon",
                                "name": "Mouse"
                            }
                        },
                        {
                            "@type": "Protein",
                            "sameAs": "https://www.uniprot.org/uniprot/P30543",
                            "taxonomicRange": {
                                "@id": "https://identifiers.org/taxonomy:10116",
                                "@type": "Taxon",
                                "name": "Rat"
                            }
                        },
                        {
                            "@type": "Gene",
                            "sameAs": "https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=2049",
                            "name": "Adora2a",
                            "identifier": "2049",
                            "hasRepresentation": "20p12",
                            "taxonomicRange": {
                                "@id": "https://identifiers.org/taxonomy:10116",
                                "@type": "Taxon",
                                "name": "Rat"
                            }
                        },
                        {
                            "@type": "Gene",
                            "sameAs": "https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:263",
                            "name": "ADORA2A",
                            "identifier": "263",
                            "hasRepresentation": "22q11.23",
                            "taxonomicRange": {
                                "@id": "https://identifiers.org/taxonomy:9606",
                                "@type": "Taxon",
                                "name": "Human"
                            }
                        },
                        {
                            "@type": "Gene",
                            "sameAs": "http://www.informatics.jax.org/marker/MGI:99402",
                            "name": "Adora2a",
                            "identifier": "MGI:99402",
                            "hasRepresentation": "10",
                            "taxonomicRange": {
                                "@id": "https://identifiers.org/taxonomy:10090",
                                "@type": "Taxon",
                                "name": "Mouse"
                            }
                        }
                    ]
                }
            }
        </script>
<!-- END OF BioSchemas Mark-Up -->

ljgarcia commented 4 years ago

Hi @simondharding

It looks good although having both proteins and genes as targets for hasBioChemEntityPart seems odd (to me).

If you add isEncodedByBioChemEntity to the UniProt proteins, and that points to genes, will you still need the genes as targets of hasBioChemEntityPart?

Also, not sure what you mean by "But there are cases where more than one gene and protein per species are included on our target pages". If that is adding more of your proteins to the mainEntity, then using a list would solve it. If that is adding more proteins/genes to the hasBioChemEntityPart, I am not sure why this would be an issue.
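To illustrate the nesting suggested above, here is a minimal sketch in which each species-specific Protein part carries its own genes under isEncodedByBioChemEntity, so genes no longer sit directly in hasBioChemEntityPart. The helper name protein_part is hypothetical, and passing a list of genes assumes the property accepts multiple values, which should be checked against the profile. The rat values are taken from the markup earlier in this thread.

```python
import json


def protein_part(uniprot_accession, taxon_id, taxon_name, genes):
    """Build one species-specific Protein part; `genes` is a list of Gene nodes."""
    return {
        "@type": "Protein",
        "sameAs": f"https://www.uniprot.org/uniprot/{uniprot_accession}",
        "taxonomicRange": {
            "@id": f"https://identifiers.org/taxonomy:{taxon_id}",
            "@type": "Taxon",
            "name": taxon_name,
        },
        # Assumption: a list is acceptable here, to cover more than one gene per protein.
        "isEncodedByBioChemEntity": genes,
    }


rat_gene = {
    "@type": "Gene",
    "sameAs": "https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=2049",
    "name": "Adora2a",
    "identifier": "2049",
    "hasRepresentation": "20p12",
}

rat_part = protein_part("P30543", "10116", "Rat", [rat_gene])
print(json.dumps({"hasBioChemEntityPart": [rat_part]}, indent=2))
```

With this shape, a page that lists several proteins for one species would simply add more entries to the hasBioChemEntityPart list.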

Cheers,

gtsueng commented 2 years ago

So if I have a specific human gene and I want to link the identifiers for all the homologous and orthologous genes, I would model it using 'hasBioChemEntityPart'. Then if I want to link the gene to exons of the human gene, it would also be modeled using 'hasBioChemEntityPart'. Similarly, 'hasBioChemEntityPart' would be used to link both homologous proteins and protein subdomains to a protein. Did I understand this correctly? It feels a little confusing to me to mix actual biochemical parts (exons, subdomains) with homologs (complete genes/proteins in other species).

AlasdairGray commented 2 years ago

sdo:sameAs is not appropriate for all use cases, although some may choose to use it. Looking at the properties in BioChemEntity, the best we currently have would be bioChemSimilarity.

It may be that we want to think about proposing a new property for this case.
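As a tentative sketch only, the ortholog case could be written with bioChemSimilarity, which would keep hasBioChemEntityPart free for actual parts such as exons. The gene links are reused from the markup earlier in this thread. Whether bioChemSimilarity is semantically right for orthologs is exactly the open question here, so this is an assumption rather than agreed modelling.

```python
import json

human_gene = {
    "@context": "http://schema.org",
    "@type": "Gene",
    "sameAs": "https://www.genenames.org/data/gene-symbol-report/#!/hgnc_id/HGNC:263",
    "name": "ADORA2A",
    # Orthologs in other species: similar entities, not parts of the human gene.
    "bioChemSimilarity": [
        {
            "@type": "Gene",
            "sameAs": "http://www.informatics.jax.org/marker/MGI:99402",
            "name": "Adora2a",
        },
        {
            "@type": "Gene",
            "sameAs": "https://rgd.mcw.edu/rgdweb/report/gene/main.html?id=2049",
            "name": "Adora2a",
        },
    ],
}

print(json.dumps(human_gene, indent=2))
```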