Develop KGX serialization component for text-mined assertions

bill-baumgartner commented 2 years ago

Text-mined assertions and accompanying metadata are stored in a Cloud SQL DB. This task involves the serialization of the assertions and metadata using the KGX format to generate files that will be shared with other components within the Translator consortium.

Two flavors of KG will be serialized to KGX

Note that we will produce two versions of the text-mined targeted Biolink association KG. The versions will differ in how protein/gene nodes are modeled. By default our text mining pipelines link mentions of proteins in the text to species-non-specific Protein Ontology concepts (when possible). Our manual annotation work has demonstrated increased inter-annotator agreement using this strategy as it is often difficult to determine the precise species of a gene/protein mention in text. In the Protein Ontology, these species-non-specific concepts are ancestors of the species-specific concepts. One version of the serialized KG will use the default species-non-specific Protein Ontology concepts to represent gene/protein nodes.

The species-non-specific protein concepts, however, are not generally used by other Translator components. In order to better integrate with the Translator ecosystem we will produce an alternate KG that maps the species-non-specific protein concepts to human UniProt identifiers (when possible). This strategy has been discussed and, although imperfect, has been agreed upon as a way to initially overcome the gene/protein species issue. The implementation of this approach will require the addition of a mapping table to the underlying database that stores the assertions and associated metadata, e.g. text-mined sentences supporting the assertions. This mapping table will map from species-non-specific Protein Ontology identifiers to an appropriate human UniProt identifier, e.g.

PR	UniProt
PR:000013258	UniProtKB:Q9NR22
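
As a rough illustration of how the mapping could be applied when building the alternate KG, here is a minimal sketch. The in-memory dict stands in for the proposed mapping table, and the function names and the `subject`/`object` keys are hypothetical. It also assumes that an identifier with no mapping is left unchanged, which has not been decided in this issue.

```python
# Hypothetical in-memory stand-in for the proposed PR -> UniProt mapping table.
PR_TO_UNIPROT = {"PR:000013258": "UniProtKB:Q9NR22"}


def to_human_uniprot(node_id, mapping=PR_TO_UNIPROT):
    """Return the mapped human UniProt ID, or the original ID if unmapped (assumption)."""
    return mapping.get(node_id, node_id)


def remap_assertion(assertion, mapping=PR_TO_UNIPROT):
    """Return a copy of an assertion dict with its subject and object remapped."""
    remapped = dict(assertion)
    remapped["subject"] = to_human_uniprot(assertion["subject"], mapping)
    remapped["object"] = to_human_uniprot(assertion["object"], mapping)
    return remapped


example = remap_assertion({"subject": "PR:000013258", "object": "CHEBI:17824"})
```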

Proposed solution

A Docker container that can interface with the text-mined assertion DB and output data to file in the provisional KGX format described in this issue. When invoked, the container will query the DB for all text-mined assertions (excluding all metadata sentences that have been flagged as erroneous and any text-mined assertion that is supported only by erroneous sentences), write the assertions to file using the KGX format, and upload the KGX files to a user-specified GCP bucket. Two sets of KGX files will be generated, one for each of the KG flavors described above.
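
The exclusion rule in the paragraph above could look something like the sketch below. It works on plain dicts rather than the real DB schema, so the `evidence` and `erroneous` keys and the function name are assumptions made for illustration only. In practice the same filtering might be done in the SQL query itself.

```python
def filter_assertions(assertions):
    """Drop erroneous evidence sentences, then drop assertions left with no evidence."""
    kept = []
    for assertion in assertions:
        good = [s for s in assertion["evidence"] if not s["erroneous"]]
        if good:
            # Keep the assertion, but only with its non-erroneous sentences.
            kept.append({**assertion, "evidence": good})
    return kept


sample = [
    {"id": "a1", "evidence": [{"text": "s1", "erroneous": False},
                              {"text": "s2", "erroneous": True}]},
    {"id": "a2", "evidence": [{"text": "s3", "erroneous": True}]},
]
filtered = filter_assertions(sample)
```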

Input parameters should include:

Google storage bucket paths where the serialized KGX files for each KGX flavor will be uploaded
DB connection parameters
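
One possible shape for these input parameters, sketched with `argparse`. All flag names and the `DB_PASSWORD` environment variable are hypothetical placeholders, not an agreed interface.

```python
import argparse
import os


def build_parser():
    """Build the CLI parser; every flag name here is a hypothetical placeholder."""
    parser = argparse.ArgumentParser(
        description="Serialize text-mined assertions to KGX and upload them to GCP."
    )
    # One bucket path per KG flavor.
    parser.add_argument("--pr-bucket-path", required=True,
                        help="Bucket path for the Protein Ontology flavor")
    parser.add_argument("--uniprot-bucket-path", required=True,
                        help="Bucket path for the human UniProt flavor")
    # DB connection parameters.
    parser.add_argument("--db-host", required=True)
    parser.add_argument("--db-name", required=True)
    parser.add_argument("--db-user", required=True)
    parser.add_argument("--db-password", default=os.environ.get("DB_PASSWORD"),
                        help="Falls back to the DB_PASSWORD environment variable")
    return parser


args = build_parser().parse_args([
    "--pr-bucket-path", "gs://example-bucket/kgx/pr",
    "--uniprot-bucket-path", "gs://example-bucket/kgx/uniprot",
    "--db-host", "127.0.0.1",
    "--db-name", "text_mined_assertions",
    "--db-user", "kgx_writer",
    "--db-password", "not-a-real-password",
])
```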

Additional context

Note that there is a KGX library available. There have been some recent changes to add functionality to handle the new TRAPI attribute model. Those look to be complete, although there is an open issue related to the _attributes field that is relevant to this particular feature request.
Eventually the serialized KGX files will be uploaded to a KGX file registry, which I believe is undergoing final testing at the moment.
The KGX node file requires a name and category to accompany each identifier. We will make use of the SRI Node Normalization service to retrieve a label and category for each node identifier, as this data is not present in the Cloud SQL DB. The name selected should be the canonicalized label as defined by the SRI Node Normalization service (found under id --> label). The category selected should be the most specific category listed in the array of categories returned by the SRI Node Normalizer service (the first element of the type array). Names and categories, once retrieved from the SRI Node Normalizer service, will be cached in the Cloud SQL DB to speed up future processing and avoid redundant calls to the service. It is possible there will be identifiers that are not recognized by the SRI Node Normalizer service. In such cases, we will flag these identifiers for later inspection and will use UNKNOWN_NAME and biolink:NamedThing as placeholders for the name and category fields, respectively.

Output from the SRI Node Normalizer service is shown below for the input identifier CHEBI:17824. Note that we will make use of the canonical label Isopropyl alcohol even though the official label of the CHEBI concept is different (propan-2-ol). The category selected is the first in the type array, so biolink:SmallMolecule in this case. Also note that it is possible to batch multiple identifiers in the request to the SRI Node Normalizer (see the example provided here for details).

{
  "CHEBI:17824": {
    "id": {
      "identifier": "PUBCHEM.COMPOUND:3776",
      "label": "Isopropyl alcohol"
    },
    "equivalent_identifiers": [
      {
        "identifier": "PUBCHEM.COMPOUND:3776",
        "label": "Isopropyl alcohol"
      },
      {
        "identifier": "CHEMBL.COMPOUND:CHEMBL582",
        "label": "ISOPROPYL ALCOHOL"
      },
      {
        "identifier": "UNII:ND2M416302",
        "label": "ISOPROPYL ALCOHOL"
      },
      {
        "identifier": "CHEBI:17824",
        "label": "propan-2-ol"
      },
      {
        "identifier": "DRUGBANK:DB02325"
      },
      {
        "identifier": "MESH:D019840",
        "label": "2-Propanol"
      },
      {
        "identifier": "CAS:21388-65-8"
      },
      {
        "identifier": "CAS:33225-60-4"
      },
      {
        "identifier": "CAS:67-63-0"
      },
      {
        "identifier": "DrugCentral:4215",
        "label": "isopropanol"
      },
      {
        "identifier": "HMDB:HMDB0000863",
        "label": "Isopropyl alcohol"
      },
      {
        "identifier": "KEGG.COMPOUND:C01845",
        "label": "Propan-2-ol"
      },
      {
        "identifier": "INCHIKEY:KFZMGEQAYNKOFK-UHFFFAOYSA-N"
      }
    ],
    "type": [
      "biolink:SmallMolecule",
      "biolink:MolecularEntity",
      "biolink:ChemicalEntity",
      "biolink:PhysicalEssence",
      "biolink:NamedThing",
      "biolink:Entity",
      "biolink:PhysicalEssenceOrOccurrent"
    ]
  }
}
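
A minimal sketch of how the name and category might be pulled out of a response like the one above, following the rule as originally described (label under id, first element of the type array). The function name is hypothetical, and the sketch assumes an unrecognized identifier is either missing from the response or mapped to null. The third return value is a flag the caller could use to record identifiers for later inspection.

```python
UNKNOWN_NAME = "UNKNOWN_NAME"
DEFAULT_CATEGORY = "biolink:NamedThing"


def name_and_category(response, curie):
    """Return (name, category, recognized) for one identifier in a parsed response."""
    entry = response.get(curie)
    if not entry:
        # Assumption: unrecognized identifiers are absent or mapped to null.
        return UNKNOWN_NAME, DEFAULT_CATEGORY, False
    name = entry.get("id", {}).get("label", UNKNOWN_NAME)
    types = entry.get("type") or [DEFAULT_CATEGORY]
    return name, types[0], True


# Trimmed version of the response shown above.
response = {
    "CHEBI:17824": {
        "id": {"identifier": "PUBCHEM.COMPOUND:3776", "label": "Isopropyl alcohol"},
        "type": ["biolink:SmallMolecule", "biolink:MolecularEntity"],
    }
}
result = name_and_category(response, "CHEBI:17824")
missing = name_and_category(response, "FOO:123")
```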

bill-baumgartner commented 2 years ago

@edgargaticaCU - Just a heads up that SRI has released a new version of the Node Normalizer. I think there are two changes that we need to consider in our use of it when producing our KGX files:

1. This new version supports gene/protein conflation, and the conflation flag is true by default (see the docs here). For our use case, I think we want to set the conflation flag to false and we can allow users to activate conflation when they query as desired.
2. This new version no longer orders the categories that are returned. I believe we are assuming that the first category in the returned list is the most specific. Since this is no longer the case, we will need to determine which category is the most specific from those that are returned. We can make use of the Biolink ontology for this, and I've been told (Thanks @callahantiff!) that the RDFLib library would be a good resource to use for this in Python. We should be able to make use of this version of the Biolink OWL file and @callahantiff has a function called gets_entity_ancestors that could be used for this with rel set to rdfs:subClassOf.
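
To make the "most specific category" idea concrete, here is a rough sketch that does not depend on RDFLib or on gets_entity_ancestors. The `PARENTS` dict is a hand-written excerpt of Biolink parent relations and is an assumption that should be checked against the Biolink model. A real implementation would load the hierarchy from the Biolink OWL file instead. The idea is to keep only the returned categories that are not an ancestor of any other returned category.

```python
# Hand-written excerpt of Biolink parent relations (assumption, for illustration only).
PARENTS = {
    "biolink:SmallMolecule": ["biolink:MolecularEntity"],
    "biolink:MolecularEntity": ["biolink:ChemicalEntity"],
    "biolink:ChemicalEntity": ["biolink:NamedThing", "biolink:PhysicalEssence"],
    "biolink:PhysicalEssence": ["biolink:PhysicalEssenceOrOccurrent"],
    "biolink:NamedThing": ["biolink:Entity"],
}


def ancestors(category, parents=PARENTS):
    """Return every ancestor of a category by walking the parent relations."""
    seen = set()
    stack = list(parents.get(category, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def most_specific(categories, parents=PARENTS):
    """Return the categories that are not an ancestor of any other listed category."""
    all_ancestors = set()
    for category in categories:
        all_ancestors |= ancestors(category, parents)
    return [c for c in categories if c not in all_ancestors]


unordered = [
    "biolink:Entity",
    "biolink:ChemicalEntity",
    "biolink:PhysicalEssenceOrOccurrent",
    "biolink:SmallMolecule",
    "biolink:NamedThing",
    "biolink:MolecularEntity",
    "biolink:PhysicalEssence",
]
specific = most_specific(unordered)
```

The function returns a list because more than one category can remain when the returned categories sit on unrelated branches of the hierarchy. How to break such ties would still need to be decided.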

bill-baumgartner commented 2 years ago

Just to follow up. From Slack conversations, it sounds like SRI is considering adding the ordering of categories back into the Node Normalizer, so perhaps we can wait and see if that happens before addressing (2). While we wait, we might want to hardcode the categories based on the concept identifier prefixes. CHEBI identifiers can be mapped to biolink:ChemicalEntity, and PR and UniProt identifiers to biolink:Protein.
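
A minimal sketch of that stopgap, assuming the prefix is everything before the first colon and that UniProt identifiers carry the UniProtKB prefix, as in the mapping example in the issue description. The function name is hypothetical, and falling back to biolink:NamedThing for other prefixes mirrors the placeholder category proposed earlier.

```python
# Prefix-to-category mapping taken from the comment above.
PREFIX_TO_CATEGORY = {
    "CHEBI": "biolink:ChemicalEntity",
    "PR": "biolink:Protein",
    "UniProtKB": "biolink:Protein",
}


def category_for(curie):
    """Return a hardcoded category based on the identifier prefix."""
    prefix = curie.split(":", 1)[0]
    # Assumption: unknown prefixes fall back to the generic placeholder category.
    return PREFIX_TO_CATEGORY.get(prefix, "biolink:NamedThing")


categories = [category_for(c) for c in ("CHEBI:17824", "PR:000013258", "UniProtKB:Q9NR22")]
```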

NCATSTranslator / Text-Mining-Provider-Roadmap