add SMILES information for DrugBank nodes

saramsey commented 3 years ago

@chunyuma requested:

Hi Steve,

Can I ask where we access the relationship of DRUGBANK IDs and use them in KG2? I’m asking because I can’t use their API for this goal: DrugBank told me that their API is only for commercial clients. So I’m curious whether you know of other efficient ways to access the information for the DRUGBANK IDs used in KG2.

Thank you, Chunyu

If you look at KG2, we don't seem to have SMILES information for DrugBank nodes, but DrugBank definitely has SMILES information in it, as you can see by scrolling down the page for DRUGBANK:DB00316 (Tylenol): https://go.drugbank.com/drugs/DB00316

Can we include the SMILES information in the node synonym property, for the DrugBank ETL?

saramsey commented 3 years ago

Thank you @ericawood !

kvarforl commented 3 years ago

Nice @ericawood! Is all that's left to decide where to put the SMILES info in the node? Or have we decided to leave it out for now? Apologies if we've already discussed this and I forgot.

dkoslicki commented 3 years ago

For @chunyuma’s DTD model, it would be helpful to put the SMILES info on the node. I have no preference where to put it, but I imagine a property called SMILES would make sense.

ecwood commented 3 years ago

Would you (@chunyuma) like this to be included in the next build?

ecwood commented 3 years ago

Also, we started discussing a place to put sequences here: https://github.com/RTXteam/RTX/issues/1301#issuecomment-806181677

chunyuma commented 3 years ago

Hi @ericawood, it would be helpful to include SMILES in the next build. But actually I've already included them in my local version. If you need them, I can share. Thanks!

ecwood commented 3 years ago

@saramsey Where do you think we should put sequences on nodes? Should we create a new property? If so, what is the correct biolink name for that property?

saramsey commented 3 years ago

OK, since I have been confused for many months about how node attributes are to be represented in a KGX JSON-serialized Translator knowledge graph, I did a little test using the current version of KGX in the GitHub master branch. I ran this code:

#!/usr/bin/env python3

import networkx
import kgx.sink

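# Build a one-node graph; the second add_node call on the same ID merges
# the extra attribute (foo) into the existing node rather than creating a new one.
g = networkx.MultiDiGraph()
g.add_node('A', name='Node A', category=['biolink:NamedThing'])
g.add_node('A', foo='bar')
# Write each node's attribute dict to foo.json through the KGX JSON sink.
s = kgx.sink.JsonSink(filename="foo.json")
for n, data in g.nodes(data=True):
    s.write_node(data)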

which produces the following JSON file foo.json:

{
    "nodes": [
        {
            "name": "Node A",
            "category": [
                "biolink:NamedThing"
            ],
            "foo": "bar"
        }

So apparently, if a knowledge graph has a Biolink "attribute", then in the JSON serialization via KGX the attribute is just represented as a node property. I did a bit more work: I manually added an empty edges slot, changed the id to a proper-looking CURIE, and changed the attribute name to has_biological_sequence, like this:

{
    "nodes": [
        {
            "id": "KEGG:123",
            "name": "Node A",
            "category": [
                "biolink:NamedThing"
            ],
            "has_biological_sequence": "bar"
        }],
    "edges": [
     ]
}

and lo and behold, it validates:

(venv) sramsey-laptop:issue1273 sramsey$ kgx validate -i json foo.json
/Users/sramsey/Work/Proj/ncats-translator/git-master/RTX/code/kg2/venv/lib/python3.7/site-packages/biolinkml/__init__.py:158: UserWarning: Some URL processing will fail with python 3.7.5 or earlier.  Current version: sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
  warn(f"Some URL processing will fail with python 3.7.5 or earlier.  Current version: {sys.version_info}")
WARNING:root:created with is not a valid Biolink Model element
WARNING:root:created_with is not a valid Biolink Model element

Note that if we change the property name to sequence, KGX rejects it:

WARNING:root:sequence is not a valid Biolink Model element

So for UniProtKB nodes, we can add a has_biological_sequence property. I think maybe we should also use this for SMILES (with SMILES: as a prefix), until such time as there is explicit support for SMILES in Biolink.
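A minimal sketch of the convention proposed above, assuming a node is a plain dict. The helper names `set_sequence` and `get_smiles` are hypothetical (they are not part of KG2 or KGX), and the protein sequence is a made-up placeholder:

```python
from typing import Optional

SMILES_PREFIX = "SMILES:"  # assumed prefix convention, as proposed in this comment


def set_sequence(node: dict, value: str, is_smiles: bool = False) -> dict:
    """Return a copy of the node with has_biological_sequence set.

    SMILES strings get the SMILES: prefix so they can be told apart
    from amino acid or nucleotide sequences stored in the same property.
    """
    updated = dict(node)
    updated["has_biological_sequence"] = SMILES_PREFIX + value if is_smiles else value
    return updated


def get_smiles(node: dict) -> Optional[str]:
    """Return the SMILES string without its prefix, or None if the node has none."""
    seq = node.get("has_biological_sequence")
    if isinstance(seq, str) and seq.startswith(SMILES_PREFIX):
        return seq[len(SMILES_PREFIX):]
    return None


# Illustrative values only: a SMILES string for acetaminophen and a placeholder peptide.
drug = set_sequence({"id": "DRUGBANK:DB00316"}, "CC(=O)NC1=CC=C(O)C=C1", is_smiles=True)
protein = set_sequence({"id": "UniProtKB:P12345"}, "MKT")

print(drug["has_biological_sequence"])
print(get_smiles(protein))
```

A consumer that only wants chemical structures can call `get_smiles` and skip nodes for which it returns None.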

saramsey commented 3 years ago

I believe it is very unlikely that a node would have both a SMILES designation and an amino acid sequence.

saramsey commented 3 years ago

Note that this had to be intuited empirically and from studying the KGX source code. It would be good if someone could confirm my conclusions directly with the KGX team. But the fact that has_biological_sequence validates is a good sign, I think.

saramsey commented 3 years ago

Node and edge attributes are among the many reasons why I think, over time, we will end up drifting away from Neo4j. I realize that Neo4j supports node and edge properties, but the APOC json-importer (last time we tested it) was so slow that it was not usable for KG2. So we have to use the neo4j-admin TSV import feature, which can't handle node-specific properties, in the following sense: if a property is specified on one node, it requires a dedicated column in the TSV file, for all nodes. You can see why that might be a problem if we have a proliferation of node attributes for various Biolink type-specific or CURIE prefix-specific corner cases. We will see. For now, I'm fine with creating a has_biological_sequence node property. If you put it in kg2_util.add_node, that should ensure that it is always defined (it can default to None, of course).
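To illustrate the TSV constraint, here is a rough sketch using only the Python standard library. The `add_node` function below is a hypothetical stand-in, not the real kg2_util.add_node, and the output does not reproduce the actual neo4j-admin header format; it only shows that a property set on one node becomes a column for every node:

```python
import csv
import io


def add_node(node_id, name, has_biological_sequence=None):
    """Hypothetical stand-in for kg2_util.add_node: the property is always defined."""
    return {"id": node_id, "name": name,
            "has_biological_sequence": has_biological_sequence}


def nodes_to_tsv(nodes):
    """Write nodes as TSV; the columns are the union of all property names."""
    columns = []
    for node in nodes:
        for key in node:
            if key not in columns:
                columns.append(key)
    buf = io.StringIO()
    # restval fills in properties that a given node does not have.
    writer = csv.DictWriter(buf, fieldnames=columns, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(nodes)  # None values are written as empty fields
    return buf.getvalue()


nodes = [
    add_node("KEGG:123", "Node A"),
    add_node("DRUGBANK:DB00316", "Acetaminophen", "SMILES:CC(=O)NC1=CC=C(O)C=C1"),
]
nodes[0]["foo"] = "bar"  # a node-specific property present on one node only

tsv_text = nodes_to_tsv(nodes)
print(tsv_text)
```

Here the single `foo` property on one node adds a `foo` column that is empty for every other node, which is the proliferation concern raised above.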

ecwood commented 3 years ago

I realize that Neo4j supports node and edge properties but the APOC json-importer (last time we tested it) was so slow that it was not usable for KG2.

The bigger problem than speed was that loading in with JSON format was far more memory intensive.

saramsey commented 3 years ago

I realize that Neo4j supports node and edge properties but the APOC json-importer (last time we tested it) was so slow that it was not usable for KG2.

The bigger problem than speed was that loading in with JSON format was far more memory intensive.

Thank you, I defer to your superior memory, @ericawood

ecwood commented 3 years ago

577edb9 adds the has_biological_sequence node property to kg2_util.py and assigns a value to it in some of the ETL scripts. Here is the sequence "type" added by source:

DrugBank: SMILES
ChEMBL: Canonical SMILES
HMDB: SMILES
miRBase: "sequence" (appears to be amino acids)
UniProtKB: "sequence" (also appears to be amino acids)

In a few of those cases (ChEMBL, miRBase, and UniProtKB), the information was taken out of another location (such as synonym or description) and instead stored in has_biological_sequence. Of course, I am neither the user nor a biologist, so please correct me if you would like the data handled differently.

I would like to note that SMPDB/PathWhiz also has SMILES information for its compounds. However, in the interest of getting another build out faster, I am not going to pull that information out at this time, since those nodes are collapsed with DrugBank/HMDB later in KG2C anyway. I will do it later, when I improve that ETL script overall.

ecwood commented 3 years ago

This issue appears to be solved in KG2.6.0:

match (n) where n.has_biological_sequence <> "" return n.id, n.has_biological_sequence limit 100
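For checking the same thing outside Neo4j, a rough Python analogue of the query above over a KGX-style JSON document might look like this. The function name `nodes_with_sequence` and the `EXAMPLE:` nodes are hypothetical; the sketch assumes the graph is a dict with a `nodes` list, as in the JSON earlier in this thread:

```python
from itertools import islice


def nodes_with_sequence(graph, limit=100):
    """Return up to `limit` (id, has_biological_sequence) pairs.

    Nodes whose property is missing, None, or the empty string are skipped,
    mirroring how the Cypher comparison excludes null and "" values.
    """
    matches = (
        (node.get("id"), node["has_biological_sequence"])
        for node in graph.get("nodes", [])
        if node.get("has_biological_sequence") not in (None, "")
    )
    return list(islice(matches, limit))


example_graph = {
    "nodes": [
        {"id": "KEGG:123", "name": "Node A",
         "category": ["biolink:NamedThing"], "has_biological_sequence": "bar"},
        {"id": "EXAMPLE:1", "name": "No sequence", "has_biological_sequence": None},
        {"id": "EXAMPLE:2", "name": "Empty sequence", "has_biological_sequence": ""},
    ],
    "edges": [],
}

hits = nodes_with_sequence(example_graph)
print(hits)
```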

RTXteam / RTX

add SMILES information for DrugBank nodes #1273