saramsey closed this issue 3 years ago
Thank you @ericawood !
Nice, @ericawood! Is all that's left to decide where to put the SMILES info in the node? Or have we decided to leave it out for now? Apologies if we've already discussed this and I forgot.
For @chunyuma’s DTD model, it would be helpful to put the SMILES info on the node. I have no preference where to put it, but I imagine a property called SMILES would make sense.
Would you (@chunyuma) like this to be included in the next build?
Also, we started discussing a place to put sequences here: https://github.com/RTXteam/RTX/issues/1301#issuecomment-806181677
Hi @ericawood, it would be helpful to include SMILES in the next build. But actually I've already included them in my local version. If you need them, I can share. Thanks!
@saramsey Where do you think we should put sequences on nodes? Should we create a new property? If so, what is the correct biolink name for that property?
OK, since I have been confused for many months about how node attributes are to be represented in a KGX JSON-serialized Translator knowledge graph, I did a little test using the current version of KGX in the GitHub master branch. I ran this code:
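#!/usr/bin/env python3
import networkx
import kgx.sink

# Build a one-node graph; the second add_node call adds an extra attribute to 'A'
g = networkx.MultiDiGraph()
g.add_node('A', name='Node A', category=['biolink:NamedThing'])
g.add_node('A', foo='bar')

# Write each node's attribute dict through the KGX JSON sink
s = kgx.sink.JsonSink(filename="foo.json")
for n, data in g.nodes(data=True):
    s.write_node(data)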
which produces the following JSON file foo.json:
{
  "nodes": [
    {
      "name": "Node A",
      "category": [
        "biolink:NamedThing"
      ],
      "foo": "bar"
    }
So apparently, if a knowledge graph has a Biolink "attribute", in the JSON serialization via KGX, the attribute is just represented as a node property. I did a bit more work to manually add an empty edges slot, change the id to a proper-looking CURIE, and change the attribute name to has_biological_sequence, like this:
{
  "nodes": [
    {
      "id": "KEGG:123",
      "name": "Node A",
      "category": [
        "biolink:NamedThing"
      ],
      "has_biological_sequence": "bar"
    }
  ],
  "edges": []
}
and lo and behold, it validates:
(venv) sramsey-laptop:issue1273 sramsey$ kgx validate -i json foo.json
/Users/sramsey/Work/Proj/ncats-translator/git-master/RTX/code/kg2/venv/lib/python3.7/site-packages/biolinkml/__init__.py:158: UserWarning: Some URL processing will fail with python 3.7.5 or earlier. Current version: sys.version_info(major=3, minor=7, micro=4, releaselevel='final', serial=0)
warn(f"Some URL processing will fail with python 3.7.5 or earlier. Current version: {sys.version_info}")
WARNING:root:created with is not a valid Biolink Model element
WARNING:root:created_with is not a valid Biolink Model element
Note that if we change the property name to sequence, KGX rejects it:
WARNING:root:sequence is not a valid Biolink Model element
So for UniProtKB nodes, we can add a has_biological_sequence property. I think maybe we should also use this for SMILES (with SMILES: as a prefix), until such time as there is explicit support for SMILES in Biolink.
I believe it is very unlikely that a node would have both a SMILES designation and an amino acid sequence.
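To make the SMILES: prefix idea concrete, here is a minimal sketch of how a value could be tagged on the way into has_biological_sequence and recognized on the way out. The helper names (encode_sequence, decode_sequence) are hypothetical and not part of kg2_util or KGX, and the acetaminophen SMILES string is only there for illustration.

```python
from typing import Optional, Tuple

# Assumed tag, following the "SMILES:" prefix suggested above
SMILES_PREFIX = "SMILES:"


def encode_sequence(value: str, is_smiles: bool = False) -> str:
    """Return the string to store in has_biological_sequence (hypothetical helper)."""
    return SMILES_PREFIX + value if is_smiles else value


def decode_sequence(stored: Optional[str]) -> Tuple[Optional[str], Optional[str]]:
    """Split a stored value into (kind, payload); kind is 'smiles' or 'sequence'."""
    if not stored:
        return None, None
    if stored.startswith(SMILES_PREFIX):
        return "smiles", stored[len(SMILES_PREFIX):]
    return "sequence", stored


# Illustrative node dict; only the two keys shown here matter for the example
node = {
    "id": "DRUGBANK:DB00316",
    "has_biological_sequence": encode_sequence("CC(=O)NC1=CC=C(O)C=C1", is_smiles=True),
}
kind, payload = decode_sequence(node["has_biological_sequence"])
print(kind, payload)
```

A consumer such as a model-training script could call decode_sequence and skip nodes whose kind is not "smiles", which keeps the single property usable for both proteins and small molecules.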
Note that this had to be intuited empirically and from studying the KGX source code. It would be good if someone could confirm my conclusions directly with the KGX team. But the fact that has_biological_sequence validates is a good sign, I think.
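For anyone who wants to repeat the check without hand-editing the file, the following sketch rebuilds the same minimal document using only the standard library json module. It does not use KGX at all; make_kgx_style_doc is a hypothetical helper, and the layout is simply copied from the hand-edited example above.

```python
import json


def make_kgx_style_doc(nodes, edges=()):
    """Wrap node and edge dicts in the {"nodes": [...], "edges": [...]} layout."""
    return {"nodes": list(nodes), "edges": list(edges)}


doc = make_kgx_style_doc([
    {
        "id": "KEGG:123",
        "name": "Node A",
        "category": ["biolink:NamedThing"],
        "has_biological_sequence": "bar",
    }
])

# Serialize to a string; saving this text as foo.json gives a file to validate
text = json.dumps(doc, indent=2)
print(text)
```

Saving the printed text as foo.json and running the kgx validate command shown above should be a quick way to try other candidate property names.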
Node and edge attributes are among the many reasons why I think, over time, we will end up drifting away from Neo4j. I realize that Neo4j supports node and edge properties but the APOC json-importer (last time we tested it) was so slow that it was not usable for KG2. So we have to use the neo4j-admin TSV import feature, which can't handle node-specific properties in the following sense: if a property is specified on one node, it requires a dedicated column in the TSV file for all nodes. You can see why that might be a problem if we have a proliferation of node attributes for various Biolink type-specific or CURIE prefix-specific corner cases. We will see. For now, I'm fine with creating a has_biological_sequence node property. If you put it in kg2_util.add_node, that should ensure that it is always defined (it can default to None, of course).
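The column problem can be illustrated with a small sketch that flattens node dicts into TSV. This is not the KG2 export code and does not reproduce the header syntax that neo4j-admin expects; nodes_to_tsv and the EXAMPLE: identifiers are made up for illustration. The point is only that a property present on one node becomes a column, mostly empty, for every node.

```python
import csv
import io


def nodes_to_tsv(nodes):
    """Serialize node dicts to TSV, with one column per property seen on any node."""
    # The header is the union of all property names, in first-seen order
    columns = []
    for node in nodes:
        for key in node:
            if key not in columns:
                columns.append(key)
    buffer = io.StringIO()
    # restval fills in an empty cell for any property a node does not have
    writer = csv.DictWriter(buffer, fieldnames=columns, delimiter="\t",
                            restval="", lineterminator="\n")
    writer.writeheader()
    for node in nodes:
        # None (the default for an unset property) is also written as an empty cell
        writer.writerow({k: ("" if v is None else v) for k, v in node.items()})
    return buffer.getvalue()


nodes = [
    {"id": "EXAMPLE:1", "name": "protein", "has_biological_sequence": "MKT"},
    {"id": "EXAMPLE:2", "name": "disease"},
]
tsv_text = nodes_to_tsv(nodes)
print(tsv_text)
```

Defaulting the property to None when each node is created, as suggested above, has the same effect as restval here: every row ends up with a cell for the column, even when it is empty.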
I realize that Neo4j supports node and edge properties but the APOC json-importer (last time we tested it) was so slow that it was not usable for KG2.
The bigger problem than speed was that loading in with JSON format was far more memory intensive.
Thank you, I defer to your superior memory, @ericawood
577edb9 adds the has_biological_sequence node property to kg2_util.py and assigns a value to it in some of the ETL scripts. Here is the sequence "type" added by source:
In a few of those cases (ChEMBL, miRBase, and UniProtKB), the information was taken out of another location (such as synonym or description) and instead stored in has_biological_sequence. Of course, I am neither the user nor a biologist, so please correct me if you would like the data handled differently.
I would like to note that SMPDB/PathWhiz also has SMILES information for its compounds. However, to get another build out faster, I am not going to pull that information out at this time, since those nodes are collapsed with DrugBank/HMDB later in KG2C anyway. I will do it later, when I improve that ETL script overall.
This issue appears to be solved in KG2.6.0:
match (n) where n.has_biological_sequence <> "" return n.id, n.has_biological_sequence limit 100
@chunyuma requested:
If you look at KG2, we don't seem to have SMILES information for DrugBank nodes, but DrugBank definitely has SMILES information in it, as you can see at this link for DRUGBANK:DB00316 (Tylenol): https://go.drugbank.com/drugs/DB00316 (scrolling down the resulting page). Can we include the SMILES information in the node synonym property, for the DrugBank ETL?