callahantiff / PheKnowLator

PheKnowLator: Heterogeneous Biomedical Knowledge Graphs and Benchmarks Constructed Under Alternative Semantic Models
https://github.com/callahantiff/PheKnowLator/wiki
Apache License 2.0

Foreign characters in node labels #118

Closed sanyabt closed 2 years ago

sanyabt commented 2 years ago

Describe the bug Hi @callahantiff, I am generating paths using search algorithms in my (extended) PheKnowLator version and noticed foreign characters (mostly Chinese characters) in some of the node labels when mapping node URIs to labels. I checked in the current build of PheKnowLator (instance-based) just to make sure it wasn't just in my build and noticed the same issue for approximately 2000 nodes (examples below).

To Reproduce Steps to reproduce the behavior:

  1. Go to pheknowlator/current_build/knowledge_graphs/instance_builds/inverse_relations/owlnets/PheKnowLator_v3.0.0_full_instance_inverseRelations_OWLNETS_NodeLabels.txt
  2. Search entity: DOID_2622, UBERON_0000468, GO_0051179
  3. Labels appear to be foreign characters as below.
entity_type integer_id entity_uri label description/definition synonym
NODES 166144 http://purl.obolibrary.org/obo/DOID_2622 神经母细胞性肿瘤 None None
NODES 156011 http://purl.obolibrary.org/obo/UBERON_0000468 多细胞生物 Anatomical structure that is an individual mem... body|multi-cellular organism|whole organism|wh...
NODES 11195 http://purl.obolibrary.org/obo/GO_0051179 定位 Any process in which a cell, a

This is from the NodeLabels.txt file in the current_build. As far as I can tell, the issue is not caused by the source ontology labels: my own NodeLabels.txt file had foreign characters in the labels of different nodes than the ones above, and a quick lookup in the ontologies returns the expected class names (DOID_2622 -- neuroblastic tumor). Not sure if you've noticed this as well and it is already on your radar, but if not I am happy to look into it more and help with the fix! I created an issue for myself and then realized it might be useful here too.

Here is how I got the list of all nodes with foreign characters in labels from pheknowlator/current_build/knowledge_graphs/instance_builds/inverse_relations/owlnets/PheKnowLator_v3.0.0_full_instance_inverseRelations_OWLNETS_NodeLabels.txt (it's not the most efficient pandas-method so please excuse that 😅 )

import re
import pandas as pd

nodes = []
nodedf = pd.read_csv('pl-build_tc/PheKnowLator_v3.0.0_full_instance_inverseRelations_OWLNETS_NodeLabels.txt', sep='\t')

# collect the URI of every node whose label contains a CJK character
for i in range(len(nodedf.index)):
    label = nodedf.at[i, 'label']
    if isinstance(label, str):  # skip missing (NaN) labels
        uri = nodedf.at[i, 'entity_uri']
        if re.search("[\u4e00-\u9FFF]", label):
            nodes.append(uri)

# for specific examples
nodedf.loc[nodedf['entity_uri'] == nodes[0]]
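
The row-by-row loop above can also be written as a single vectorized filter. The following is a minimal sketch, not part of the original report: the function name `uris_with_cjk_labels` is hypothetical, and the two-row DataFrame is a tiny in-memory stand-in for the NodeLabels file so the example runs without downloading it.

```python
import pandas as pd

# CJK Unified Ideographs block, the same range used in the loop above.
CJK_PATTERN = r"[\u4e00-\u9fff]"


def uris_with_cjk_labels(node_df: pd.DataFrame) -> list:
    """Return the entity_uri of every row whose label contains a CJK character.

    Hypothetical helper: it assumes the 'label' and 'entity_uri' columns
    shown in the NodeLabels example.
    """
    # na=False makes missing labels count as "no match" instead of NaN.
    mask = node_df["label"].str.contains(CJK_PATTERN, regex=True, na=False)
    return node_df.loc[mask, "entity_uri"].tolist()


# Stand-in data: one affected row and one row with an English label.
sample = pd.DataFrame({
    "entity_uri": [
        "http://purl.obolibrary.org/obo/DOID_2622",
        "http://purl.obolibrary.org/obo/UBERON_0000468",
    ],
    "label": ["神经母细胞性肿瘤", "multicellular organism"],
})

flagged = uris_with_cjk_labels(sample)
print(flagged)
```

On the real file, the same function would be called on the DataFrame returned by `pd.read_csv(..., sep='\t')`.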


callahantiff commented 2 years ago

Thanks for the issue @sanyabt and for the awesome information! I actually think I know what's causing this. Let me confirm this evening and get back to you.

callahantiff commented 2 years ago

Hey @sanyabt! I found and resolved the bug. The funny part is that all 2,008 nodes that had an issue in the label were those that were pulled from the Cell Line Ontology. This matters because as it turns out, this ontology includes labels, synonyms, and definitions in multiple languages. While I think it's fine (and maybe even useful) to include the different languages in the synonyms, it's less helpful to have non-English labels and definitions. So, here is what I have done...

key = '<http://purl.obolibrary.org/obo/UBERON_0000468>'

print(bad_node_patch[key])
>>> {'label': 'multicellular organism', 'description/definition': 'Anatomical structure that is an individual member of a species and consists of more than one cell.'}

Hope this will be useful to you to get around the error. All future builds should no longer contain this error.
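
One possible way to apply such a patch dictionary to existing node rows is sketched below. It is an illustration under stated assumptions rather than the maintainer's code: `apply_patch` is a hypothetical helper, the patch keys are assumed to be URIs wrapped in angle brackets (as in the printed example), and the node rows are assumed to carry bare URIs in `entity_uri` (as in the NodeLabels rows earlier in the thread).

```python
def apply_patch(rows, bad_node_patch):
    """Return copies of rows, overriding fields that have a patch entry.

    Hypothetical helper. rows is a list of dicts with an 'entity_uri' key;
    bad_node_patch maps '<URI>' to replacement fields such as 'label'.
    """
    patched = []
    for row in rows:
        # Assumption: patch keys wrap the URI in angle brackets.
        entry = bad_node_patch.get("<{}>".format(row["entity_uri"]))
        new_row = dict(row)  # leave the input row untouched
        if entry is not None:
            new_row.update(entry)
        patched.append(new_row)
    return patched


# Patch entry copied from the printed example above.
patch = {
    "<http://purl.obolibrary.org/obo/UBERON_0000468>": {
        "label": "multicellular organism",
        "description/definition": (
            "Anatomical structure that is an individual member of a "
            "species and consists of more than one cell."
        ),
    }
}

rows = [{
    "entity_uri": "http://purl.obolibrary.org/obo/UBERON_0000468",
    "label": "多细胞生物",
}]

fixed = apply_patch(rows, patch)
print(fixed[0]["label"])
```

Rows without a matching patch entry pass through unchanged, so the helper can be run over a whole node list.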


Thanks so much again for pointing out this error! Let me know if there is anything else that is needed with respect to this issue! 💪 🙇‍♀️ 😄

sanyabt commented 2 years ago

Thank you!! This is super helpful and I really appreciate you doing it so quickly 😄

Do we know if it was random nodes from the Cell Line Ontology or all of them? It's weird that my NodeLabels file had different nodes with foreign character labels (close to 1900 nodes) but not all of them were the same as the 2008 we found above (example SO_0000704:gene). I might need to fix those as well till I can get to the next build. Thanks again!

callahantiff commented 2 years ago

Interesting! 🤔 Are all of the nodes that are missing from the dictionary I sent from the SO namespace? I am happy to help you recover those. Code-wise we should still be covered for future builds, but I want to make sure you are covered for the current build too.

sanyabt commented 2 years ago

I don't think so - I found nodes with GO and UBERON namespaces too. Should I share the file with you?

Sorry for the extra trouble - I can run the recent build if that is easier.

callahantiff commented 2 years ago

No trouble at all! Actually, I have a different idea. Give me a few hours and I will send you a new file. Does that sound OK?

callahantiff commented 2 years ago

haha, ignore the re-opening and closing of the issue, my computer just freaked out! Sorry about that 😊 .

sanyabt commented 2 years ago

Sounds good! Haha no worries 😄

callahantiff commented 2 years ago

OK, go ahead and download this file (bad_node_patch.json). Please note that this file contains all ontology classes, not just those with problematic labels, from the base set of merged ontologies. It should cover all of your nodes (I ran a check on my end using your code snippet above to verify).

sanyabt commented 2 years ago

Thank you, this is perfect!

callahantiff commented 2 years ago

Thank you for pointing out this bug!

sanyabt commented 2 years ago

Hi @callahantiff, sorry to open this again but I thought it would be better addressed on this thread - the PR namespace nodes have label = 'N/A' in the bad_node_patch.json file 😅

Examples: 'http://purl.obolibrary.org/obo/PR_Q6V1P9', 'http://purl.obolibrary.org/obo/PR_Q9NZV6-1', 'http://purl.obolibrary.org/obo/PR_Q9GZL7', 'http://purl.obolibrary.org/obo/PR_O76083-5', 'http://purl.obolibrary.org/obo/PR_A0A087WT02'...

About 54903 nodes have 'N/A' labels and 54844 are from the PR namespace.
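
A count like this can be reproduced by grouping the 'N/A' entries by namespace prefix. The sketch below is illustrative only: `namespace_of` and `count_missing_labels` are hypothetical names, the patch file is assumed to map URIs to dicts with a 'label' field, and the three-entry dictionary stands in for the real JSON file.

```python
from collections import Counter


def namespace_of(uri):
    """Return the OBO-style prefix of a URI, e.g. 'PR' for .../obo/PR_Q6V1P9."""
    local_id = uri.strip("<>").rsplit("/", 1)[-1]
    return local_id.split("_", 1)[0]


def count_missing_labels(patch, missing="N/A"):
    """Count entries whose label equals the missing marker, per namespace."""
    return Counter(
        namespace_of(uri)
        for uri, entry in patch.items()
        if entry.get("label") == missing
    )


# Stand-in for the loaded JSON file; the real one has far more entries.
sample_patch = {
    "http://purl.obolibrary.org/obo/PR_Q6V1P9": {"label": "N/A"},
    "http://purl.obolibrary.org/obo/PR_Q9NZV6-1": {"label": "N/A"},
    "http://purl.obolibrary.org/obo/UBERON_0000468": {
        "label": "multicellular organism"
    },
}

counts = count_missing_labels(sample_patch)
print(counts)
```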

callahantiff commented 2 years ago

haha oh boy, that's what I get for not running tests. OK, re-checking now. Be back in touch soon! 😄

callahantiff commented 2 years ago

OK, so this is not a bug per se, but really a bad assumption that I was making. I assumed that all ontologies would provide labels for the classes they define. This is unfortunately not always the case (as you proved, hehe 😱 ). I will need to think through where to add the code that addresses this specifically -- i.e., importing labels from UniProt. It should be no problem since we download most of this data anyway; I just need to extend the UniProt query.

In case you want to verify this, take a look at the http://purl.obolibrary.org/obo/pr.owl file: you can search for the URIs and confirm that the ontology does not include rdfs:label information for many of the classes that retain the UniProt identifiers.
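
The manual search can also be scripted. Here is a rough sketch that uses only the Python standard library and assumes the ontology is serialized as RDF/XML with labels given as `rdfs:label` child elements of `owl:Class`. The function name `classes_without_label` and the two-class sample document are hypothetical. For a file as large as pr.owl, a streaming parser such as `ET.iterparse` would be a better fit than loading the whole document.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "owl": "http://www.w3.org/2002/07/owl#",
}
RDF_ABOUT = "{%s}about" % NS["rdf"]


def classes_without_label(rdf_xml):
    """Return URIs of top-level owl:Class elements with no rdfs:label child."""
    root = ET.fromstring(rdf_xml)
    missing = []
    for cls in root.findall("owl:Class", NS):
        uri = cls.get(RDF_ABOUT)
        # Anonymous classes have no rdf:about and are skipped.
        if uri and cls.find("rdfs:label", NS) is None:
            missing.append(uri)
    return missing


# Hypothetical miniature ontology: one unlabeled and one labeled class.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/PR_Q6V1P9"/>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/PR_000000001">
    <rdfs:label>protein</rdfs:label>
  </owl:Class>
</rdf:RDF>"""

unlabeled = classes_without_label(SAMPLE)
print(unlabeled)
```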

So, I will think through how to extend the current functionality to catch nodes whose source ontology does not provide a label. Updates to come!

sanyabt commented 2 years ago

Sounds good, thank you so much! Let me know if you need any help :)

callahantiff commented 2 years ago

I finally figured out what was wrong, sorry for the delay. The protein ontology nodes that were missing were the result of changes that the PRO Consortium has made to their endpoint. Prior to October, there was no limit on the number of rows that you could return when querying their system. Now, they only allow you to download 10,000 rows at one time. This impacted the most recent build and introduced all of the missing values: the affected nodes were not fully added to the merged core set of ontologies (i.e., they appeared as a class, but were missing all of their associated metadata).
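
To illustrate how a row cap can silently truncate a result set, here is a generic sketch of the failure mode and of paging around it. Everything in it is hypothetical: `fetch_page` stands in for whatever function queries an endpoint, and `fake_fetch_page` simulates a server that returns at most 10,000 rows per request. It is not the code used in the PheKnowLator build.

```python
def fetch_all_rows(fetch_page, page_size=10000):
    """Collect every row by requesting consecutive pages until one is short.

    fetch_page is a hypothetical callable taking limit and offset keyword
    arguments and returning a list of rows.
    """
    rows = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        rows.extend(page)
        if len(page) < page_size:  # a short (or empty) page means we are done
            break
        offset += page_size
    return rows


# Simulated endpoint holding 25,000 rows but capping each response at 10,000.
FAKE_ROWS = list(range(25000))
SERVER_CAP = 10000


def fake_fetch_page(limit, offset):
    capped = min(limit, SERVER_CAP)
    return FAKE_ROWS[offset:offset + capped]


# A single large request is silently truncated to the cap...
single_request = fake_fetch_page(limit=1000000, offset=0)
# ...while paging recovers the full result set.
all_rows = fetch_all_rows(fake_fetch_page)
print(len(single_request), len(all_rows))
```

The single request returns without an error, which is why this kind of truncation tends to surface only later as missing metadata.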

I have fixed the code so that we no longer depend on the SPARQL endpoint, which will make for more stable builds in the future (#120). I also reverted the output in the current_builds directory on GCS to the September build, which does not have this error. I will also re-trigger an updated October build later this weekend, which should be available by the end of next week at the latest.

In the meantime, I also updated the bad_node_patch.json file so that it contains only non-foreign characters and is built from the core set of ontologies for the September build, which is not missing values for the labels. I hope this will work for you for the next few days until the October build is refreshed, which should be free from foreign characters and the weird missing data bug.

Thanks for helping me get to the bottom of this, you are awesome! 😄

sanyabt commented 2 years ago

Ohh that makes so much sense! No wonder I couldn't see the issue with my previous build's NodeLabels file 😂

A huge thank you for fixing it so quickly and creating the JSON file again! I won't start a new build till at least mid-November so take your time :) Have lots to update you about the natural product-drug interactions KG too! Closing the issue now

callahantiff commented 2 years ago

Sounds great! I am really looking forward to hearing those updates!! 😄