callahantiff / PheKnowLator

PheKnowLator: Heterogeneous Biomedical Knowledge Graphs and Benchmarks Constructed Under Alternative Semantic Models
https://github.com/callahantiff/PheKnowLator/wiki
Apache License 2.0

Issue 118 #119

Closed callahantiff closed 2 years ago

callahantiff commented 2 years ago

Purpose

This PR addresses two primary issues: (i) the bug described in issue #118, in which foreign characters were included in node labels and definitions, and (ii) the discovery that the HGNC FTP site has switched from ftp to http (ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt is now http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt). The second change also triggered an update to the way URLs are parsed to determine the download and associated site type.
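To illustrate why the URL parsing needed attention, here is a minimal sketch (not the code from this PR) that classifies a URL by its scheme rather than by looking for "ftp" anywhere in the string. The helper name `classify_url` and its return values are hypothetical; only the two HGNC URLs come from the description above.

```python
from urllib.parse import urlparse


def classify_url(url: str) -> str:
    """Hypothetical helper: return 'ftp', 'http', or 'unknown' from the URL scheme."""
    scheme = urlparse(url).scheme.lower()
    if scheme == 'ftp':
        return 'ftp'
    if scheme in ('http', 'https'):
        return 'http'
    return 'unknown'


old_url = 'ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt'
new_url = 'http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt'

# The host name still begins with "ftp", so a substring test such as
# `'ftp' in url` would treat both URLs as FTP downloads.
print(classify_url(old_url))
print(classify_url(new_url))
```

Checking the parsed scheme keeps the new http link from being routed to an FTP downloader just because its host is still named ftp.ebi.ac.uk.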

Scripts Impacted

Bug

FTP Link

callahantiff commented 2 years ago

Solution for Builds Prior to v3.0.2: The bad_node_patch.json file contains a nested dictionary. The outer keys are entity_uri values, and each outer value is an inner dictionary whose keys are label and description/definition and whose values are the updated strings without foreign characters. An example entry from this dictionary is shown below:

key = '<http://purl.obolibrary.org/obo/UBERON_0000468>'

print(bad_node_patch[key])
>>> {'label': 'multicellular organism', 'description/definition': 'Anatomical structure that is an individual member of a species and consists of more than one cell.'}
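One way the patch could be applied to an existing build is sketched below. This is an illustration under assumptions, not the project's own procedure: the helper `apply_node_patch` is hypothetical, the node file is assumed to be tab-delimited with `entity_uri`, `label`, and `description/definition` columns, and the toy rows (including the stray characters in the second label) are made up for the example.

```python
import csv
import io


def apply_node_patch(rows, patch):
    """Hypothetical helper: return copies of rows with patched fields overwritten.

    Rows whose entity_uri is not a key of the patch are copied unchanged.
    """
    patched_rows = []
    for row in rows:
        fixed = dict(row)
        fixed.update(patch.get(row['entity_uri'], {}))
        patched_rows.append(fixed)
    return patched_rows


# Toy stand-in for a node label file; contents are invented for illustration.
toy_file = io.StringIO(
    "entity_uri\tlabel\tdescription/definition\n"
    "<http://example.org/A>\tnode a\tAn untouched node.\n"
    "<http://purl.obolibrary.org/obo/UBERON_0000468>\t"
    "multicellular organism \u751f\u7269\tplaceholder text\n"
)
rows = list(csv.DictReader(toy_file, delimiter='\t'))

# In practice the patch would be loaded from bad_node_patch.json with json.load.
bad_node_patch = {
    '<http://purl.obolibrary.org/obo/UBERON_0000468>': {
        'label': 'multicellular organism',
        'description/definition': 'Anatomical structure that is an individual '
                                  'member of a species and consists of more '
                                  'than one cell.',
    }
}

patched = apply_node_patch(rows, bad_node_patch)
print(patched[1]['label'])
```

Because the inner keys of the patch match the column names, a plain dictionary update is enough to overwrite only the affected fields.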

The code to identify the nodes with erroneous foreign characters is shown below:

import re
import pandas as pd

# link to downloaded `NodeLabels.txt` file
input_file = 'NodeLabels.txt'

# load data as Pandas DataFrame
nodedf = pd.read_csv(input_file, sep='\t', header=0)

# identify bad nodes and filter DataFrame so it only contains these rows
nodedf['bad'] = nodedf['label'].apply(lambda x: re.search("[\u4e00-\u9FFF]", x) if not pd.isna(x) else None)
nodedf_bad_nodes = nodedf[~pd.isna(nodedf['bad'])].drop_duplicates()

callahantiff commented 2 years ago

@ChuckKollar - Just wanted to call this PR to your attention as well, since it resulted in some minor changes to the notebooks/OWLNETS_Example_Application.ipynb file. See here for details. I am happy to make a PR to your repos in the future for changes like this if that would be helpful. Just let me know! 😄

sonarcloud[bot] commented 2 years ago

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 0 (rating A)

Coverage: 100.0%
Duplication: 0.0%