Closed callahantiff closed 2 years ago
Solution for Builds Prior to v3.0.2: The `bad_node_patch.json` file contains a dictionary where the outer keys are `entity_uri` values and the outer values are dictionaries whose keys are `label` and `description/definition` and whose values are the updated strings without foreign characters. An example of this dictionary is shown below:
key = '<http://purl.obolibrary.org/obo/UBERON_0000468>'
print(bad_node_patch[key])
>>> {'label': 'multicellular organism', 'description/definition': 'Anatomical structure that is an individual member of a species and consists of more than one cell.'}
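One way the patch dictionary could be applied to a table of node metadata is sketched below. This is an illustration only, not the code used in the PR: the helper `apply_node_patch`, the sample rows, and the assumption that the node table has `entity_uri`, `label`, and `description/definition` columns are all hypothetical.

```python
import pandas as pd

# Stand-in for the loaded patch file. In practice this would come from:
#     import json
#     with open('bad_node_patch.json') as f:
#         bad_node_patch = json.load(f)
bad_node_patch = {
    '<http://purl.obolibrary.org/obo/UBERON_0000468>': {
        'label': 'multicellular organism',
        'description/definition': 'Anatomical structure that is an individual '
                                  'member of a species and consists of more than one cell.',
    }
}


def apply_node_patch(nodes, patch, key_col='entity_uri'):
    """Return a copy of `nodes` with patched strings (hypothetical helper)."""
    patched = nodes.copy()
    for uri, fields in patch.items():
        mask = patched[key_col] == uri
        for column, value in fields.items():
            # only overwrite columns that actually exist in the table
            if column in patched.columns:
                patched.loc[mask, column] = value
    return patched


# Made-up sample rows; the column names are an assumption about the node table.
nodes = pd.DataFrame({
    'entity_uri': ['<http://purl.obolibrary.org/obo/UBERON_0000468>',
                   '<http://example.org/node/unaffected>'],
    'label': ['multicellular organism 多細胞生物', 'unaffected node'],
    'description/definition': ['placeholder text 定義', 'placeholder text'],
})

patched = apply_node_patch(nodes, bad_node_patch)
print(patched[['entity_uri', 'label']])
```

Rows whose `entity_uri` is not in the patch are left untouched, and the original DataFrame is not modified.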
The code to identify the nodes with erroneous foreign characters is shown below:
import re
import pandas as pd
# path to the downloaded `NodeLabels.txt` file
input_file = 'NodeLabels.txt'
# load data as Pandas DataFrame
nodedf = pd.read_csv(input_file, sep='\t', header=0)
# identify bad nodes and filter DataFrame so it only contains these rows
# (store a boolean rather than the match object so drop_duplicates can compare rows)
nodedf['bad'] = nodedf['label'].apply(lambda x: bool(re.search("[\u4e00-\u9FFF]", x)) if not pd.isna(x) else False)
nodedf_bad_nodes = nodedf[nodedf['bad']].drop_duplicates()
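For reference, the character class `[\u4e00-\u9FFF]` used above spans the Unicode CJK Unified Ideographs block, so it flags Chinese characters and Japanese kanji but not, for example, kana or accented Latin letters. The short sketch below illustrates that coverage; the helper name `has_cjk_ideograph` and the sample strings are made up for this example.

```python
import re

# Same character class as in the snippet above: CJK Unified Ideographs.
CJK_PATTERN = re.compile("[\u4e00-\u9fff]")


def has_cjk_ideograph(text):
    """True if `text` is a string containing at least one CJK ideograph."""
    return isinstance(text, str) and CJK_PATTERN.search(text) is not None


# Made-up sample labels showing what the pattern does and does not flag.
samples = [
    'multicellular organism',             # plain ASCII
    'multicellular organism 多細胞生物',   # contains CJK ideographs
    'ひらがな',                            # hiragana only, outside the range
    'café',                               # accented Latin, outside the range
    None,                                 # missing label
]

for label in samples:
    print(repr(label), has_cjk_ideograph(label))
```

If other scripts turn up in the labels, the character class would need to be widened accordingly.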
@ChuckKollar - Just wanted to call your attention to this PR as well, which resulted in some minor changes to the notebooks/OWLNETS_Example_Application.ipynb file. See here for details. I am happy to make a PR to your repos in the future for changes like this if that would be helpful. Just let me know! 😄
Purpose
This PR addresses two primary issues: (i) the bug regarding foreign characters included in node labels and definitions described in issue #118 and (ii) the discovery that the HGNC FTP site has switched to http (ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt → http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt). This change also triggered an update in the way that URLs are parsed to determine the download and associated site type.
Scripts Impacted
Bug
builds/data_preprocessing.py
notebooks/Data_Preparation.ipynb
notebooks/OWLNETS_Example_Application.ipynb
pkt_kg/metadata.py
pkt_kg/utils/kg_utils.py
FTP Link
builds/data_to_download.txt
pkt_kg/utils/data_utils.py
tests/test_data_utils_downloading.py
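To illustrate the URL-parsing point from the Purpose section: after the switch, the HGNC URL still has a host name beginning with `ftp.` but is served over http, so the site type has to be read from the URL scheme rather than the host. The sketch below shows one way to do that with the standard library. It is not the implementation in pkt_kg/utils/data_utils.py, and the function name `site_type` is hypothetical.

```python
from urllib.parse import urlparse


def site_type(url):
    """Classify a download URL by its scheme (hypothetical helper)."""
    scheme = urlparse(url).scheme.lower()
    if scheme == 'ftp':
        return 'ftp'
    if scheme in ('http', 'https'):
        return 'http'
    raise ValueError(f'unsupported URL scheme: {scheme!r}')


old_url = 'ftp://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt'
new_url = 'http://ftp.ebi.ac.uk/pub/databases/genenames/hgnc/tsv/hgnc_complete_set.txt'

# Both URLs share the host ftp.ebi.ac.uk; only the scheme differs.
print(site_type(old_url), urlparse(old_url).netloc)
print(site_type(new_url), urlparse(new_url).netloc)
```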