callahantiff / PheKnowLator

PheKnowLator: Heterogeneous Biomedical Knowledge Graphs and Benchmarks Constructed Under Alternative Semantic Models
https://github.com/callahantiff/PheKnowLator/wiki
Apache License 2.0

Newline char in node description/definition causes dirty lines in node metadata files #116

Closed nomisto closed 2 years ago

nomisto commented 2 years ago

Describe the bug Hello, great work, and thanks for providing so much data! I recently discovered that the metadata files contain "dirty" lines, which may be the result of newline characters in the source of a node's description. This is not a breaking bug, but I wanted to let you know. So far I have checked version 2.0 (build_11FEB2021) and version 3.0 (build_02OCT2021).

To Reproduce Steps to reproduce the behavior:

  1. Download, e.g., https://storage.googleapis.com/pheknowlator/archived_builds/release_v3.0.0/build_02OCT2021/knowledge_graphs/instance_builds/relations_only/owlnets/PheKnowLator_v3.0.0_full_instance_relationsOnly_OWLNETS_NodeLabels.txt
  2. Search for 'http://purl.obolibrary.org/obo/VO_0000247' or go to line 310638 (this is one example; I found a few others)
  3. See error: lines 310639-310643 contain text belonging to the description of VO_0000247, possibly due to a newline character in the source.
NODES   369013  <http://purl.obolibrary.org/obo/VO_0000247> vaccine efficacy    Vaccine efficacy is an efficacy of a vaccine in induction of protective immune response in vivo or protection against infection of a virulent pathogen. 
Specifically, vaccine efficacy (VE) is the percentage reduction in disease incidence attributable to vaccination, calculated by means of the following equation:
VE(%) = (U - V)/U x 100
where U = the incidence in unvaccinated people and 
V = the incidence in vaccinated people.
Ref: Hadler TC, et al. Immunization in developing countries. In: Vaccines. Editors: Plotkin S, et al. 2008. p1542-71.   None
NODES   671778  <http://purl.obolibrary.org/obo/CHEBI_165329>   Dinor-PGD2  None    (Z)-5-[(1R,2R,5S)-5-hydroxy-2-[(E,3S)-3-hydroxyoct-1-enyl]-3-oxocyclopentyl]pent-3-enoic acid
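
For anyone who wants to locate the other affected records, here is a minimal sketch of one way to flag them. It assumes the file is tab-separated with six fields per record (entity type, integer id, URI, label, description, synonym), which is inferred from the sample above rather than confirmed by the project. The name `find_dirty_lines` and the sample rows are hypothetical.

```python
from typing import Iterable, List, Tuple

# Assumption: each record is a single tab-separated line with six fields
# (entity type, integer id, URI, label, description, synonym).
EXPECTED_FIELDS = 6


def find_dirty_lines(
    lines: Iterable[str], expected_fields: int = EXPECTED_FIELDS
) -> List[Tuple[int, int]]:
    """Return (1-based line number, field count) for lines with an unexpected field count."""
    dirty = []
    for number, line in enumerate(lines, start=1):
        count = len(line.rstrip("\n").split("\t"))
        if count != expected_fields:
            dirty.append((number, count))
    return dirty


# Hypothetical rows: the first record is split across two lines by a stray newline.
sample = [
    "NODES\t1\t<http://example.org/A>\tlabel A\tfirst part of a definition\n",
    "second part of the definition\tNone\n",
    "NODES\t2\t<http://example.org/B>\tlabel B\tNone\tsynonym B\n",
]
print(find_dirty_lines(sample))  # [(1, 5), (2, 2)]
```

To scan a downloaded file, pass an open file handle instead of the list, for example `find_dirty_lines(open(path, encoding="utf-8"))`. Both the truncated first line and its continuation lines are reported, since neither has the full set of fields.
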
callahantiff commented 2 years ago

Thanks for the heads up on this @nomisto! Looking into it now.

callahantiff commented 2 years ago

Hi @nomisto. Thanks again for pointing out this bug!

I have found and repaired the error in the codebase and pushed an update to PyPI. I am currently in the process of updating the node_metadata_dict.pkl and XXXX_NodeLabels.txt files for all v2.0.0 (excluding build_10MAY2020), v2.1.0, and v3.0.0 builds.
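
The thread does not show the actual change made to the codebase. As an illustration only, one common way to prevent this class of problem is to collapse embedded whitespace in every field before writing a tab-separated record. The helper names `clean_text` and `format_record` below are hypothetical and are not part of PheKnowLator, and writing missing values as `None` is an assumption based on the sample rows in this issue.

```python
import re


def clean_text(value):
    """Collapse runs of whitespace (newlines and tabs included) into single spaces."""
    if value is None:
        # Assumption: missing values are written as the literal string "None".
        return "None"
    return re.sub(r"\s+", " ", str(value)).strip()


def format_record(fields):
    """Join cleaned fields into one tab-separated line (without a trailing newline)."""
    return "\t".join(clean_text(field) for field in fields)


# Excerpt of the VO_0000247 definition from this issue, with its embedded newlines.
definition = (
    "Vaccine efficacy is an efficacy of a vaccine in induction of protective "
    "immune response in vivo or protection against infection of a virulent pathogen. \n"
    "Specifically, vaccine efficacy (VE) is the percentage reduction in disease "
    "incidence attributable to vaccination, calculated by means of the following equation:\n"
    "VE(%) = (U - V)/U x 100"
)

row = format_record(
    ["NODES", 369013, "<http://purl.obolibrary.org/obo/VO_0000247>",
     "vaccine efficacy", definition, None]
)
print(row)
```

Cleaning every field, rather than only the description, also guards against stray tabs that would shift columns in the output file.
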

I am happy to let you know when that processing is complete. I hope to have it done by Friday at the very latest (ideally by tomorrow).

callahantiff commented 2 years ago

@nomisto - Just to keep you updated on the progress, I have created a list of all of the builds I will be updating and I will check each box once it's complete and ready for use.

Updated Build Metadata

callahantiff commented 2 years ago

@nomisto - everything has been updated. Please feel free to close this issue if everything looks OK to you.

nomisto commented 2 years ago

Thanks, @callahantiff, for taking care of this so quickly. Everything looks good now!