callahantiff / PheKnowLator

PheKnowLator: Heterogeneous Biomedical Knowledge Graphs and Benchmarks Constructed Under Alternative Semantic Models
https://github.com/callahantiff/PheKnowLator/wiki
Apache License 2.0
Handling Unicode Encoding Errors in Ontology Metadata #50

Closed callahantiff closed 4 years ago

callahantiff commented 4 years ago

Problem: Unicode errors occur when writing knowledge graph metadata to local files, depending on the OS and Python version used.

Script: metadata.py

Current Solution: encode/decode ontology term labels, definitions, and synonyms and explicitly ignore UnicodeEncodeError.
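
For context, here is a minimal sketch of what an encode/decode round trip with ignored errors does to a string. The label below is made up for illustration, and this is not the actual code in metadata.py; it only shows why silently ignoring errors loses information, while a UTF-8 round trip does not.

```python
# Hypothetical ontology label containing non-ASCII characters.
label = "Sjögren syndrome (β-variant)"

# Round trip through ASCII, silently dropping anything that cannot be encoded.
cleaned = label.encode("ascii", errors="ignore").decode("ascii")
print(cleaned)  # Sjgren syndrome (-variant)

# Round trip through UTF-8 keeps every character.
preserved = label.encode("utf-8").decode("utf-8")
print(preserved == label)  # True
```

The first result no longer matches the source label, which is the main drawback of ignoring the error rather than fixing the encoding.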

Proposed Solution: add functionality to handle UnicodeEncodeError properly instead of ignoring it.

ignaciot commented 4 years ago

I've generally had success with this (ugly) method of making UTF-8 the default encoding:

import sys
reload(sys)
sys.setdefaultencoding('utf8')

It's worth noting this is generally discouraged for reasons well explained here.

callahantiff commented 4 years ago

Thanks for the suggestion! I think this really only applies to Python 2, but it's good to know about!
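
In Python 3, `sys.setdefaultencoding` no longer exists, and the usual alternative is to pass the encoding explicitly wherever a file is opened. The sketch below assumes a tab-separated metadata file; the `write_metadata` helper, the file name, and the `EX_` identifiers are hypothetical and not taken from pkt_kg.

```python
import os
import tempfile


def write_metadata(path, rows):
    """Write tab-separated rows as UTF-8, regardless of the OS default encoding."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        for row in rows:
            handle.write("\t".join(row) + "\n")


# Made-up identifiers and labels, for illustration only.
rows = [("EX_0000001", "Sjögren syndrome"), ("EX_0000002", "β-thalassemia")]

out_path = os.path.join(tempfile.mkdtemp(), "metadata.txt")
write_metadata(out_path, rows)

# Read the file back with the same explicit encoding to confirm nothing was lost.
with open(out_path, encoding="utf-8", newline="") as handle:
    read_back = [tuple(line.rstrip("\n").split("\t")) for line in handle]

print(read_back == rows)  # True
```

Because the encoding is stated at the call site, the result does not depend on the locale or on environment variables.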

I believe I have a solid solution now (testing at scale as we speak) and will post it here once the test finishes.

callahantiff commented 4 years ago

OK, I have the solution, which will work for all Unicode characters, including characters from non-English languages. The changes I made to each script are described below.


Dockerfile

RUN export PYTHONIOENCODING=utf-8
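
One caveat worth checking: each `RUN` instruction runs in its own shell, so a variable set with `export` there is not visible in later build steps or in the running container; the Dockerfile `ENV` instruction is the persistent form. `PYTHONIOENCODING` also only controls stdin, stdout, and stderr, not files opened with `open()`. A rough way to confirm what the interpreter actually sees is sketched below; `child_stdout_encoding` is a hypothetical helper, not part of pkt_kg.

```python
import os
import subprocess
import sys


def child_stdout_encoding(extra_env):
    """Start a fresh interpreter with extra environment variables and
    report the encoding it uses for standard output."""
    env = dict(os.environ, **extra_env)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# With the variable set, the child interpreter reports UTF-8 for stdout.
print(child_stdout_encoding({"PYTHONIOENCODING": "utf-8"}))
```

Running the one-line check (`import sys; print(sys.stdout.encoding)`) inside the built container shows whether the setting survived the build.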


pkt_kg/metadata.py


I will close this issue now; feel free to reopen it if need be.