Handling Unicode Encoding Errors in Ontology Metadata

callahantiff / PheKnowLator

PheKnowLator: Heterogeneous Biomedical Knowledge Graphs and Benchmarks Constructed Under Alternative Semantic Models

https://github.com/callahantiff/PheKnowLator/wiki

Apache License 2.0

Handling Unicode Encoding Errors in Ontology Metadata #50

callahantiff closed this issue 4 years ago

callahantiff commented 4 years ago

Problem: Unicode errors occur when writing knowledge graph metadata to a local file, depending on the OS and Python version used.

Script: metadata.py

Current Solution: encode/decode ontology term labels, definitions, and synonyms and explicitly ignore UnicodeEncodeError.

Proposed Solution: Add functionality to handle UnicodeEncodeError properly instead of ignoring it.
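To make the trade-off concrete, here is a minimal sketch of what ignoring the error costs. The label is made up, and the ASCII target is an assumption standing in for whatever default encoding the OS supplies; this is not the code in metadata.py.

```python
# Hypothetical ontology label containing non-ASCII characters.
label = "Sjögren syndrome (β-cell related)"

# Encoding under an ASCII default fails outright.
try:
    label.encode("ascii")
    raised = False
except UnicodeEncodeError:
    raised = True

# Encoding with errors="ignore" avoids the exception, but it silently
# drops every character the target encoding cannot represent.
lossy = label.encode("ascii", errors="ignore").decode("ascii")

# Encoding as UTF-8 round-trips the label unchanged.
round_trip = label.encode("utf-8").decode("utf-8")

print(raised)
print(lossy)
print(round_trip == label)
```

The lossy version no longer matches the source ontology's label, which is the main reason to prefer an explicit UTF-8 write over suppressing the error.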

ignaciot commented 4 years ago

I've generally had success with this (ugly) method, to default reading input as UTF8:

# Python 2 only: reload() restores setdefaultencoding, which site.py deletes at startup
import sys
reload(sys)
sys.setdefaultencoding('utf8')

It's worth noting this is generally discouraged for reasons well explained here.

callahantiff commented 4 years ago

Thanks for the suggestion! I think this really only applies to Python 2, but it's good to know about!
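A quick way to confirm this on Python 3 is sketched below. It is only a check, not part of the project code, and the variable names are illustrative.

```python
import sys

# Python 3 has no sys.setdefaultencoding, and its default string
# encoding is always UTF-8.
has_setter = hasattr(sys, "setdefaultencoding")
default_encoding = sys.getdefaultencoding()

print(has_setter)
print(default_encoding)
```

Note that this default covers str/bytes conversions only; the encoding used by open() without an explicit encoding argument still depends on the platform locale, which is consistent with the errors varying by OS.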

I believe I have a solid solution now (testing at scale as we speak) and will post it here once the test finishes.

callahantiff commented 4 years ago

OK, I have the solution, which works for all Unicode characters, including those in non-English text. The changes I made are described below for each changed script.

Dockerfile

Add the following line to ensure that the Python environment within the Docker container uses the correct encoding (ENV is used because a variable set with RUN export lasts only for that build step):

ENV PYTHONIOENCODING=utf-8
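To check that the variable actually reaches the interpreter, a small sketch like the following can be run. It launches a child Python process with PYTHONIOENCODING set, as an approximation of the container environment, and reads back the encoding of its standard output.

```python
import os
import subprocess
import sys

# Copy the current environment and set the variable, approximating
# what the container would provide.
env = dict(os.environ, PYTHONIOENCODING="utf-8")

# Ask a child interpreter which encoding its stdout uses.
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
stdout_encoding = result.stdout.strip()
print(stdout_encoding)
```

PYTHONIOENCODING governs stdin, stdout, and stderr only. Files opened with open() still need an explicit encoding argument, which is why the metadata.py change is also required.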

pkt_kg/metadata.py

Modified the output_knowledge_graph_metadata() method to:
- Force file writing to use UTF-8 encoding
- Add error handling that properly encodes and decodes variables that trigger a UnicodeEncodeError
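The two changes above might look roughly like the sketch below. The function name write_metadata_rows, the row layout, and the backslashreplace fallback are all assumptions for illustration; the actual output_knowledge_graph_metadata() implementation may differ.

```python
import os
import tempfile


def write_metadata_rows(path, rows):
    """Write tab-separated rows to path as UTF-8, regardless of OS defaults."""
    with open(path, "w", encoding="utf-8") as out:
        for row in rows:
            # Fallback for values UTF-8 still cannot encode (for example
            # lone surrogates): keep them as visible escapes instead of
            # dropping them or raising UnicodeEncodeError.
            safe = [
                str(value).encode("utf-8", errors="backslashreplace").decode("utf-8")
                for value in row
            ]
            out.write("\t".join(safe) + "\n")


# Hypothetical rows: one ordinary non-ASCII label, one malformed label.
rows = [("TERM_1", "Sjögren syndrome"), ("TERM_2", "broken\ud800label")]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "metadata.txt")
    write_metadata_rows(path, rows)
    with open(path, encoding="utf-8") as handle:
        written = handle.read().splitlines()

print(written)
```

Passing encoding="utf-8" to open() removes the dependence on the OS locale, and the escape fallback keeps malformed input inspectable in the output file.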

Will close this issue now; feel free to re-open if need be.