USM-CHU-FGuyon / BlendedICU

OMOP standardization pipeline for ICU databases
MIT License

UnicodeDecodeError when running 'python 1_extract_mimic3.py' #9

Closed: xinyuejohn closed this issue 7 months ago

xinyuejohn commented 7 months ago

Hi, thanks very much for your work and your code!

Error

I was trying to run python 1_extract_mimic3.py and I encountered this error: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xef in position 18312: invalid continuation byte.

  File "/Users/xinyuezhang/BlendedICU/database_processing/dataprocessor.py", line 36, in __init__
    self.ohdsi_med = self._read_json(self.med_file)

Environment

Code: latest version
Environment: latest version specified by env.yml

Possible Reason

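The JSON file (medications_v10.json) that the function is trying to load is not encoded in UTF-8, which is the encoding Python used to decode it here. json.load itself only reads from the already-opened text file; the encoding comes from open(), whose default depends on the platform.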
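
To illustrate the suspected cause, here is a minimal sketch that reproduces the same kind of error on a hypothetical sample string (not the actual contents of medications_v10.json). It assumes the file is ISO-8859-1, where "ï" is stored as the single byte 0xEF; UTF-8 treats 0xEF as the start of a three-byte sequence and fails when the next byte is plain ASCII.

```python
# Hypothetical sample text: "naive" with a diaeresis on the i (U+00EF),
# encoded as ISO-8859-1 so that the accented letter becomes the byte 0xEF.
raw = "na\u00efve".encode("iso-8859-1")

try:
    raw.decode("utf-8")
    message = "decoded without error"
except UnicodeDecodeError as err:
    # UTF-8 expects continuation bytes after 0xEF and finds "v" instead.
    message = str(err)

print(message)

# Decoding the same bytes as ISO-8859-1 recovers the original text.
recovered = raw.decode("iso-8859-1")
print(recovered)
```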

Possible Solution

Solution 1

Re-encode medications_v10.json in UTF-8.
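
A rough sketch of how the re-encoding could be done with the standard library. The helper name reencode_to_utf8 and the output file name in the usage comment are hypothetical, and the default source encoding of ISO-8859-1 is an assumption that should be checked against the real file.

```python
from pathlib import Path


def reencode_to_utf8(src, dst, source_encoding="iso-8859-1"):
    """Read src using source_encoding and write the same text to dst as UTF-8.

    source_encoding is an assumption; a wrong value produces garbled
    characters rather than an error, because ISO-8859-1 accepts any byte.
    """
    text = Path(src).read_text(encoding=source_encoding)
    Path(dst).write_text(text, encoding="utf-8")


# Example call (paths are placeholders, adjust to the local checkout):
# reencode_to_utf8("medications_v10.json", "medications_v10_utf8.json")
```

Writing to a separate file keeps the original untouched until the result has been checked.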

Solution 2

Revise the _read_json function to detect the encoding automatically with the chardet library. The function first opens the file in binary mode ('rb') and passes the raw bytes to chardet to guess the encoding. It then re-opens the file in text mode with the detected encoding to parse the JSON.

import json
import chardet

def _read_json(self, pth):
    # Detect encoding
    with open(pth, 'rb') as file:
        detected_encoding = chardet.detect(file.read())['encoding']

    # Now read the file with the detected encoding
    with open(pth, 'r', encoding=detected_encoding) as file:
        return json.load(file)
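
If adding a dependency is not desirable, a standard-library variant of the same idea might look like the sketch below. This is not the code that was committed to the repository; the function name read_json_with_fallback and the candidate encoding list are assumptions. It reads the bytes once and tries each encoding in order.

```python
import json


def read_json_with_fallback(pth, encodings=("utf-8-sig", "iso-8859-1")):
    """Parse a JSON file, trying each candidate encoding in turn.

    "utf-8-sig" handles UTF-8 with or without a byte order mark.
    ISO-8859-1 can decode any byte sequence, so it acts as a catch-all
    and should stay last in the list.
    """
    with open(pth, "rb") as file:
        raw = file.read()

    for encoding in encodings:
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # Not this encoding, try the next candidate.
        return json.loads(text)

    raise ValueError(f"Could not decode {pth} with any of {encodings}")
```

Unlike chardet, this does not guess among many encodings; it only covers the ones listed, which may be enough when the possible encodings of the input files are known.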

USM-CHU-FGuyon commented 7 months ago

For some reason I have no trouble running json.load(file) when the file is encoded in ISO-8859-1... Anyway, I like Solution 2 and have just committed it. Thank you for your interest in our work!