PHI-base / data

Archives of PHI-base data releases, and other data.
Creative Commons Attribution 4.0 International

PHI-base CSV releases should use UTF-8 encoding #13

Open jseager7 opened 2 years ago

jseager7 commented 2 years ago

Trying to open the PHI-base 4.12 CSV file as UTF-8 (in Python) throws an error because the file is not valid UTF-8.

I'm not completely sure what encoding the files use, but using cp1252 encoding doesn't throw any errors (that's the Windows-1252 encoding, a legacy default for many Windows components).
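For reference, the failure mode is easy to reproduce on a small byte string (a made-up example, not actual PHI-base data): a byte such as 0x92 is a curly apostrophe in Windows-1252 but is not a valid sequence start in UTF-8.

```python
# Hypothetical example: 0x92 is a right single quote in Windows-1252,
# but in UTF-8 it is a continuation byte that cannot start a sequence.
data = b'Fusarium\x92s host'

try:
    data.decode('utf8')
except UnicodeDecodeError as e:
    print('not UTF-8:', e.reason)

# Decoding as Windows-1252 never fails, because every byte maps to
# some character (which is why it silently "works" on these files).
print(data.decode('cp1252'))
```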

Windows-1252 isn't appropriate for PHI-base now (if it ever was) because some columns (e.g. 'Pathogen strain' and 'Host strain') contain characters outside of the Windows-1252 encoding range, such as the delta symbol (Δ). These symbols end up replaced with question marks. Here's an example from the PHI-base 4.12 CSV:

Record ID                        Record 11248
Pathogen strain    CA14 (?ku70 ?pyrG::AfpyrG)
Name: 11247, dtype: object
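The question marks are consistent with the strain name having been encoded as Windows-1252 with replacement of unsupported characters. A quick check (assuming the original value used Δ, as the deletion-mutant notation suggests):

```python
strain = 'CA14 (\u0394ku70 \u0394pyrG::AfpyrG)'

# Δ (U+0394) has no code point in Windows-1252, so encoding with
# errors='replace' substitutes a question mark for each occurrence.
encoded = strain.encode('cp1252', errors='replace')
print(encoded.decode('cp1252'))  # → CA14 (?ku70 ?pyrG::AfpyrG)
```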

@martin2urban What program did you use to generate these CSV files? If I remember correctly, Microsoft Excel doesn't default to UTF-8 when saving as CSV and has to be manually configured to save in UTF-8 encoding.

Here's the list of files that fail to load as UTF-8:

We should really convert these files to UTF-8 by regenerating them from the original datasets (if possible).
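For files that are at least valid Windows-1252, a straightforward transcode would fix the encoding itself (a sketch with hypothetical paths; note it cannot restore characters like Δ that were already replaced by '?', which is why regenerating from the original datasets is preferable):

```python
def transcode_to_utf8(src_path, dest_path):
    # Read the file assuming Windows-1252, then write it back as UTF-8.
    # Characters already lost to '?' replacement cannot be recovered.
    with open(src_path, encoding='cp1252') as src:
        text = src.read()
    with open(dest_path, 'w', encoding='utf8') as dest:
        dest.write(text)
```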


For completeness, here's the list of valid files (those that are either UTF-8 encoded or contain no characters outside the ASCII character set):

martin2urban commented 2 years ago

@jseager7 That is an interesting observation. Could you please share the Python code snippet that throws the error?

jseager7 commented 2 years ago

@martin2urban Here's the code. You'll need to run it in the 'releases' folder of this repository. It will print the filenames for all files that fail to decode as UTF-8.

import glob

# Match release files named like phi-base_v4-01_2016-05-01.csv
release_files = glob.glob(
    'phi-base_v[0-9]-[0-9][0-9]'
    '_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].csv'
)

# Print the name of every release file that is not valid UTF-8.
for filename in release_files:
    with open(filename, encoding='utf8') as file:
        try:
            file.read()
        except UnicodeDecodeError:
            print(filename)

jseager7 commented 2 years ago

Just an update on this: the following two files fail to load even with Windows-1252 encoding:

phi-base_v4-01_2016-05-01.csv fails because a lowercase u with umlaut (ü) in Record 6166 is encoded incorrectly:

... Candida dubliniensis,no data found,W�284,Erythematous candidiasis,Birds, ...

(It should be Wü284)

phi-base_v4-05_2018-05-15.csv fails because it contains a mix of invalid UTF-8 bytes and valid UTF-8 characters. For example, byte 0xC2 at position 14829 looks like a leftover byte from a non-breaking space (0xC2A0).
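Byte offsets like the one above can be read straight off the `UnicodeDecodeError` itself. A sketch (the helper name is made up) that scans past each failure to report every invalid byte in a buffer:

```python
def find_bad_bytes(data):
    """Return (offset, byte value) for each byte that breaks UTF-8 decoding."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode('utf8')
            break  # the remainder decodes cleanly
        except UnicodeDecodeError as e:
            offset = pos + e.start  # e.start is relative to the slice
            bad.append((offset, data[offset]))
            pos = offset + 1  # skip the invalid byte and keep scanning
    return bad

# e.g. a stray 0xC2, the lead byte of a UTF-8 non-breaking space (0xC2 0xA0)
print(find_bad_bytes(b'abc\xc2 def'))  # → [(3, 194)]
```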

martin2urban commented 1 year ago

I removed invalid UTF-8 characters from all files mentioned above.

jseager7 commented 1 year ago

@martin2urban Sorry about the late reply but I only just noticed this. It's not safe to simply remove the invalid characters: by doing this, you're changing the names of some strains.

For example, removing the ü from Wü284 makes it W284, which isn't the same strain name. Either the character ü should be replaced with u (without umlaut) or the files should be properly re-encoded as UTF-8.
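For the Wü284 case, decoding the original bytes rather than deleting them preserves the name, assuming the offending byte is 0xFC (ü in Windows-1252):

```python
# 0xFC is ü in Windows-1252; deleting the byte would yield the wrong
# strain name W284 instead of Wü284.
raw = b'W\xfc284'
fixed = raw.decode('cp1252')       # 'Wü284'
utf8_bytes = fixed.encode('utf8')  # b'W\xc3\xbc284', valid UTF-8
print(fixed)
```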

I'll fix the problem with PHI-base version 4.1 but I'll also need to make sure that removing characters from version 4.5 hasn't caused problems.

jseager7 commented 1 year ago

I've looked at some of the other files modified by removing non-ASCII characters and there are too many data errors introduced by this change, so I've reverted the commit.

As I mentioned before, the only way we can fix the encoding errors properly is by copying values from the original PHI-base spreadsheets for these versions.