jseager7 opened this issue 2 years ago
@jseager7 That is an interesting observation. Could you please make available the python code snippet throwing the error?
@martin2urban Here's the code. You'll need to run it in the 'releases' folder of this repository. It will print the filenames for all files that fail to decode as UTF-8.
```python
import glob

release_files = glob.glob(
    'phi-base_v[0-9]-[0-9][0-9]'
    '_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9].csv'
)
for filename in release_files:
    with open(filename, encoding='utf8') as file:
        try:
            file.read()
        except UnicodeDecodeError:
            print(filename)
```
Just an update on this: the following two files fail to load even with Windows-1252 encoding:
phi-base_v4-01_2016-05-01.csv fails because a lowercase u with umlaut (ü) in Record 6166 is encoded incorrectly:
... Candida dubliniensis,no data found,W�284,Erythematous candidiasis,Birds, ...
(It should be Wü284)
phi-base_v4-05_2018-05-15.csv fails because it contains a mix of invalid UTF-8 bytes and valid UTF-8 characters. For example, byte `0xC2` at position 14829 looks like a leftover byte from a non-breaking space (`0xC2A0`).
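A byte offset like that can be located programmatically rather than by inspection. Here's a minimal sketch (the function name is my own) that reports the position and value of the first byte that fails to decode:

```python
def first_decode_error(filename):
    """Return (offset, byte value) of the first UTF-8 decode failure,
    or None if the whole file decodes cleanly."""
    with open(filename, 'rb') as f:
        data = f.read()
    try:
        data.decode('utf8')
    except UnicodeDecodeError as error:
        return error.start, data[error.start]
    return None
```

`UnicodeDecodeError` exposes the failing offset via its `start` attribute, so there's no need to scan the file byte by byte.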
I removed invalid UTF-8 characters from all files mentioned above.
@martin2urban Sorry about the late reply but I only just noticed this. It's not safe to simply remove the invalid characters: by doing this, you're changing the names of some strains.
For example, removing the ü from Wü284 makes it W284, which isn't the same strain name. Either the character ü should be replaced with u (without umlaut) or the files should be properly re-encoded as UTF-8.
I'll fix the problem with PHI-base version 4.1 but I'll also need to make sure that removing characters from version 4.5 hasn't caused problems.
I've looked at some of the other files modified by removing non-ASCII characters and there are too many data errors introduced by this change, so I've reverted the commit.
As I mentioned before, the only way we can fix the encoding errors properly is by copying values from the original PHI-base spreadsheets for these versions.
Trying to open the PHI-base 4.12 CSV file as UTF-8 (in Python) throws an error because the file is not valid UTF-8.
I'm not completely sure what encoding the files use, but the `cp1252` encoding doesn't throw any errors (that's the Windows-1252 encoding, a legacy default for many Windows components). Windows-1252 isn't appropriate for PHI-base now (if it ever was), because some columns (e.g. 'Pathogen strain' and 'Host strain') contain characters outside of the Windows-1252 encoding range, such as the delta symbol (Δ). These symbols end up replaced with question marks. Here's an example from the PHI-base 4.12 CSV:
@martin2urban What program did you use to generate these CSV files? If I remember correctly, Microsoft Excel doesn't default to UTF-8 when saving as CSV and has to be manually configured to save in UTF-8 encoding.
Here's the list of files that fail to load as UTF-8:
We should really convert these files to UTF-8 by regenerating them from the original datasets (if possible).
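Where a file turns out to be consistently Windows-1252 (rather than a mix of encodings), transcoding it is straightforward. This is only a sketch, and it's only safe once the source encoding has been confirmed, since decoding genuine UTF-8 data as cp1252 would mangle characters like Δ:

```python
def reencode(src_path, dst_path, src_encoding='cp1252'):
    """Decode a file with a legacy encoding and rewrite it as UTF-8."""
    with open(src_path, encoding=src_encoding) as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf8') as dst:
        dst.write(text)
```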
For completeness, here's the list of valid files (those that are either UTF-8 encoded or contain no characters outside the ASCII character set):