beoutbreakprepared / nCoV2019

Location for summaries and analysis of data related to n-CoV 2019, first reported in Wuhan, China
MIT License

Inconsistent character encoding across data files #37

Closed · deepyaman closed this 4 years ago

deepyaman commented 4 years ago

ncov_hubei.csv is UTF-8-encoded, but ncov_outside_hubei.csv is cp1252-encoded. As a result, pandas throws an error if you try to load ncov_outside_hubei.csv with the default encoding:

'utf-8' codec can't decode byte 0xa0 in position 162456: invalid start byte

It would make sense to encode them consistently (preferred) or document the different encodings for each file.
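
In the meantime, a workaround is to pass the encoding explicitly when loading the second file (a minimal sketch; the file names and encodings are the ones described above):

    import pandas as pd

    # ncov_hubei.csv is UTF-8, so the default encoding works.
    hubei = pd.read_csv("ncov_hubei.csv")

    # ncov_outside_hubei.csv currently needs its encoding spelled out.
    outside_hubei = pd.read_csv("ncov_outside_hubei.csv", encoding="cp1252")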

aezarebski commented 4 years ago

Thanks, this is a good spot! I see you've made a pull request with the data corrected; is that a vim command? Do you know if there is a neat way to script this without vim?

@beoutbreakprepared this should be followed up on to make the data easier to use.

deepyaman commented 4 years ago

@aezarebski @beoutbreakprepared Sorry, I just saw your comment last night! Here's a one-liner that detects the encoding of each CSV file and converts it to UTF-8:

find . -name '*.csv' -type f -exec bash -c 'iconv --from-code=$(file -b --mime-encoding "$0" | sed 's/^us-ascii$/utf-8/') --to-code=UTF-8 "$0" > "$0".iconv && mv "$0".iconv "$0"' {} \;

Breaking it down:

  1. Run the command for every file with a .csv extension.

    find . -name '*.csv' -type f -exec bash -c '<command>' {} \;

  2. Convert from some source encoding to UTF-8, write the result to a file with an .iconv extension, and replace the original (the intermediate step is needed so that you don't overwrite the file before it has been fully read).

    iconv --from-code=$(...) --to-code=UTF-8 "$0" > "$0".iconv && mv "$0".iconv "$0"

  3. Detect the file encoding. file reports us-ascii for some files with (UTF-8-compliant) Chinese characters, presumably because it only inspects the beginning of each file, so we force those to utf-8.

    file -b --mime-encoding "$0" | sed 's/^us-ascii$/utf-8/'
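
And if you'd rather script it without the shell at all, here's a rough Python equivalent (a sketch only: instead of detecting the source encoding the way file does, it tries UTF-8 first and falls back to cp1252, which covers the two encodings seen in this repo):

    from pathlib import Path

    # Re-encode every CSV under the current directory as UTF-8.
    for path in Path(".").rglob("*.csv"):
        raw = path.read_bytes()
        try:
            text = raw.decode("utf-8")  # already UTF-8 (or plain ASCII)
        except UnicodeDecodeError:
            text = raw.decode("cp1252")  # assumed fallback encoding
        path.write_text(text, encoding="utf-8")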
deepyaman commented 4 years ago

I've only shared the changed files in #48, but there are a few ways to keep the encodings consistent going forward:

  1. Set up a pre-commit hook that verifies only UTF-8 data files are being added (a rough sketch follows this list).
  2. Keep the one-liner above as a formatting script.
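
For the first option, something like the following could live at .git/hooks/pre-commit (a hypothetical sketch: it re-decodes each staged CSV and rejects the commit if any of them is not valid UTF-8):

    #!/usr/bin/env python3
    # Hypothetical pre-commit hook: reject staged CSVs that are not valid UTF-8.
    import subprocess
    import sys

    staged = subprocess.run(
        ["git", "diff", "--cached", "--name-only", "--diff-filter=ACM"],
        capture_output=True, text=True, check=True,
    ).stdout.splitlines()

    bad = []
    for name in staged:
        if not name.endswith(".csv"):
            continue
        # `git show :<path>` prints the staged version of the file.
        blob = subprocess.run(
            ["git", "show", f":{name}"], capture_output=True, check=True
        ).stdout
        try:
            blob.decode("utf-8")
        except UnicodeDecodeError:
            bad.append(name)

    if bad:
        print("Not valid UTF-8:", *bad, sep="\n  ")
        sys.exit(1)

The same script would also work as a CI check if you ever want one.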

It's also possible to add CI to validate this, but I don't know how much of this you guys want...

Happy to help if there's anything on this front!

aezarebski commented 4 years ago

Thanks @deepyaman, this is great! Yes, a pre-commit hook and CI might be a bit overboard for the time being, but putting this in a script sounds like a good solution.

@t-brewer would you mind including the one-liner from the comment above in your scripts so that the output is all UTF-8 encoded, please? If you anticipate problems using UTF-8, then ASCII might be sufficient (I think I'm getting that the right way around), but it's probably best to avoid some old Windows encoding if we can.