Closed deepyaman closed 4 years ago
Thanks, this is a good spot! I see you've made a pull request for the change to the data after correcting this. Is that a vim command? Do you know if there is a neat way to script this without vim?
@beoutbreakprepared this should be followed up upon to make the data easier to use.
@aezarebski @beoutbreakprepared Sorry, I just saw your comment last night! Here's a one-liner that will detect the encoding of and convert all files:
```sh
find . -name '*.csv' -type f -exec bash -c 'iconv --from-code=$(file -b --mime-encoding "$0" | sed "s/^us-ascii$/utf-8/") --to-code=UTF-8 "$0" > "$0".iconv && mv "$0".iconv "$0"' {} \;
```
Breaking it down:
Run the command for all files with a `.csv` extension.

```sh
find . -name '*.csv' -type f -exec bash -c '<command>' {} \;
```
Convert from some source encoding to UTF-8, write the result to a file with an `.iconv` extension, and replace the original (the intermediate step is required so that you don't overwrite the file before you can process it).

```sh
iconv --from-code=$(...) --to-code=UTF-8 "$0" > "$0".iconv && mv "$0".iconv "$0"
```
Detect the file encoding. `file` detects `us-ascii` for some files with (UTF-8-compliant) Chinese characters, so we force those to `utf-8`.

```sh
file -b --mime-encoding "$0" | sed 's/^us-ascii$/utf-8/'
```
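Putting the same steps into a small script is straightforward; a minimal sketch (the `fix_csv_encoding` function name is hypothetical, and it assumes `file` and `iconv` are installed):

```sh
# Hypothetical helper: re-encode every .csv under a directory to UTF-8 in place.
fix_csv_encoding() {
  dir="${1:-.}"
  find "$dir" -name '*.csv' -type f | while read -r f; do
    # file(1) reports us-ascii for pure-ASCII content; ASCII is valid UTF-8,
    # so map it to utf-8 before handing it to iconv.
    src=$(file -b --mime-encoding "$f" | sed 's/^us-ascii$/utf-8/')
    # Write to a temporary file first so the input isn't clobbered mid-read.
    iconv --from-code="$src" --to-code=UTF-8 "$f" > "$f.iconv" && mv "$f.iconv" "$f"
  done
}
```

Running `fix_csv_encoding data/` (directory name is just an example) would then convert every CSV under that directory in place.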
I've only shared the changed files in #48, but there are a few ways to go about it.
It's also possible to add CI to validate this, but I don't know how much of this you guys want...
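For instance, a CI job could fail the build whenever a non-UTF-8 CSV slips in; a minimal sketch (the `check_csv_utf8` name is hypothetical, and it assumes `file` is available on the runner):

```sh
# Hypothetical CI check: return nonzero if any .csv is not UTF-8 (or plain
# ASCII, which is a subset of UTF-8), listing the offending files.
check_csv_utf8() {
  bad=$(find "${1:-.}" -name '*.csv' -type f -exec sh -c \
    'file -b --mime-encoding "$1" | grep -qxE "utf-8|us-ascii" || echo "$1"' _ {} \;)
  if [ -n "$bad" ]; then
    printf 'Not UTF-8 encoded:\n%s\n' "$bad"
    return 1
  fi
}
```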
Happy to help if there's anything on this front!
Thanks @deepyaman this is great! Yes, a pre-commit hook and CI might be a bit overboard for the time being, but putting this in a script sounds like a good solution.
@t-brewer would you mind incorporating the one-liner from the comment above into your scripts so that the output is all UTF-8-encoded, please? If you anticipate problems using UTF-8, then ASCII might be sufficient (I think I'm getting that the right way around), but probably best to avoid some old Windows encoding if we can.
`ncov_hubei.csv` is `utf-8`-encoded, but `ncov_outside_hubei.csv` is `cp1252`-encoded. This results in pandas throwing an error if you try to load `ncov_outside_hubei.csv` with the default `encoding`: `'utf-8' codec can't decode byte 0xa0 in position 162456: invalid start byte`
It would make sense to encode them consistently (preferred) or document the different encodings for each file.
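For illustration, the failure mode can be reproduced with the standard library alone (the byte string below is a stand-in, not the actual file contents): `0xa0` is a non-breaking space in cp1252 but an invalid start byte in UTF-8.

```python
# Stand-in for a cp1252-encoded row; 0xa0 is a cp1252 non-breaking space.
raw = b"city,count\nWuhan\xa0,10\n"

try:
    raw.decode("utf-8")
except UnicodeDecodeError as e:
    print(e.reason)  # -> invalid start byte

# Decoding with the correct source encoding succeeds:
text = raw.decode("cp1252")
print("\xa0" in text)  # -> True
```

Until the files are re-encoded, passing `encoding="cp1252"` to `pandas.read_csv` for `ncov_outside_hubei.csv` works the same way.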