everypolitician / compare_with_wikidata

Library for diffing Wikidata and CSVs
MIT License

Guess the CSV character encoding if none is specified in Content-Type #105

Open mhl opened 6 years ago

mhl commented 6 years ago

daff seems to raise an error when comparing strings whose encodings are UTF-8 and ASCII-8BIT. This typically arises when one of the CSV files being compared is returned with a Content-Type header that omits the charset directive, so the response body is treated as the 'just a sequence of bytes' encoding ASCII-8BIT. This commit changes the behaviour in that case: the character set is guessed using the charlock_holmes gem and the guessed encoding is used.
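For anyone reading along, here is a minimal, stdlib-only sketch of the failure mode and of the general shape of the fix. It is not the code in this commit: `with_guessed_encoding` is a hypothetical helper, and its "try UTF-8, else assume ISO-8859-1" rule is only a stand-in for the statistical detection that charlock_holmes performs.

```ruby
# Hypothetical helper (not part of this library): a crude stand-in for a
# real detector such as charlock_holmes.
def with_guessed_encoding(body)
  # Only guess when no charset was supplied, i.e. the body is raw bytes.
  return body unless body.encoding == Encoding::ASCII_8BIT

  as_utf8 = body.dup.force_encoding(Encoding::UTF_8)
  return as_utf8 if as_utf8.valid_encoding?

  # Assumption for this sketch: anything that is not valid UTF-8 is
  # treated as ISO-8859-1 and transcoded to UTF-8.
  body.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
end

wikidata_value = 'Zoë'   # UTF-8 string
csv_value = 'Zoé'.b      # ASCII-8BIT, as if no charset had been given

# Joining the two sides of a changed cell fails while the encodings differ.
raised =
  begin
    wikidata_value + '->' + csv_value
    false
  rescue Encoding::CompatibilityError
    true
  end

# After guessing the encoding, the same concatenation succeeds.
fixed = with_guessed_encoding(csv_value)
cell_change = wikidata_value + '->' + fixed
```

The error only appears when both strings contain non-ASCII bytes, which is why purely ASCII CSVs never hit it.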

Fixes #104

mhl commented 6 years ago

@tmtmtmtm In fact, the higher-level test I added (that of DiffOutputGenerator.comparison) wasn't failing for the right reason. It didn't actually trigger the "incompatible character encodings" error, because that error arose only when the output tried to format a change in a cell with ->; unfortunately the original version of that test generated a diff with --- and +++ output instead. I've fixed this now by making sure there is an ID in common between the two sides, so without the fix it fails for the right reason.
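To illustrate why the shared ID matters, here is a toy sketch. `toy_diff` is a hypothetical, heavily simplified row diff keyed on `:id`; it is not daff's algorithm and only mimics the -> / --- / +++ markers mentioned above.

```ruby
# Hypothetical, simplified row diff keyed on :id (not daff's real algorithm).
def toy_diff(left_rows, right_rows)
  left = left_rows.to_h { |row| [row[:id], row[:name]] }
  right = right_rows.to_h { |row| [row[:id], row[:name]] }

  (left.keys | right.keys).flat_map do |id|
    if left.key?(id) && right.key?(id)
      next [] if left[id] == right[id]

      # Both sides end up in one string, so mixed encodings can raise here.
      [['->', id, left[id] + '->' + right[id]]]
    elsif left.key?(id)
      [['---', id, left[id]]]   # row present on one side only
    else
      [['+++', id, right[id]]]  # row present on the other side only
    end
  end
end

utf8_rows = [{ id: 'Q1', name: 'Zoë' }]
binary_other_id = [{ id: 'Q2', name: 'Zoé'.b }]
binary_same_id = [{ id: 'Q1', name: 'Zoé'.b }]

# No ID in common: the two values are never joined, so nothing raises.
no_overlap_markers = toy_diff(utf8_rows, binary_other_id).map(&:first)

# An ID in common: the changed cell joins both values and raises.
overlap_raised =
  begin
    toy_diff(utf8_rows, binary_same_id)
    false
  rescue Encoding::CompatibilityError
    true
  end
```

In this sketch, only the case with a shared ID concatenates strings from both sides, which matches why the revised test needs one to fail for the right reason.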

If you agree with my comments there, but don't have time to add such tests yourself soon, I'm happy to add them myself to get this live quicker.

BTW, I'm very happy for you to rewrite these tests as you like - my curiosity about this bug is satisfied :)