dedupeio / dedupe-examples

Examples for using the dedupe library
MIT License

UnicodeEncodeError 'charmap' in PGSQL BIG Dedupe Example #72

Closed wedwardbeck closed 6 years ago

wedwardbeck commented 6 years ago

I'm getting an error when running pgsql_big_dedupe_example.py as shown:

    blocking...
    creating blocking_map database
    creating inverted index
    writing blocking map
    Traceback (most recent call last):
      File "pgsql_big_dedupe_example.py", line 199, in <module>
        csv_writer.writerows(b_data)
      File "C:\Program Files\Python36\lib\tempfile.py", line 483, in func_wrapper
        return func(*args, **kwargs)
      File "C:\Program Files\Python36\lib\encodings\cp1252.py", line 19, in encode
        return codecs.charmap_encode(input,self.errors,encoding_table)[0]
    UnicodeEncodeError: 'charmap' codec can't encode character '\ufffd' in position 12: character maps to <undefined>

I'm not sure whether my database code page is driving this issue. It's PostgreSQL 9.6 on Windows 10. The error occurs at the CSV write of the blocking map, which exists so the map can be bulk loaded into Postgres quickly. Is this step critical? If there is no easy or quick fix for the Unicode issue, would writing directly to the PostgreSQL table 'blocking_map' hurt performance greatly?

If there are other potential fixes or workarounds, it would be great to hear them.
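For anyone reproducing this, the following minimal sketch shows why the last frame of the traceback fails. It assumes only what the traceback shows: the text is being encoded with cp1252, and the offending character is U+FFFD (the Unicode replacement character, which often means an undecodable byte was substituted at some earlier decoding step). The variable names are illustrative and not taken from the example script.

```python
import locale

# U+FFFD is the Unicode replacement character from the traceback.
replacement_char = "\ufffd"

# cp1252 has no mapping for U+FFFD, so encoding raises UnicodeEncodeError.
try:
    replacement_char.encode("cp1252")
    cp1252_ok = True
    cp1252_error = ""
except UnicodeEncodeError as exc:
    cp1252_ok = False
    cp1252_error = str(exc)

# UTF-8 can represent every Unicode code point, including U+FFFD.
utf8_bytes = replacement_char.encode("utf-8")

# The encoding Python uses by default for text files opened without an
# explicit encoding= argument; on many Windows installs this is cp1252.
default_encoding = locale.getpreferredencoding(False)

print("cp1252 can encode U+FFFD:", cp1252_ok)
print("error:", cp1252_error)
print("default text encoding:", default_encoding)
```

If the default text encoding printed on the affected machine is cp1252, that would be consistent with the error coming from the Python side of the CSV write rather than from the database code page.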

wedwardbeck commented 6 years ago

I dug further into the error on Stack Overflow, which pointed me to specifying the encoding for the CSV writer. Amending line 197 to pass encoding='utf-8' got me past the error. From reading other comments, I think this is a Windows-only issue, so I will move to Ubuntu to test further.
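A minimal sketch of that workaround follows. It assumes line 197 is where the temporary CSV file is opened, which the `tempfile.py` frame in the traceback suggests but the thread does not confirm. In Python 3 `csv.writer` has no encoding parameter, so the encoding is set on the file object instead. The function name `write_rows_utf8`, the file prefix, and the sample rows are hypothetical.

```python
import csv
import os
import tempfile


def write_rows_utf8(rows):
    """Write rows to a temporary CSV file encoded as UTF-8; return its path."""
    # encoding="utf-8" overrides the platform default (cp1252 on many
    # Windows installs); newline="" is what the csv module recommends.
    with tempfile.NamedTemporaryFile(
        mode="w",
        prefix="blocks_",      # hypothetical prefix, for illustration only
        suffix=".csv",
        delete=False,
        encoding="utf-8",
        newline="",
    ) as csv_file:
        csv.writer(csv_file).writerows(rows)
        return csv_file.name


# Hypothetical blocking-map rows; one key contains U+FFFD, the character
# that cp1252 could not encode in the traceback.
rows = [("name:\ufffdxyz", 1), ("name:abc", 2)]
path = write_rows_utf8(rows)

# Read the file back with the same encoding to confirm the round trip.
with open(path, encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

os.remove(path)
print(read_back)
```

If the file is then bulk loaded into Postgres, the load side also has to read it as UTF-8, so the database or client encoding is worth checking too.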