dchud / ddbench

Benchmarking suite for dedupe
Apache License 2.0

Encoding/char issue with dblp-scholar data #5

Open dchud opened 8 years ago

dchud commented 8 years ago

The id field in one of the dblp-scholar source files isn't coming through correctly. Note the different file formats:

data/dblp-scholar/DBLP-Scholar_perfectMapping.csv: ASCII text, with very long lines, with CRLF line terminators
data/dblp-scholar/DBLP1.csv:                       ISO-8859 English text, with very long lines, with CRLF line terminators
data/dblp-scholar/Scholar.csv:                     UTF-8 Unicode (with BOM) English text, with very long lines, with CRLF line terminators

bbengfort commented 8 years ago

Lovely. Well, since we control the data, we should probably just re-encode all of this into UTF-8.
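
A minimal sketch of what that re-encoding could look like, not the project's actual fix. The function names `reencode_to_utf8` and `reencode_all` are hypothetical, and the source encodings are assumptions: the report above says only "ISO-8859" for DBLP1.csv, so `latin-1` is a guess that should be checked against the data, and `utf-8-sig` is used for Scholar.csv so the BOM is dropped on read.

```python
from pathlib import Path

# Assumed source encodings, keyed by file name. "latin-1" for DBLP1.csv is a
# guess (the report only says "ISO-8859"); "utf-8-sig" strips the BOM.
ASSUMED_ENCODINGS = {
    "DBLP-Scholar_perfectMapping.csv": "ascii",
    "DBLP1.csv": "latin-1",
    "Scholar.csv": "utf-8-sig",
}


def reencode_to_utf8(src, dst, src_encoding):
    """Decode src with src_encoding and write it to dst as UTF-8 without a BOM."""
    # newline="" turns off newline translation, so CRLF terminators are kept.
    with open(src, "r", encoding=src_encoding, newline="") as fin:
        text = fin.read()
    with open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.write(text)


def reencode_all(data_dir, out_dir):
    """Re-encode every known file found in data_dir into out_dir."""
    data_dir, out_dir = Path(data_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for name, encoding in ASSUMED_ENCODINGS.items():
        src = data_dir / name
        if not src.exists():
            continue  # skip files that are not present
        reencode_to_utf8(src, out_dir / name, encoding)
        written.append(name)
    return written
```

Something like `reencode_all("data/dblp-scholar", "data/dblp-scholar-utf8")` would write the converted copies to a separate directory (the output path here is made up), leaving the originals in place until the result has been checked.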