RMLio / RML-Mapper

Generate High Quality Linked Data from multiple originally (semi-)structured data (legacy)
http://RML.io

UTF-8 encoding should be default for cells in CSVParser #20

Open seralf opened 7 years ago

seralf commented 7 years ago

Hi, the CSVProcessor assumes an encoding other than UTF-8 when reading cells: CSVProcessor.java#L72

Here is the snippet:

for (String header : reader.getHeaders()) {
  row.put(new String(header.getBytes("iso8859-1"), UTF_8), reader.get(header));
}
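To make the effect concrete, here is a minimal sketch of what that ISO-8859-1 to UTF-8 round trip does to a string. The class name `HeaderRoundTrip` and the helper `reDecode` are hypothetical and not part of RML-Mapper; the sketch only assumes the same `getBytes` / `new String` combination as the quoted line.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of RML-Mapper.
final class HeaderRoundTrip {

    // Same transformation as the quoted line:
    // encode the string as ISO-8859-1, then decode those bytes as UTF-8.
    static String reDecode(String s) {
        return new String(s.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String correct = "caf\u00e9"; // "cafe" with an acute accent on the e

        // Case 1: UTF-8 bytes were wrongly read as ISO-8859-1 upstream.
        String misread = new String(correct.getBytes(StandardCharsets.UTF_8),
                                    StandardCharsets.ISO_8859_1);
        System.out.println(reDecode(misread).equals(correct)); // true: the round trip repairs it

        // Case 2: the string was already decoded correctly.
        System.out.println(reDecode(correct).equals(correct)); // false: the round trip corrupts it
    }
}
```

So the round trip only helps when the reader has already decoded UTF-8 input as ISO-8859-1. If the input was decoded correctly, any non-ASCII character gets damaged.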

I suggest reading the bytes as UTF-8 by default and adding an "encoding" property with a default value (for example, again "UTF-8"), as suggested by the CSVW vocabulary itself: https://www.w3.org/ns/csvw#encoding
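A rough sketch of what I mean, using only the JDK. Everything here is an assumption for illustration: the class `EncodingAwareCsv`, the methods `resolveCharset` and `readHeaders`, and the naive comma split (no quoting support) are made up and do not reflect the actual CSVProcessor code or the CSV library it uses.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not the RML-Mapper implementation.
final class EncodingAwareCsv {

    // Map an optional "encoding" property to a Charset, defaulting to UTF-8.
    static Charset resolveCharset(String encoding) {
        if (encoding == null || encoding.trim().isEmpty()) {
            return StandardCharsets.UTF_8;
        }
        return Charset.forName(encoding.trim());
    }

    // Decode the stream once with the resolved charset and return the header cells.
    // The split on "," is deliberately naive and ignores quoted fields.
    static String[] readHeaders(InputStream in, String encoding) {
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, resolveCharset(encoding)))) {
            String line = reader.readLine();
            return line == null ? new String[0] : line.split(",", -1);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "nome,citt\u00e0".getBytes(StandardCharsets.UTF_8);
        // No encoding given, so UTF-8 is used.
        String[] headers = readHeaders(new ByteArrayInputStream(utf8), null);
        System.out.println(headers[1].equals("citt\u00e0")); // true
    }
}
```

With the charset applied when the stream is opened, no re-encoding of headers or cells is needed afterwards, and a source in another encoding only has to set the property (for example "ISO-8859-1").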

seralf commented 7 years ago

(This problem is somewhat similar to https://github.com/RMLio/RML-LogicalSourceHandler/issues/1)