DSpace-Labs / SAFBuilder

Builds a Simple Archive Format package from files and a spreadsheet
https://wiki.duraspace.org/display/DSPACE/Simple+Archive+Format+Packager

Input CSV with diacritics will become invalid #3

Closed peterdietz closed 10 years ago

peterdietz commented 10 years ago

SAFBuilder uses the default CsvReader, which defaults to ISO-8859-1. For English and some other European languages this doesn't appear to cause any issue. However, for other languages this default isn't sufficient, and it produces errors or invalid text.

So, if it encounters UTF-8 text: "Munín Sánchez, Lara M.", it produces invalid text such as:

<dcvalue element="contributor" qualifier="author">MunÌ_n SÌÁnchez, Lara M.</dcvalue>

This is because the input was UTF-8 and SAFBuilder reads it as ISO-8859-1, so each multi-byte character is decoded as two separate characters.
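The mismatch can be reproduced outside SAFBuilder. The following is a minimal sketch using only the Java standard library; the class and method names (`MojibakeDemo`, `misdecode`) are made up for illustration and are not part of SAFBuilder. The exact garbled glyphs you see depend on the viewer, so they may differ from the output quoted above.

```java
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Hypothetical helper: encode text as UTF-8, then decode those bytes
    // as ISO-8859-1, which is what happens when a UTF-8 CSV is read with
    // an ISO-8859-1 reader.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "Munín Sánchez, Lara M." written with Unicode escapes so the
        // source file compiles regardless of its own encoding.
        String original = "Mun\u00edn S\u00e1nchez, Lara M.";
        String garbled = misdecode(original);

        // Each accented letter is two bytes in UTF-8, so the misread
        // string is two characters longer than the original.
        System.out.println(original.length() + " -> " + garbled.length());
        System.out.println(garbled);
    }
}
```

Plain ASCII text survives the round trip unchanged, which is why the problem goes unnoticed with English-only input.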

To remedy this, we will force SAFBuilder to use UTF-8 all the time. I suppose we could detect the input, but that can complicate things. Let's all stick with UTF-8.
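As a rough illustration of what "force UTF-8" means, here is a sketch that decodes input with an explicit charset instead of relying on a default. It uses only the standard library; `Utf8CsvRead` and `firstLine` are hypothetical names, and the actual change in SAFBuilder may look different, for example by passing a charset or a UTF-8 `Reader` to the CSV library, assuming it accepts one.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

public class Utf8CsvRead {
    // Hypothetical helper: return the first line of the given bytes,
    // always decoding them as UTF-8 rather than a platform or library default.
    static String firstLine(byte[] data) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), StandardCharsets.UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Stand-in for the bytes of a UTF-8 encoded CSV file.
        byte[] csvBytes = "dc.contributor.author\nMun\u00edn S\u00e1nchez, Lara M.\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println(firstLine(csvBytes));
    }
}
```

The trade-off is the one mentioned above: input saved in another encoding will now be misread, so the CSV has to be exported as UTF-8.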

peterdietz commented 10 years ago

This fixes the issue. For the example above, the output for the author will now be:

<dcvalue element="contributor" qualifier="author">Munín Sánchez, Lara M.</dcvalue>