netsensei closed this 7 years ago
For reference: using the latest version of Catmandu::DBI on OSX Sierra.
Same problem in MySQL and Postgres
@netsensei Look more closely at the field "viaf_alternate", where these characters appear: Nicolò �dell� Abbate
and others too
So it is not only a problem in the field _id
@netsensei oops, those characters were the same in the CSV as you posted here
@nicolasfranck Yeah 😀 I pulled those from VIAF via Catmandu::VIAF. Other characters do work correctly, so those are an issue on their end, it seems to me.
@phochste Tested this. This works for me 👍 Thanks!
@netsensei tested what? ;-)
fixed in 0.0701
Problem
I'm trying to import a CSV file with UTF-8 data into a SQLite store using the DBI module. The _id field has values which contain non-ASCII characters (é, à, ë, ...) because these are names. When I convert the SQLite database back to JSON, CSV, ... the characters in the _id field are garbled:
So this
Abbate, Nicolò dell'
becomes this
Abbate, Nicolò dell'
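For illustration, one common way this kind of garbling arises is UTF-8 bytes being decoded as Latin-1 somewhere along the way. That mechanism is an assumption here, not something confirmed for Catmandu::DBI. A minimal Python sketch of the effect (the helper names `misread_as_latin1` and `repair` are made up for this example):

```python
# Illustration only: simulate UTF-8 bytes being misread as Latin-1.
# This is an assumed mechanism, not confirmed as the cause in Catmandu::DBI.

def misread_as_latin1(text: str) -> str:
    """Encode text as UTF-8, then (wrongly) decode those bytes as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Reverse the misreading: re-encode as Latin-1, then decode as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


original = "Abbate, Nicolò dell'"
garbled = misread_as_latin1(original)

print(garbled)                      # the two-byte "ò" shows up as two characters
print(repair(garbled) == original)  # prints True
```

If the garbled value in an export can be repaired this way, that would suggest a single wrong decode step on read rather than damaged data.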
Steps to reproduce
Given this CSV file called "creators.csv"
I import this into a SQLite database using DBI like this:
catmandu import CSV to DBI --data_source dbi:SQLite:/tmp/creators.sqlite < creators.csv
And then I go back to JSON (or any other format) with this:
catmandu export DBI --data_source dbi:SQLite:/tmp/creators.sqlite to JSON --pretty 1
Which yields this output:
Notice how the data in the other fields are all okay, but the data in the _id field is not.
Probable causes
The problem is that the _id field is a TEXT type, while all other fields are stored as a binary blob in SQLite. So, that explains why it works perfectly for the other fields like "viaf_alternate". It fails for the _id field because somewhere, somehow the UTF-8 conversion is not done correctly.
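To make the TEXT versus blob distinction concrete, here is a small Python sketch using the standard `sqlite3` module. The table layout and the `payload` column are hypothetical and are not Catmandu::DBI's actual schema. The wrong decoding is simulated on purpose through `text_factory`, as a stand-in for whatever step might be mishandling the TEXT column.

```python
import sqlite3

# Hypothetical table layout for illustration; not Catmandu::DBI's real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (_id TEXT PRIMARY KEY, payload BLOB)")

name = "Abbate, Nicolò dell'"
conn.execute(
    "INSERT INTO data (_id, payload) VALUES (?, ?)",
    (name, name.encode("utf-8")),  # str is stored as TEXT, bytes as BLOB
)

# Simulate a read path that decodes TEXT columns with the wrong charset.
conn.text_factory = lambda raw: raw.decode("latin-1")

text_id, blob_payload = conn.execute("SELECT _id, payload FROM data").fetchone()

# The BLOB comes back as raw bytes, so the application decodes it itself.
payload_ok = blob_payload.decode("utf-8") == name
# The TEXT column went through the deliberately wrong decoding step.
id_ok = text_id == name

print(payload_ok, id_ok)  # prints: True False
```

The same bytes are stored in both columns. Only the column that passes through the extra decoding step comes back garbled, which matches the symptom of a broken _id next to intact fields.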
On the CLI, inspecting the SQLite database with sqlite3 yields this:
So, if you look at the output, you'll notice that the characters in the _id field are stored intact in the database. It's only when Catmandu::DBI retrieves them that they get converted / garbled somewhere.
Any ideas what could go wrong here?
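One way to check what is actually stored, independent of how a terminal renders it, is to ask SQLite for the storage class and the raw bytes with `typeof()` and `hex()`. The sketch below uses Python's `sqlite3` module and an in-memory database with a hypothetical one-column table. Against the real file you would connect to its path and query its actual table instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (_id TEXT PRIMARY KEY)")  # hypothetical schema
conn.execute("INSERT INTO data (_id) VALUES (?)", ("Nicolò",))

# typeof() reports the storage class; hex() dumps the stored bytes.
storage_type, hex_bytes = conn.execute(
    "SELECT typeof(_id), hex(_id) FROM data"
).fetchone()

raw = bytes.fromhex(hex_bytes)
print(storage_type, hex_bytes)
print(raw.decode("utf-8"))
```

A correctly stored "ò" appears as the byte pair C3 B2. If the value had been double-encoded at write time, it would appear as C3 83 C2 B2 instead, which would point at the import step rather than the export step.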