Character encoding problem in book list?

BL-Labs / imagedirectory

Manifests of the public domain images uploaded to Flickr Commons, with descriptive information about the books they were taken from.

The Unlicense


Character encoding problem in book list? #3

Open tfmorris opened 10 years ago

tfmorris commented 10 years ago

The title for the book in row 20 is:

An Election Carol on the House of Commons [with reference to the prosecution of W. T. Stead and others]: or, a parody for the People ... By the author of Ã¢â‚¬Å“Boycotting the Blackguards.Ã¢â‚¬ï¿½. [electronic resource]

I've tried a variety of character encodings, including the obvious ISO Latin-1 and UTF-8, without any luck. What encoding is used for the file? Is it used consistently throughout?
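For reference, the garbled run in front of "Boycotting" looks like what you get when a curly quote is encoded as UTF-8 and then wrongly decoded as Windows-1252 twice over. That is only one plausible reading and is not confirmed anywhere in this thread. The Python sketch below checks it; the helper names `misdecode_once` and `repair_once` are made up for illustration.

```python
def misdecode_once(text: str) -> str:
    """Encode as UTF-8, then (wrongly) decode the bytes as Windows-1252.

    Bytes with no Windows-1252 mapping (such as 0x9D) become U+FFFD.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


def repair_once(text: str) -> str:
    """Undo one round of the mistake above (fails if data was lost)."""
    return text.encode("cp1252").decode("utf-8")


LEFT_QUOTE = "\u201c"   # left double quotation mark
RIGHT_QUOTE = "\u201d"  # right double quotation mark

# Two rounds of mis-decoding, as hypothesised for row 20.
garbled_open = misdecode_once(misdecode_once(LEFT_QUOTE))
garbled_close = misdecode_once(misdecode_once(RIGHT_QUOTE))

# The opening quote survives both rounds and can be recovered exactly.
recovered_open = repair_once(repair_once(garbled_open))

print(repr(garbled_open))
print(repr(garbled_close))
print(repr(recovered_open))
```

Under this assumption the opening quote can be repaired mechanically. The closing quote cannot, because its third UTF-8 byte (0x9D) has no Windows-1252 mapping and is replaced in the first round, which would explain why the two garbled runs in the title end differently.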

benosteen commented 10 years ago

Is it from the book_list.csv that was just pushed? If so, I'm not sure which encoding was used, but it'll be sorted once I get back in the office on Monday (or on Sunday night once I get home).

tfmorris commented 10 years ago

Yup, that's the one. Thanks for looking into it.

benosteen commented 10 years ago

https://github.com/BL-Labs/imagedirectory/blob/master/book_metadata.json is a large file which contains correctly encoded JSON of the book metadata. Sadly, because of its size, it can only be obtained by pulling the repo.

tfmorris commented 10 years ago

Thanks for the update. I'll have a look at the new file. Is the renaming of book_list.csv in 7c3d01d8d1df378cdf9eb57ccd0849d285bee2c1 an acknowledgement that the character encoding isn't correct? If so, could it be recreated from the JSON file? It's easy for me to work with JSON in OpenRefine, but for a lot of people a line-per-record CSV file is significantly easier to fit into their toolchain.
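As a rough idea of what recreating the CSV could look like, here is a Python sketch. It assumes the JSON maps each book id to a metadata object (the layout discussed later in this thread). The field names `identifier`, `title` and `date`, the sample records and the function name `metadata_to_csv` are all hypothetical, not taken from book_metadata.json.

```python
import csv
import io
import json

# Hypothetical sample standing in for book_metadata.json.
SAMPLE_JSON = """
{
  "0001": {"identifier": "0001", "title": "First Example", "date": "1885"},
  "0002": {"identifier": "0002", "title": "Second, with comma"}
}
"""


def metadata_to_csv(book_metadata: dict, fieldnames: list) -> str:
    """Write one CSV row per book; missing fields are left empty."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        extrasaction="ignore",  # skip fields not listed in fieldnames
        lineterminator="\n",
    )
    writer.writeheader()
    for record in book_metadata.values():
        writer.writerow(record)
    return buffer.getvalue()


csv_text = metadata_to_csv(
    json.loads(SAMPLE_JSON), ["identifier", "title", "date"]
)
print(csv_text)
```

When writing to a real file rather than a string, opening it with `encoding="utf-8"` and `newline=""` would keep the output encoding explicit, which is the point of this issue.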

benosteen commented 10 years ago

Yep, the renaming was a quick fix. The book_list.csv came from a colleague here and reflects the 'new' electronic record for these books (i.e. new Aleph/system number, etc.). Useful, but with the wrong encoding. (The link that joins the two is elusive here. I may have to fall back to fuzzy matching on titles, but I digress.)

I will be putting out a row based file for the metadata soon for exactly the reason you say, but I've just got back into the office so it'll be a little while!

tfmorris commented 10 years ago

I spoke too quickly when saying this file would be easy to use in OpenRefine. The {bookid: {bookmetadata}} structure is problematic for Refine because it expects keys to be constant and the data to be in the values. For Refine to be able to deal with it, the file would need to be converted to [bookmetadata, bookmetadata, ...], which is also, arguably, more semantically correct. The book metadata piece already includes the identifier field, so having it as a key at the next level up doesn't really add any value.
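A minimal Python sketch of that conversion, in case it helps. The sample data and the function name `to_record_list` are hypothetical, and the key name `identifier` is assumed from the description above rather than checked against the file.

```python
import json

# Hypothetical sample in the current {bookid: {bookmetadata}} shape.
keyed = {
    "0001": {"identifier": "0001", "title": "First Example"},
    "0002": {"title": "Second Example"},  # identifier missing on purpose
}


def to_record_list(keyed_metadata: dict) -> list:
    """Flatten {bookid: metadata} into [metadata, metadata, ...]."""
    records = []
    for book_id, metadata in keyed_metadata.items():
        record = dict(metadata)  # copy so the input is not modified
        # Keep the id even if a record lacks its own identifier field.
        record.setdefault("identifier", book_id)
        records.append(record)
    return records


records = to_record_list(keyed)
print(json.dumps(records, indent=2, ensure_ascii=False))
```

The `setdefault` call is only a safeguard: if every record really does carry its identifier, dropping the top-level keys loses nothing.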

benosteen commented 10 years ago

Agreed. The top-level structure is due to my typical usage of the data as a lookup. I'll shift it to a list to make it easier to work with.