Open tfmorris opened 10 years ago
Is it from the book_list.csv that was just pushed? If so, I'm not sure which encoding was used, but it'll be sorted once I'm back in the office on Monday (or on Sunday night once I get home).
Yup, that's the one. Thanks for looking into it.
https://github.com/BL-Labs/imagedirectory/blob/master/book_metadata.json contains correctly encoded JSON of the book metadata. It is a large file and, sadly, because of its size it can only be obtained via a repo pull.
Thanks for the update. I'll have a look at the new file. Is the renaming of book_list.csv in 7c3d01d8d1df378cdf9eb57ccd0849d285bee2c1 an acknowledgement that the character encoding isn't correct? If so, could it be recreated from the JSON file? It's easy for me to work with JSON in OpenRefine, but for a lot of people a line-per-record CSV file is significantly easier to fit into their toolchain.
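For anyone who wants a line-per-record CSV in the meantime, here is a minimal sketch of how one could be produced from JSON records using only the Python standard library. The field names `identifier` and `title` and the sample rows are made up for illustration; the real keys in book_metadata.json may differ.

```python
import csv
import io

def records_to_csv(records, fieldnames):
    """Serialise an iterable of dicts to CSV text, one record per line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(
        buffer,
        fieldnames=fieldnames,
        restval="",             # missing fields become empty cells
        extrasaction="ignore",  # fields not listed in fieldnames are dropped
    )
    writer.writeheader()
    for record in records:
        writer.writerow(record)
    return buffer.getvalue()

# Made-up sample records; the field names are hypothetical.
sample_records = [
    {"identifier": "000000001", "title": "Example title, with a comma"},
    {"identifier": "000000002", "title": "Exämple with a non-ASCII letter"},
]
csv_text = records_to_csv(sample_records, ["identifier", "title"])
print(csv_text)
```

When writing the result to disk, opening the file with `open(path, "w", encoding="utf-8", newline="")` keeps the encoding explicit, which is the point of the exercise here.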
Yep, the renaming was a quick fix. The book_list.csv came from a colleague here and reflects the 'new' electronic record for these books (i.e. new Aleph/system number/etc). Useful, but with the wrong encoding. (The link that joins the two is elusive here. I may have to fall back to fuzzy matching on titles, but I digress.)
I will be putting out a row-based file for the metadata soon for exactly the reason you say, but I've just got back into the office so it'll be a little while!
I spoke too quickly when I said this file would be easy to use in OpenRefine. The {bookid: {bookmetadata}} structure is problematic for Refine because it expects keys to be constant and the data to be in the values. For Refine to handle it, the file would need to be converted to [bookmetadata, bookmetadata, ...], which is also, arguably, more semantically correct. The book metadata piece already includes the identifier field, so having it as a key one level up doesn't add any value.
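In case it helps, the conversion is only a few lines. This is a rough sketch, not a patch against the repo: the field name `identifier` is my assumption for whatever the metadata calls its ID field, and the sample data is made up.

```python
import json

def to_record_list(keyed, id_field="identifier"):
    """Turn {bookid: {metadata}} into [metadata, metadata, ...].

    id_field is a hypothetical name for the ID field inside each record.
    If a record lacks it, the top-level key is copied in so nothing is lost.
    """
    records = []
    for book_id, metadata in keyed.items():
        record = dict(metadata)          # copy so the input is not mutated
        record.setdefault(id_field, book_id)
        records.append(record)
    return records

# Made-up sample in the keyed shape described above.
keyed_sample = {
    "000000001": {"identifier": "000000001", "title": "First example"},
    "000000002": {"title": "Second example, missing its identifier"},
}
record_list = to_record_list(keyed_sample)
print(json.dumps(record_list, ensure_ascii=False, indent=2))
```

For the real file, the input would come from `json.load` on book_metadata.json opened with `encoding="utf-8"`.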
Agreed. The top-level structure is due to my typical usage of the data as a lookup. I'll shift it to a list to make it easier to work with.
The title for the book in row 20 is:
An Election Carol on the House of Commons [with reference to the prosecution of W. T. Stead and others]: or, a parody for the People ... By the author of “Boycotting the Blackguards.�. [electronic resource]
I've tried a variety of character encodings, including the obvious ISO Latin-1 and UTF-8, without any luck. What encoding is used for the file? Is it used consistently throughout?
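For reference, this is roughly the kind of check I mean: decode the same raw bytes with several candidate encodings and compare the results. It is a sketch only. The candidate list is my own choice, and the sample bytes are a correctly encoded UTF-8 string that I constructed, not bytes taken from book_list.csv.

```python
CANDIDATES = ["utf-8", "cp1252", "iso-8859-1"]

def try_decodings(raw, candidates=CANDIDATES):
    """Decode raw bytes with each candidate codec.

    Returns {codec_name: decoded_text}, with None where decoding fails.
    """
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results

# Constructed sample: curly quotes encoded as UTF-8.
original = "\u201cBoycotting the Blackguards.\u201d"
raw_bytes = original.encode("utf-8")

decoded = try_decodings(raw_bytes)
for name, text in decoded.items():
    print(f"{name:12} {text!r}")
```

Note that ISO Latin-1 never fails to decode, since every byte value maps to some character, so a clean decode under that codec proves nothing on its own; the output has to be read by eye. Running this on the raw bytes of the affected rows, rather than on text that has already been decoded once, is what would show whether the file is consistent throughout.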