LibrariesHacked / openlibrary-search

Searching Open Library by keywords to return ISBNs

Error loading editions #1

Closed · thedug closed 1 year ago

thedug commented 2 years ago

I'm getting this error when trying to copy editions. The next step in the readme is to update editions and set the work key, but I'm getting an error about missing data for work_key even when just trying to do the copy.

Do I need to add another column to the end of each line of the editions dump?

postgres=# COPY editions FROM '/Users/thedug/Documents/Workspace/ol_dump/ol_dump_editions_2021-11-30_processed.csv' DELIMITER E'\t' QUOTE '|' CSV;
ERROR:  missing data for column "work_key"
CONTEXT:  COPY editions, line 1: "/type/edition /books/OL10001035M 2 2010-03-11T23:52:40.542344 {"publishers": ["Stationery Office Boo..."
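For reference, the mismatch shows up if you count the tab-separated fields in the first line of the dump and compare that against the table's column count (a quick check, assuming the processed dump is tab-delimited as in the COPY above):

head -n 1 ol_dump_editions_2021-11-30_processed.csv | awk -F '\t' '{print NF}'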

thedug commented 2 years ago

I ended up dropping the column and will re-add it.
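Roughly this, with the type of the re-added column being an assumption on my part (use whatever the schema defines):

ALTER TABLE editions DROP COLUMN work_key;
-- ... run the COPY import ...
ALTER TABLE editions ADD COLUMN work_key text;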

I also ran into an issue with null chars. I used this to remove them.

tr -d '\000' < ol_dump_editions_2021-11-30_processed.csv > ol_dump_editions_2021-11-30_processed_nonulls.csv

DaveBathnes commented 2 years ago

Hi @thedug thanks for raising this!

I was slightly surprised by the sudden interest in these notes but realised OpenLibrary had seen them and linked to the repository. They were written quite a while ago without really expecting anyone to look at them, so I imagine quite a few things have changed now - but it looks like you've done a good job of tackling that problem!

I've got some time booked in to properly look at this repository though, and make sure there are some decent scripts, so thanks for your notes on this.

gennaios commented 2 years ago

By chance, I also happened to find this recently and have an interest. I'll be importing into SQLite. I'm not sure what the ideal approach would be, but perhaps reformatting the dumps so that they could be imported into any database? If that's possible and you'd consider it, that'd be great.
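For what it's worth, the tab-delimited dumps look like they'd go into SQLite fairly directly. A minimal sketch, assuming the same processed file as above and an editions table created beforehand with matching columns:

sqlite> .mode tabs
sqlite> .import ol_dump_editions_2021-11-30_processed.csv editions

(.mode tabs sets the import separator to a tab; .import then loads the file into the existing table.)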

DaveBathnes commented 1 year ago

Just a quick update on this. I'm almost there with a significant refactor which will properly script the database creation (all currently in this branch https://github.com/LibrariesHacked/openlibrary-search/tree/1-error-loading-editions). I've been testing today but due to the sheer size of data it's been going all day. The first attempt failed when disk space ran out!

@thedug On your original question: you were right, of course, that the work_key column was causing the error, so dropping it and recreating it once the COPY import is done would have fixed it. I think my notes must have omitted the fact that I start off with a table containing only the columns present in the dump, so that the COPY command works, then add the work_key column and populate it. The editions table just needs work_key added to link it to the works table; the authorship table then links the authors and works tables.
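To sketch that out (column names and types here are from memory, so treat them as illustrative; the JSON column is assumed to be called data):

-- Create editions with only the columns present in the dump,
-- so each field in the file maps one-to-one onto a column.
CREATE TABLE editions (
    type text,
    key text,
    revision integer,
    last_modified timestamp,
    data jsonb
);

COPY editions FROM '/path/to/ol_dump_editions_2021-11-30_processed.csv' DELIMITER E'\t' QUOTE '|' CSV;

-- Then add work_key and fill it from the edition JSON,
-- which is what links editions to works.
ALTER TABLE editions ADD COLUMN work_key text;
UPDATE editions SET work_key = data->'works'->0->>'key';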

@gennaios I think it would definitely be a good enhancement to make it database agnostic. Once I have the database scripts I'd like to refactor them to allow for multiple database engines. There are plenty of complexities to that: indexing the JSON column, for example, uses a command particular to PostgreSQL, and the COPY commands are by far the quickest way of getting the data into a PostgreSQL database, but something more general would be needed to work across databases.
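For example, indexing the JSON column looks something like this in PostgreSQL (again assuming a jsonb column named data); a GIN index over jsonb has no direct equivalent in most other engines:

CREATE INDEX editions_data_idx ON editions USING GIN (data);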

Thanks for the feedback and apologies for very late replies!

Xaneets commented 1 year ago

@DaveBathnes

> Just a quick update on this. I'm almost there with a significant refactor which will properly script the database creation (all currently in this branch https://github.com/LibrariesHacked/openlibrary-search/tree/1-error-loading-editions). I've been testing today but due to the sheer size of data it's been going all day. The first attempt failed when disk space ran out!

To speed up the data cleaning, you can use Fast-Open-Library, which is based on the Rust language. It cleans the data more than 6 times faster.