BloomTech-Labs / betterreads-ds

MIT License

Open Library Data Cleaning + Upload to AWS RDS #5

Closed by mvkumar14 4 years ago

mvkumar14 commented 4 years ago

The OpenLibrary data is largely unusable as-is: the works and editions data contain many entries that are uninformative garbage. Bots created duplicate 'works' and 'editions' entries in the original OpenLibrary data. These duplicates have mostly null values and differ only in small ways, such as a different publisher or description.

However, there is a solution to this issue. Each editions entry contains a text array labeled 'Works', which holds a link/key to the work that the edition is supposed to be associated with. You can use this information to eliminate "garbage" entries in the works table.
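
A minimal sketch of that filtering idea in Python, not the code from the branch. It assumes each record is a dict with a `key` field, that an edition's `works` field is a list of work keys (either bare strings or `{"key": ...}` mappings), and that a work no edition points to counts as "garbage". The function names and sample records are hypothetical.

```python
from typing import Iterable, List, Set


def referenced_work_keys(editions: Iterable[dict]) -> Set[str]:
    """Collect every work key that at least one edition points to."""
    keys: Set[str] = set()
    for edition in editions:
        # Assumption: 'works' may be missing or null on garbage editions.
        for ref in edition.get("works") or []:
            # Accept either a bare key string or a {"key": ...} mapping.
            key = ref.get("key") if isinstance(ref, dict) else ref
            if key:
                keys.add(key)
    return keys


def drop_unreferenced_works(works: Iterable[dict], editions: Iterable[dict]) -> List[dict]:
    """Keep only the works that some edition links to."""
    keep = referenced_work_keys(editions)
    return [work for work in works if work.get("key") in keep]


# Hypothetical sample records, for illustration only.
editions = [
    {"key": "/books/OL1M", "works": ["/works/OL1W"]},
    {"key": "/books/OL2M", "works": [{"key": "/works/OL1W"}]},
    {"key": "/books/OL3M", "works": None},
]
works = [
    {"key": "/works/OL1W", "title": "Example Title"},
    {"key": "/works/OL9W", "title": None},
]

cleaned = drop_unreferenced_works(works, editions)
```

With the sample records, only the work that an edition links to survives. On the full dump the same check would probably be better done as a join inside the database rather than in memory.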

See this branch for more details, and some of the work that has been done with the OpenLibrary API and AWS RDS database: https://github.com/Lambda-School-Labs/betterreads-ds/tree/database-management/Database-management

Here is the Trello card associated with this issue: https://trello.com/c/RMeNuIes

michael-rowland commented 4 years ago

We have deployed our database with Goodbooks and Google Books data to AWS RDS.