BookOps-CAT / NightShift

Copy cataloging bot
MIT License

Datastore tables #1

Closed klinga closed 2 years ago

klinga commented 2 years ago

Hi @mwbenowitz,

This is the initial schema for the bot's database. I'm expecting it will require some adjustments as we go, but I hope it is an OK start for now.

Considerations:

A record representation string was added for debugging and logging purposes.

I added an archived column to Resource to indicate any bibs whose queries have been deleted (old, irrelevant records). I'm expecting the cleanup will include emptying the WorldcatQuery.response column, but other data could stay for any statistical analysis in the future.
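
For context, here is a minimal SQLAlchemy sketch of how these two tables might look; the table/column names and types are just illustrative, not the final schema:

```python
# Illustrative sketch only; names and types are assumptions, not the final schema.
from sqlalchemy import Boolean, Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Resource(Base):
    __tablename__ = "resource"

    nid = Column(Integer, primary_key=True)
    archived = Column(Boolean, default=False)  # bib whose queries have been cleaned up

    queries = relationship("WorldcatQuery", back_populates="resource")

    def __repr__(self):
        # representation string for debugging and logging
        return f"<Resource(nid={self.nid}, archived={self.archived})>"


class WorldcatQuery(Base):
    __tablename__ = "worldcat_query"

    nid = Column(Integer, primary_key=True)
    resource_id = Column(Integer, ForeignKey("resource.nid"))
    response = Column(String)  # full API response; emptied during cleanup

    resource = relationship("Resource", back_populates="queries")

    def __repr__(self):
        return f"<WorldcatQuery(nid={self.nid}, resource_id={self.resource_id})>"
```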

Looking forward to your review!

mwbenowitz commented 2 years ago

One additional comment in general: I've used alembic to manage db migrations and I've found it really useful, especially since you can upgrade/downgrade the schema with a few simple commands. It makes testing really easy and is essentially self-documenting. Just a thought regarding the need (I'm sure) to eventually adjust this schema.
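
For reference, an alembic migration is just a small Python module with upgrade()/downgrade() functions, roughly like this (the revision ids and the column here are placeholders for illustration):

```python
"""Add archived column to resource.

Illustrative alembic migration; revision identifiers are placeholders.
"""
from alembic import op
import sqlalchemy as sa

revision = "0002"
down_revision = "0001"


def upgrade():
    # applied with `alembic upgrade head`
    op.add_column(
        "resource",
        sa.Column("archived", sa.Boolean(), nullable=False, server_default=sa.false()),
    )


def downgrade():
    # reverted with `alembic downgrade -1`
    op.drop_column("resource", "archived")
```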

klinga commented 2 years ago

Thanks for the alembic tip. I will look into it.

  • We should add indexes where necessary to support the queries that will be made (especially on Foreign Key columns). That could be done now, or indexes could be added as you write the queries, which would help show where they're needed.

I guess the need will become more apparent when I start writing query methods, so maybe we can update it then?
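
For example, once the queries exist, an index on the foreign key column could be declared inline (purely illustrative, reusing the hypothetical names from the sketch above):

```python
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class WorldcatQuery(Base):
    """Hypothetical example: index=True emits an index on the FK column."""

    __tablename__ = "worldcat_query"

    nid = Column(Integer, primary_key=True)
    resource_id = Column(Integer, ForeignKey("resource.nid"), index=True)
```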

  • Looking at this I thought perhaps it might help to initially do a "soft" delete of records that have been fully processed, say in case someone wants to look at how that was done, and then "hard" delete them after a few more months. You mentioned "archiving" some records in the PR, but I don't see how that would work. I think we could support that functionality by adding deleted and deletedTimestamp columns to the Resource table. Perhaps there is a different approach though.

I've been struggling a little with the deletion issue, maybe because I was overthinking it. I will go with your suggestion of a soft delete, as this leaves some room for various solutions in the future. For any statistics, etc., maybe it will be better if an additional table is created; the data could be flattened a little to serve stat purposes better in such a table. It would persist any important information and eliminate the need to store large data related to individual records or queries. Let's keep this issue open for now. Statistical info is not a focus at the moment and it could be developed much later, even after the launch.
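
Something like this is what I have in mind for the soft delete, just as a sketch (the column names follow your suggestion; the helper function is hypothetical):

```python
from datetime import datetime, timezone

from sqlalchemy import Boolean, Column, DateTime, Integer
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Resource(Base):
    """Hypothetical: only the soft-delete columns are shown here."""

    __tablename__ = "resource"

    nid = Column(Integer, primary_key=True)
    deleted = Column(Boolean, default=False, nullable=False)
    deletedTimestamp = Column(DateTime)


def soft_delete(session: Session, nid: int) -> None:
    # mark a fully processed record instead of removing the row;
    # a later cleanup job can hard-delete rows older than some cutoff
    resource = session.get(Resource, nid)
    resource.deleted = True
    resource.deletedTimestamp = datetime.now(timezone.utc)
    session.commit()
```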

mwbenowitz commented 2 years ago

On indexing: yeah, I think looking at what queries are needed and creating the indexes based on those is a good approach (as you know, they're very easy to add).

And yeah, I think the soft deletion makes sense for the flexibility. It's one of those things where it's way easier to get more aggressive about deletions after the fact than to realize later that you actually wanted that data around for whatever reason. I like the idea of flattening the data in a separate table, and I think this structure would support figuring out how to do that. If/when we're comfortable with a more compact format for storing that stuff, we can figure out how/when to do hard deletes.
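
As a rough illustration of the flattened-stats idea (entirely hypothetical; which columns to keep would depend on what statistics end up mattering):

```python
from sqlalchemy import Column, Date, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ResourceStat(Base):
    """Hypothetical flattened row: one compact record per processed resource,
    keeping counts and outcomes but none of the bulky per-query response data."""

    __tablename__ = "resource_stat"

    nid = Column(Integer, primary_key=True)
    resource_category = Column(String)
    queries_count = Column(Integer)
    outcome = Column(String)  # e.g. "upgraded", "not found"
    processed_date = Column(Date)
```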

Other than that it looks good! It's annoying that the Metadata API returns data in two formats; with the old endpoints you could specify the format you wanted (right?), which was a nice feature. I wonder why they did away with that.