BookOps-CAT / NightShift

Copy cataloging bot
MIT License

Datastore tables #1

Closed klinga closed 2 years ago

klinga commented 2 years ago

Hi @mwbenowitz,

This is the initial schema for the bot's database. I'm expecting it will require some adjustments as we go, but I hope it is an OK start for now.

Considerations:

A record representation string was added for debugging and logging purposes.

I added an archived column to Resource to indicate any bibs whose queries have been deleted (old, irrelevant records). I'm expecting the cleanup will include emptying the WorldcatQuery.response column, but other data could stay for any statistical analysis in the future.
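
For context, here is a minimal SQLAlchemy sketch of how these two tables might look; the table/column names and types are just illustrative, not the final schema:

```python
# Illustrative sketch only; names and types are assumptions, not the final schema.
from sqlalchemy import Boolean, Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()


class Resource(Base):
    __tablename__ = "resource"

    nid = Column(Integer, primary_key=True)
    archived = Column(Boolean, default=False)  # bib whose queries have been cleaned up

    queries = relationship("WorldcatQuery", back_populates="resource")

    def __repr__(self):
        # representation string for debugging and logging
        return f"<Resource(nid={self.nid}, archived={self.archived})>"


class WorldcatQuery(Base):
    __tablename__ = "worldcat_query"

    nid = Column(Integer, primary_key=True)
    resource_id = Column(Integer, ForeignKey("resource.nid"))
    response = Column(String)  # full API response; emptied during cleanup

    resource = relationship("Resource", back_populates="queries")

    def __repr__(self):
        return f"<WorldcatQuery(nid={self.nid}, resource_id={self.resource_id})>"
```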

Looking forward to your review!

mwbenowitz commented 2 years ago

One additional comment in general: I've used alembic to manage db migrations and I've found it really useful, especially since you can upgrade/downgrade the schema with a few simple commands. It makes testing really easy and is essentially self-documenting. Just a thought regarding the need (I'm sure) to eventually adjust this schema.
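
For reference, an alembic migration is just a small Python module with upgrade()/downgrade() functions, roughly like this (the revision ids and the column here are placeholders for illustration):

```python
"""Add archived column to resource.

Illustrative alembic migration; revision identifiers are placeholders.
"""
from alembic import op
import sqlalchemy as sa

revision = "0002"
down_revision = "0001"


def upgrade():
    # applied with `alembic upgrade head`
    op.add_column(
        "resource",
        sa.Column("archived", sa.Boolean(), nullable=False, server_default=sa.false()),
    )


def downgrade():
    # reverted with `alembic downgrade -1`
    op.drop_column("resource", "archived")
```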

klinga commented 2 years ago

Thanks for the alembic tip. I will look into it.

  • We should add indexes where necessary to support the queries that will be made (especially on Foreign Key columns). That could be done now, or indexes could be added as you write the queries, which would help show where they're needed.

I guess the need will become more apparent when I start writing query methods, so maybe we can update it then?
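
For example, once the queries exist, an index on the foreign key column could be declared inline (purely illustrative, reusing the hypothetical names from the sketch above):

```python
from sqlalchemy import Column, ForeignKey, Integer
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class WorldcatQuery(Base):
    """Hypothetical example: index=True emits an index on the FK column."""

    __tablename__ = "worldcat_query"

    nid = Column(Integer, primary_key=True)
    resource_id = Column(Integer, ForeignKey("resource.nid"), index=True)
```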

  • Looking at this I thought perhaps it might help to initially do a "soft" delete of records that have been fully processed, say in case someone wants to look at how that was done, and then "hard" delete them after a few more months. You mentioned "archiving" some records in the PR, but I don't see how that would work. I think we could support that functionality by adding deleted and deletedTimestamp columns to the Resource table. Perhaps there is a different approach though.

I've been struggling a little with the deletion issue, maybe because I was overthinking it. I will go with your suggestion of a soft delete, as this leaves some room for various solutions in the future. For any statistics, etc., maybe it will be better if an additional table is created; the data could be flattened a little to serve stat purposes better in such a table. It would persist any important information and eliminate the need to store large data related to individual records or queries. Let's keep this issue open for now. Statistical info is not a focus at the moment and it could be developed much later, even after the launch.
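
Something like this is what I have in mind for the soft delete, just as a sketch (the column names follow your suggestion; the helper function is hypothetical):

```python
from datetime import datetime, timezone

from sqlalchemy import Boolean, Column, DateTime, Integer
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Resource(Base):
    """Hypothetical: only the soft-delete columns are shown here."""

    __tablename__ = "resource"

    nid = Column(Integer, primary_key=True)
    deleted = Column(Boolean, default=False, nullable=False)
    deletedTimestamp = Column(DateTime)


def soft_delete(session: Session, nid: int) -> None:
    # mark a fully processed record instead of removing the row;
    # a later cleanup job can hard-delete rows older than some cutoff
    resource = session.get(Resource, nid)
    resource.deleted = True
    resource.deletedTimestamp = datetime.now(timezone.utc)
    session.commit()
```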

mwbenowitz commented 2 years ago

On indexing: yeah, I think looking at what queries are needed and creating the indexes based on those is a good approach (as you know, they're very easy to add).

And yeah, I think the soft deletion makes sense for the flexibility. It's one of those things where it's way easier to get more aggressive about deletions after the fact than to realize later that you actually wanted that data around for whatever reason. I like the idea of flattening the data in a separate table, and I think this structure would support figuring out how to do that. If/when we're comfortable with a more compact format for storing that stuff, we can figure out how/when to do hard deletes.
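
As a rough illustration of the flattened-stats idea (entirely hypothetical; which columns to keep would depend on what statistics end up mattering):

```python
from sqlalchemy import Column, Date, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()


class ResourceStat(Base):
    """Hypothetical flattened row: one compact record per processed resource,
    keeping counts and outcomes but none of the bulky per-query response data."""

    __tablename__ = "resource_stat"

    nid = Column(Integer, primary_key=True)
    resource_category = Column(String)
    queries_count = Column(Integer)
    outcome = Column(String)  # e.g. "upgraded", "not found"
    processed_date = Column(Date)
```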

Other than that it looks good! It's annoying that the Metadata API returns data in two formats; with the old endpoints you could specify the format you wanted (right?), which was a nice feature. I wonder why they did away with that.