
Wrangle the data export from the current website and import into the database #32

Open · Fehings opened this issue 3 months ago

Fehings commented 3 months ago

We will be getting the full backlog of data, with tags etc. that aren't included in the master sheet, from the company in charge of the current website. We expect this to arrive in a spreadsheet-readable format (possibly CSV?). We have been told we cannot specify a format without incurring additional costs, so it is not possible to know in advance what this will look like, but it will likely be one giant mess of a file that will need cleaning and sorting into a usable format before it can be imported into our database. This issue covers getting that data into our database.
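Until we actually see the file there's not much to build, but a first pass will probably look something like this (a sketch assuming CSV; `export.csv` and the cleanup steps are placeholders until we know the real columns and delimiter):

```python
import pandas as pd

# Read everything as strings so nothing is silently coerced; we don't
# yet know the real columns, types, or delimiter.
raw = pd.read_csv("export.csv", dtype=str, keep_default_na=False)

# Normalise whitespace and drop rows that are entirely empty.
raw = raw.apply(lambda col: col.str.strip())
raw = raw[~(raw == "").all(axis=1)]

# Report what we actually received before committing to a schema.
print(raw.shape)
print(raw.columns.tolist())
```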

thatgardnerone commented 1 month ago

> We will be getting the full backlog of data, with tags etc. that aren't included in the master sheet, from the company in charge of the current website.

@Ettie-ClaytonTV Can we get a ballpark size, either from the current hosting team or, if not (or if they'd charge for it), from some napkin maths of our own?

Thinking in terms of migration time: it could take half a day or more of leaving a (fully functional and well-tested) migration script running to populate our Postgres db. (Maybe we can touch base with Al and get his help directly with this.)
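For the script itself, something batched along these lines would keep a long run restartable rather than risking one giant transaction (a sketch only; `load_cleaned_rows` is a hypothetical helper and the `videos` columns are guesses at our schema):

```python
import psycopg2
from psycopg2.extras import execute_values

# Placeholder connection string; load_cleaned_rows is a hypothetical
# helper returning a list of (title, speaker, category) tuples.
conn = psycopg2.connect("dbname=claytontv")
rows = load_cleaned_rows()

BATCH = 1000
cur = conn.cursor()
for i in range(0, len(rows), BATCH):
    execute_values(
        cur,
        "INSERT INTO videos (title, speaker, category) VALUES %s",
        rows[i:i + BATCH],
    )
    conn.commit()  # commit per batch so a crash partway keeps progress
cur.close()
conn.close()
```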

thatgardnerone commented 1 month ago

> need cleaning and sorting into a usable format before it can be imported into our database

TL;DR: Save the data dump in whatever messy state it arrives in, and extract only the clean data we want. We can sort out the rest over time, without the existing data going stale and needing fresh imports.

**Detailed question**

What kind of cleaning and sorting will be required?

Can we avoid incompatibilities and extra work by importing all the data in whatever shape it is, and simply checking after the import that the relevant data is there? Specifically, any of the "extra" and "optional" data, like the metadata the current site uses or the programme data we no longer care about, can be stored in a messy collection somewhere.

This way we focus on just the "bare bones" data we actually care about, such as the details of a video, speaker, category, etc., and as the project grows, or over the following weeks, we can import and clean the remaining "nice-to-have" data.
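A minimal sketch of what that "messy collection" could look like in Postgres, assuming we go the JSONB route; `raw_import`, `videos`, and `source_row` are placeholder names, not anything agreed:

```python
import json
import psycopg2

conn = psycopg2.connect("dbname=claytontv")  # placeholder connection
cur = conn.cursor()

# The "messy collection": one JSONB blob per source row, exactly as received.
cur.execute("""
    CREATE TABLE IF NOT EXISTS raw_import (
        id serial PRIMARY KEY,
        source_row jsonb NOT NULL
    )
""")

# Example row; in practice this comes straight from the export file.
row = {"title": "Sample talk", "speaker": "A. Speaker", "legacy_tags": "a;b;c"}
cur.execute("INSERT INTO raw_import (source_row) VALUES (%s)", (json.dumps(row),))

# Later, pull only the fields we trust into the real schema
# (assumes a videos(title, speaker) table already exists).
cur.execute("""
    INSERT INTO videos (title, speaker)
    SELECT source_row->>'title', source_row->>'speaker'
    FROM raw_import
""")

conn.commit()
```

Nothing in the staging table ever blocks the import, and the "nice-to-have" fields stay queryable via JSONB operators whenever we get round to them.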

Missive00 commented 1 month ago

@thatgardnerone A rough guess for the total database size (including all the tables) would be 200 MB. But that doesn't include the speakers' thumbnails etc.

This is just based on taking an 86-video database and scaling it up to 20,000 entries; the database currently holds 10,700 videos.
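For what it's worth, the napkin maths behind that figure is linear extrapolation from the numbers above (thumbnails excluded, so treat it as order-of-magnitude only):

```python
estimated_total_mb = 200      # guessed size at 20,000 entries
entries_at_estimate = 20_000
current_videos = 10_700

per_entry_kb = estimated_total_mb * 1024 / entries_at_estimate  # ~10 KB/entry
current_size_mb = per_entry_kb * current_videos / 1024          # ~107 MB

print(f"~{per_entry_kb:.0f} KB per entry, ~{current_size_mb:.0f} MB today")
```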