TheExGenesis / community-archive

An open tweet database and API anyone can build on.
https://www.community-archive.org
MIT License
50 stars, 7 forks

Downloadable snapshots of the archive? #47

Closed by DefenderOfBasic 2 months ago

DefenderOfBasic commented 2 months ago

I want to run some silly analytics: for example, find the top 10 most-used words or phrases in my archive, then rank everyone else by how often they use those phrases, to help me find my soul mates.

I want to run this over all tweets without breaking the bank or spamming the Supabase DB with requests. What would be the best way to do this, and to let others do it too? It'd be nice to have something like a weekly snapshot that can be downloaded and used offline.
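To make the idea concrete, here is a minimal sketch of that analysis in Python. It assumes the tweets are already loaded in memory as plain strings, grouped by account; the function names and the tokenizer are made up for illustration, and it counts single words only (no phrases, no stop-word filtering).

```python
import re
from collections import Counter

# Very rough tokenizer: lowercase runs of letters, digits and apostrophes.
WORD_RE = re.compile(r"[a-z0-9']+")


def top_words(tweets, n=10):
    """Return the n most frequent words across an iterable of tweet texts."""
    counts = Counter()
    for text in tweets:
        counts.update(WORD_RE.findall(text.lower()))
    return [word for word, _ in counts.most_common(n)]


def rank_accounts(tweets_by_account, words):
    """Sort account names by how often their tweets use any of the given words."""
    wanted = set(words)
    scores = {}
    for account, tweets in tweets_by_account.items():
        scores[account] = sum(
            1
            for text in tweets
            for word in WORD_RE.findall(text.lower())
            if word in wanted
        )
    return sorted(scores, key=scores.get, reverse=True)
```

Usage would look like `rank_accounts(everyone_else, top_words(my_tweets))`, where both arguments come from whatever snapshot format ends up being published.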

DefenderOfBasic commented 2 months ago

easiest way might be a straight up dump of schema + data: https://supabase.com/docs/reference/cli/supabase-db-dump

Then we could have a few small scripts to post-process it, or just examples of basic queries. If someone only knew how to write Python or JS, they could take a function that lists all tweets from the DB dump and do whatever they want with it.
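A rough sketch of what such a "list all tweets" helper could look like in Python. It assumes a data dump written with COPY statements (pg_dump's text format, tab-separated, `\N` for NULL, `\.` ending each table). The table name `public.tweets` and the column names are assumptions, not the archive's confirmed schema.

```python
import re

# Header line that pg_dump writes before each table's rows in COPY format.
COPY_HEADER = re.compile(r"^COPY (\S+) \((.*)\) FROM stdin;$")
# Backslash escapes used by the COPY text format.
ESCAPES = {"b": "\b", "f": "\f", "n": "\n", "r": "\r", "t": "\t", "v": "\v", "\\": "\\"}


def _unescape(field):
    """Decode one COPY field; a bare \\N means NULL."""
    if field == r"\N":
        return None
    return re.sub(r"\\(.)", lambda m: ESCAPES.get(m.group(1), m.group(1)), field)


def iter_rows(lines, table="public.tweets"):
    """Yield one dict per row of `table` from an iterable of dump lines."""
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if columns is None:
            match = COPY_HEADER.match(line)
            if match and match.group(1) == table:
                columns = [c.strip().strip('"') for c in match.group(2).split(",")]
        elif line == r"\.":
            return  # end-of-data marker for this table
        else:
            yield dict(zip(columns, (_unescape(f) for f in line.split("\t"))))
```

Someone could then write `with open("data.sql", encoding="utf-8") as f: for row in iter_rows(f): ...` and never touch the live database. All values come back as strings (or `None`), so any type conversion is left to the caller.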

TheExGenesis commented 2 months ago

Yup, I feel like this would dovetail well with setting up the local dev environment, because you need it to do the dump.

https://github.com/open-birdsite-db/open-birdsite-db/issues/26

I also wonder whether it makes sense to run that process every time someone requests it, or whether there are more cost-effective ways of serving massive files like that, in which case we should have a job that dumps the DB and uploads the result to such a service.

DefenderOfBasic commented 2 months ago

Here's a Supabase doc I found about automating a backup with GitHub Actions. I wonder if the most sustainable option is a weekly GitHub Action that pushes the dump to an S3 bucket or something?

https://supabase.com/docs/guides/cli/github-action/backups
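As a sketch of what the weekly job might run, here is a small Python wrapper around the dump command. The flags (`--db-url`, `-f`, `--data-only`, `--use-copy`) are taken from the Supabase CLI as I understand it and should be checked against the installed version; the `snapshots/<year>-W<week>/` naming scheme is purely hypothetical.

```python
import datetime
import subprocess


def snapshot_name(day):
    """Build a per-week object name, e.g. snapshots/2024-W01/data.sql (hypothetical scheme)."""
    year, week, _ = day.isocalendar()
    return f"snapshots/{year}-W{week:02d}/data.sql"


def dump_command(db_url, out_path):
    """Argument list for a data-only dump written as COPY statements."""
    return [
        "supabase", "db", "dump",
        "--db-url", db_url,
        "-f", out_path,
        "--data-only",
        "--use-copy",
    ]


def run_dump(db_url, out_path="data.sql"):
    """Run the dump; raises CalledProcessError if the CLI exits non-zero."""
    subprocess.run(dump_command(db_url, out_path), check=True)
```

A scheduled workflow would call `run_dump(...)` with the connection string from a repository secret, then upload the file under `snapshot_name(datetime.date.today())`.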

TheExGenesis commented 2 months ago

That seems right. We could keep it in Supabase Storage too. It just seems wild to have the archives mirrored 3 times (tables, JSON, full dump), but I'm ok with dumb solutions.

TheExGenesis commented 2 months ago

oh I wonder if the github action will be okay with running for hours and holding GBs of data lol

DefenderOfBasic commented 2 months ago

I kind of want to try Cloudflare's R2 storage (S3-compatible, but cheaper). They have a very generous free tier:

https://developers.cloudflare.com/r2/pricing/

(I want to make it so anyone can download this without us having to pay a lot for it)
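Since R2 speaks the S3 API, the upload step could be sketched with `boto3` pointed at an R2 endpoint. This is a sketch under assumptions: `boto3` is a third-party package that would need installing, and the bucket name, account ID, and credentials are placeholders supplied by the caller.

```python
def r2_endpoint(account_id):
    """S3-compatible endpoint URL for a Cloudflare account's R2 storage."""
    return f"https://{account_id}.r2.cloudflarestorage.com"


def upload_snapshot(path, bucket, key, account_id, access_key_id, secret_access_key):
    """Upload a local dump file to an R2 bucket through the S3 API."""
    import boto3  # third-party; imported here so the helper above works without it

    client = boto3.client(
        "s3",
        endpoint_url=r2_endpoint(account_id),
        aws_access_key_id=access_key_id,
        aws_secret_access_key=secret_access_key,
        region_name="auto",
    )
    # upload_file switches to multipart uploads for large files.
    client.upload_file(path, bucket, key)
```

Making the bucket publicly readable, so anyone can download the snapshot, is a separate setting on the Cloudflare side and is not covered here.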

TheExGenesis commented 2 months ago

I mean, we're already paying for Supabase and using Supabase Storage, but I see R2 has a free tier.

TheExGenesis commented 2 months ago

The last thing we need to figure out is automating the dump and upload:

https://github.com/TheExGenesis/community-archive/issues/59