Proposal: store original data in object storage

DefenderOfBasic commented 1 month ago

Goals

We want to be able to host tweet data & make it available to users as cheaply as possible. The cheapest way to do this is for people to access the raw data directly off of the S3 bucket storage.

To Do

[ ] 1. Change upload of new archives to the following URL pattern

/storage/v1/object/public/archives/<user_id>/tweets.json

❓ right now it's FriedKielbasa_2024-08-27T03:47:21.000Z.json instead of tweets.json, but in the original archive it's just tweets.json, right? Since it's prefixed with user id we can keep the original filenames?

[ ] 2. Write to an index file a JSON of usernames & account IDs

/storage/v1/object/public/index.json

[ {id: 123, username: DefenderOfBasic, lastUpdated: <timestamp> }, ...]

❓ is this a good idea? it may be a bottleneck, multiple people uploading at the same time? we could skip this and just return it from a DB query

OR could update the index like once an hour or something so it's a single process.

OR, there is a way to list all files in a bucket in supabase?

[ ] 3. Document how to access (1) an individual user's tweets (2) the whole archive. It's very simple now, just loop over a list of account IDs, and make GET requests in your favorite language to whatever files you want.

This is also very nice because users can regularly fetch the archive/just get subsets of it. The timestamps in the index can help you know when data has updated so you can only fetch what is necessary?

[ ] 4. Back-update existing uploads on object storage. Just a script to copy things over, then once confirmed it's updated, delete the old folders.
[ ] 5. Delete the GitHub releases data dump so we don't forget about that

Optional

[ ] Have a server route that is given a username and returns account ID. Makes it a little easier to find a specific user.
[ ] Are the zip files themselves also on object storage? that might be faster for the purposes of someone wanting to download the entire archive

Worth noting: there is an option to do "requestor pays" model, where we don't even pay for hosting, this is if we REALLY needed to do it with no money. (don't think this is necessary now)

Old proposal

Instead of uploading everything directly from the client to a supabase/postgres DB, we could upload it (after stripping DM's & user's email) to cloudflare/s3. - this would solve how to make automated DB snapshots available (https://github.com/TheExGenesis/community-archive/issues/59) they'd already be available with a simple HTTP GET request / in your browser directly - can have a post upload job that processes the archives & converts them to flat files ``` s3://myuser/tweet_chunk.0 s3://myuser/tweet_chunk.1 ... s3://myuser/tweet_chunk.n ``` - extremely cheap to host, no DB/servers/compute required. Very cache & CDN friendly. Even for hundreds of gigabytes. (this is a pretty standard way to make public data available like, satellite imagery) - having the original data would also be nice in case use cases come up later for updating the DB schema or changing how we store it https://github.com/TheExGenesis/community-archive/issues/70 - could also allow users with deleted/banned accounts to upload this way (just create a directory for them like `s3://user-anon-upload`. Apps that want to give users more sophisticated functionality/search would take the archive and subscribe to changes/pull from the bucket (could support a webhook for this). This way the archive's DB usage would be super minimal (probably just an index of object storage urls/ids).

TheExGenesis commented 1 month ago

we keep stripped archives in supabase s3 storage

TheExGenesis commented 1 month ago

I like the idea of supporting this even simpler version of serving the archive - just wanna make clear that

we don't anticipate high db costs (like 25$/mo), and
that the db has a lot of utility, especially for querying across archives and using data from multiple archives to complete our overall picture. Stuff like tracking conversation_id, using text from liked_tweets and the in_reply_to properties of tweets to massively expand how many tweets / users we know about

DefenderOfBasic commented 1 month ago

@TheExGenesis i think for now if we just make the stripped archives we already have publicly accessible, I can try this thing of processing them & mirroring them on cloudflare / doing some minimal post-processing, and that can serve as the snapshot of the "raw" data. (even without the cloudflare part, just having the supabase s3 storage archives public satisfies the minimal requirement of giving the public access to the archive dataset)

Looks like you just need to (1) make the bucket public (2) get its id/url and document it in api-doc.md

https://supabase.com/docs/guides/storage/serving/downloads

DefenderOfBasic commented 1 month ago

this is the other reason I think it would be nice if the archive uploading & exporting pieces were even more minimal, to support use cases like this: https://github.com/TheExGenesis/community-archive/issues/73. The app could have a config mode where the DB is optional (but that may be unnecessarily complex)

DefenderOfBasic commented 1 month ago

I've just learned that the data is already publicly available in object storage! For reference, here's an example, all of someone's tweets in a big JSON:

https://fabxmporizzqflnftavs.supabase.co/storage/v1/object/public/archives/1133288553859887106/FriedKielbasa_2024-08-27T03:47:21.000Z.json

ri72miieop commented 1 month ago

I like this idea, but I have an issue with having my entire posting history accessible publicly in clear text. Like, I wouldn't want OpenAI/anthropic to scrape my archive.

Can we have some form of access-control over the files? If we can generate a key for each user that created an account on community-archive and use that to access the files I think that would be a nice solution.

DefenderOfBasic commented 1 month ago

@ri72miieop yeah making the data available only to the people you want vs fully public/no auth is something I've been thinking a lot about. Got some initial notes/discussion here: https://github.com/TheExGenesis/community-archive/issues/10. I've been thinking of it as: (1) they've already scraped our public data, they have these huge datasets internally, but we the public do not. so we're leveling the playing field for each other (2) the growth model could be clusters of self hosted archives so we can try different policies. One could be fully open, one could be accessible only to its members, invite only (and you'd be putting your trust into those organizers not to share it in the future). Maybe some of these clusters can merge as they build trust.

(3) alternatively, you could not share your data with anyone but still use all the analytics and tools, if the apps are remote storage compatible (https://remotestorage.io/), basically keep the data offline/on your own server and the apps pull from it.

(4) maybe you share derivatives of your data, not the full thing, like you keep the raw tweets but share the embeddings, or share your top tweets (see "filter tweets before uploading" https://github.com/TheExGenesis/community-archive/issues/14 )

DefenderOfBasic commented 1 month ago

@TheExGenesis this first one should be easy right? The "change the upload directory" one:

/storage/v1/object/public/archives/<user_id>/tweets.json

asking because I wanna try doing a "client-side search" so I can have real-time regex search on all my tweets, and I wanna do it in a way that it can be using data directly from the archive (but I don't know the object storage directly for my/arbitrary user's right now). This might be an easy initial milestone?

DefenderOfBasic commented 1 month ago

actually, the code looks like it should already be this way? except archiveId contains a timestamp?

https://github.com/TheExGenesis/community-archive/blob/f77d9c70b88d4260a4ab8f0625a88de19876955b/src/components/UploadTwitterArchive.tsx#L159-L160

i think we can get very far with just this. It's not hard to get account IDs, and if I know an account ID I can get this user's tweets, and I can write a super simple tutorial for even people that have very little coding experience to explore & visualize & build with this data (@Kubbaj this would be a great project for you, can be done in plain HTML/JS while you're learning)

TheExGenesis commented 4 weeks ago

@DefenderOfBasic should we close this issue and make a new one called something like "make storage schema more ergonomic"?

DefenderOfBasic commented 4 weeks ago

@TheExGenesis sure, yeah instead of a generic "make it more ergonomic" we should open specific issues for specific small tasks as needed.

I'm adding in one last note here, this suggestion by @brentbaum https://quickwit.io/

It looks like it's a way to "have our cake & eat it". If we didn't want to pay for & scale a DB to like, a billion tweets, this looks like a way to enable archive wide search while having the data be stored only in object storage (I assume that's what Brent meant about "help with scaling", like cost wise)

TheExGenesis / community-archive