ArthurHeitmann / arctic_shift

Making Reddit data accessible to researchers, moderators and everyone else. Interact with the data through large dumps, an API or web interface.
https://arctic-shift.photon-reddit.com

Database you use for API? #6

Closed · ruvilonix closed this issue 8 months ago

ruvilonix commented 8 months ago

I'm currently trying to put Reddit data into a database for local machine learning tasks. What database are you using to hold the data for the API? I'm trying to get ClickHouse set up because I need a good storage compression ratio, but I'm not sure if there's a better option.

ArthurHeitmann commented 8 months ago

Currently I'm using postgresql in combination with zst_blocks files. I've described it in a bit more detail here. I couldn't find any data on full solutions that would compress data well, whilst having good query performance. But I'm also not very experienced when it comes to databases, so I'm open to improvements.
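
To sketch the idea (simplified, not the exact schema: the column names and the assumption that a block is a standalone zstd chunk of newline-delimited JSON are just for illustration): Postgres holds the indexed columns plus a pointer into the append-only zst_blocks file, and a read only decompresses the one block that contains the requested row.

```python
import json
import psycopg2            # assumed Postgres driver; any client works
import zstandard as zstd   # assumed: pip install zstandard

def fetch_comment(comment_id: str) -> dict:
    """Indexed lookup in Postgres, then decompress a single block from the blocks file."""
    with psycopg2.connect("dbname=reddit") as conn, conn.cursor() as cur:
        cur.execute(
            "SELECT blocks_file, block_offset, block_length, row_index"
            " FROM comments WHERE id = %s",
            (comment_id,),
        )
        blocks_file, block_offset, block_length, row_index = cur.fetchone()
    with open(blocks_file, "rb") as fh:
        fh.seek(block_offset)
        # Illustrative block layout: one zstd chunk holding newline-delimited JSON rows.
        block = zstd.ZstdDecompressor().decompressobj().decompress(fh.read(block_length))
    return json.loads(block.splitlines()[row_index])
```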

The main questions really are:

  1. Do you want to store all of Reddit's data or just a subreddit? With all of Reddit you're dealing with about 20 billion posts + comments. A single subreddit might have at most a couple million.
  2. Do you want to keep the full original JSON objects or do you really only care about a couple of columns?
  3. Which columns do you want to query and how? It's pretty simple to make ID lookups if you have an index on that column. But it's a monumental challenge to query all reddit comments for a keyword and sort those results by date or score.
  4. Do you want to update data? And if so, how does your DB handle a string that is suddenly longer?

In case it helps you, you can take a look at some rough calculations I made here.

Now, in hindsight, there might be better solutions. The main drawback of mine right now is that updating JSON objects in the DB is not really possible. The data in zst_blocks files is pretty much set in stone: you can only append data, not update it. Initially that was not needed, but now that I'm doing a second data ingest, it's a bit of an issue.

Recently I had a new idea, in case I get the opportunity to set up my DB a second time. I haven't yet calculated whether it makes sense, but it mostly boils down to the fact that most of the storage space is used up by long text columns like "body"/"selftext", "title" and "url". They make up about 50-70% of the storage space in my current postgres DB. Since you don't need to store those twice, you can remove them from the JSON object. The idea is to keep the current postgres columns, but instead of storing the full JSON in a zst_blocks file, remove the above fields from the JSON object, encode it in a more efficient binary JSON format, compress each object individually, and store the result in each postgres row as a BLOB. When reading a row, you can reconstruct the original JSON from its columns and the JSON BLOB. Since compressing each object individually significantly worsens the compression ratio, the storage usage might be a bit higher, though you'd have to calculate that. But as a benefit, everything would be in one place and you'd still have the option to update the data.
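
As a rough, untested sketch of that pack/unpack step (I'm leaving out the binary JSON encoding here and just using plain JSON plus per-object zstd as a stand-in; the field names are from above):

```python
import json
import zstandard as zstd  # assumed; any per-object compressor would do

# These already live in their own Postgres columns, so they get stripped from the blob.
LONG_TEXT_FIELDS = {"body", "selftext", "title", "url"}

_compressor = zstd.ZstdCompressor(level=19)
_decompressor = zstd.ZstdDecompressor()

def pack_extra(obj: dict) -> bytes:
    """Everything except the long text columns, compressed for a BYTEA/BLOB column."""
    rest = {k: v for k, v in obj.items() if k not in LONG_TEXT_FIELDS}
    return _compressor.compress(json.dumps(rest).encode("utf-8"))

def unpack_row(columns: dict, blob: bytes) -> dict:
    """Reconstruct the original JSON object from the plain columns plus the blob."""
    obj = json.loads(_decompressor.decompress(blob))
    obj.update({k: columns[k] for k in LONG_TEXT_FIELDS if k in columns})
    return obj
```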

But if the answer to question 1 is only a subreddit, or if storage space isn't an issue, life would be a lot easier.

ruvilonix commented 8 months ago

I have no experience with databases other than MySQL, and even then I didn't really get into much depth of understanding, so this is mostly new stuff to me.

Here are the details of my use case, copied from where I asked on Stack Exchange (no answers yet):

I'm playing around with machine learning projects that use Reddit comments. I would like to download all Reddit comments and submissions (available as a torrent) and store them in a database on an external hard drive for easier and faster access. I believe there are at least tens of billions of comments, each with about a dozen columns of metadata.

What I need it to do:

  • Should be compressed. My external hard drive is 4 TB. The whole reddit dataset is 2 TB compressed as zstd, and much too large when uncompressed.
  • I'll occasionally request all comments or submissions that meet some criteria, in order to create a training dataset. For example, all comments from the subreddit r/funny. Or possibly more advanced queries, like semantic text search. This doesn't need to be super fast.
  • I'll occasionally request all comments in order for fasttext to make predictions. Then I'll need to store fasttext's predictions for each comment somewhere. Then I'll request all comments that fasttext scored greater than some value, in order to predict with a more sophisticated neural network. These predictions also need to be stored somewhere.

I'm not well versed in database design. I've only used MariaDB/MySQL for smaller databases. So I'm looking for something well documented and beginner friendly.

The three I've considered are MariaDB, ClickHouse, and PostgreSQL. Any insights on the best way forward?

What I decided to try, after some suggestions from ChatGPT, is MongoDB.

Your questions:

  1. This will be all of Reddit.
  2. I was going to save all the metadata, just in case I need random columns, since I had assumed most of the size is in the long text columns. But I might cut it down to only the ones I need.
  3. A query might fetch every single comment in batches, so I can run a text classifier model on them all, one by one. Then I'd need to select all comments with a classifier prediction over a certain threshold.
  4. The only updates I see happening are adding various predictions like the above. I might just make a separate table for that (a rough sketch is below). That, and adding new months as they come out, if I have room.
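
For point 4, since I'm trying MongoDB, the "separate table" would probably just be a separate collection keyed by comment id, roughly like this (untested; collection and field names are placeholders):

```python
from pymongo import ASCENDING, MongoClient

db = MongoClient()["reddit"]  # placeholder database name

# Model outputs live in their own collection, so the raw comments never change.
db.predictions.create_index([("comment_id", ASCENDING)], unique=True)
db.predictions.create_index([("fasttext_score", ASCENDING)])

def save_prediction(comment_id: str, score: float) -> None:
    db.predictions.update_one(
        {"comment_id": comment_id},
        {"$set": {"fasttext_score": score}},
        upsert=True,
    )

def high_scoring_ids(threshold: float):
    """Ids of comments the classifier scored above the threshold."""
    for doc in db.predictions.find({"fasttext_score": {"$gt": threshold}}):
        yield doc["comment_id"]
```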

The external HDD I just bought is 4TB. I was hoping I could get a database to compress it under that, but if not I'll just use the more recent months for now.

ArthurHeitmann commented 8 months ago

4TB might just work, if the compression works really, really well. With MongoDB you should probably start with some small-scale tests and see how good the compression ratio is.
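
For a quick test like that, something along these lines should be enough to compare ratios (untested; collection and database names are placeholders, and zstd block compression is set via a WiredTiger option):

```python
from pymongo import MongoClient

db = MongoClient()["reddit_test"]  # placeholder database name

# Collection whose WiredTiger blocks are compressed with zstd instead of the default snappy.
db.create_collection(
    "comments_zstd",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zstd"}},
)

# ... insert one month of comments into db.comments_zstd, then compare sizes:
stats = db.command("collStats", "comments_zstd")
print(stats["size"] / 2**20, "MiB logical")         # uncompressed data size
print(stats["storageSize"] / 2**20, "MiB on disk")  # size after block compression
```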

You have to consider that you will need more space than just the 2+ TB for the dataset. If you use indices, those can take up a couple hundred GB. And you have to think about how you will load the data into the DB. You need to have enough storage to first download the torrent and then in addition the space for the DB. Though while loading data into the DB, you can delete files you have already loaded and no longer need.

Or if you don't care about speed and only storage efficiency matters, stick with the .zst files, since they will (probably) give you the best compression possible. And then iterate with a script over the files, to extract the data you need.
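
A minimal version of such a script, assuming the zstandard Python package and that the monthly dumps are newline-delimited JSON compressed with a large zstd window (the file name and subreddit below are placeholders):

```python
import io
import json
import zstandard as zstd  # assumed: pip install zstandard

def matching_comments(path: str, subreddit: str):
    """Stream-decompress one monthly dump and yield only comments from one subreddit."""
    with open(path, "rb") as fh:
        # The dumps are compressed with a large window, hence max_window_size.
        reader = zstd.ZstdDecompressor(max_window_size=2**31).stream_reader(fh)
        for line in io.TextIOWrapper(reader, encoding="utf-8"):
            obj = json.loads(line)
            if obj.get("subreddit") == subreddit:
                yield obj

for comment in matching_comments("RC_2010-01.zst", "funny"):
    print(comment["id"], comment["body"][:80])
```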

And maybe think again about whether you really need all of Reddit's data for your purpose. If you think you're only interested in a couple subreddits, only download their data or extract those subreddits from the zst files.

ruvilonix commented 8 months ago

You need to have enough storage to first download the torrent and then in addition the space for the DB. Though while loading data into the DB, you can delete files you have already loaded and no longer need.

Yep, the plan was to download the entire torrent, which leaves 2TB of wiggle room. I'd insert one month, then delete the zst file. Then move on to the next.

Or if you don't care about speed and only storage efficiency matters, stick with the .zst files, since they will (probably) give you the best compression possible. And then iterate with a script over the files, to extract the data you need.

This is what I was doing before I had the hard drive, downloading a month at a time. Finding all comments from a given subreddit just took so long: about 2 hours per month to go through every line of a recent month using Watchful1's single_file.py. Multiprocessing multiple months at once seemed to be a good way to speed it up significantly, so that's an option, but also...

If I want to add a column, or foreign key, for a model prediction for every comment, I assume it would be more difficult if the comments were inside zst files.

And maybe think again about whether you really need all of Reddit's data for your purpose. If you think you're only interested in a couple subreddits, only download their data or extract those subreddits from the zst files.

I'm not yet sure about this. My main project is finding all comments and submissions where someone said they were treated or cured by some treatment. These kinds of comments can be anywhere across Reddit, and I don't want to limit myself to a few obvious subreddits like r/health or r/arthritis.

And I have ideas for other unrelated projects, like training generative AIs to talk like redditors from various subreddits. So I'd like to keep my options open.

Edit: I just did an initial insert test with RC_2010-01.zst. I only used some of the columns, so the size of the original file isn't really relevant, but it's 289 MB. Inserting every row with "none" compression makes a collection with a storage size of about 1 GB. Using "zstd" instead, the storage size is 426 MB. Not super great compared to what the zst files are getting.