reconstructing original threads from the dumps

stas00 commented 5 months ago

Arthur, do you by chance have some already written scripts that could take a sub-dump and reconstruct those into threads like they are presented originally?

Like for example with praw, I can do:

        r = praw.Reddit(...)
        submissions = r.subreddit(sub_name).new(limit=None)

        for submission in submissions:
            data = [submission.title, submission.selftext]
            submission.comments.replace_more(limit=None) # deals with MoreComments objects
            for comment in submission.comments.list(): # .list() gives a flattened iterator
                data.append(comment.body)

and voila I have the original thread, but unfortunately praw is limited by reddit API limits, so only recent data can be reconstructed this way.

In the dumps you kindly provided I'd have to reconstruct it somehow manually - probably loading it into some sort of database and then query it to fetch all the items with the same link_id and then construct the comments dependency tree.

But surely you have already thought of that, so asking if perhaps you already had something to take your amazing dumps one step further.

Thank you so much!

ArthurHeitmann commented 5 months ago

I don't have a solution that is ready to be used. Personally, I have a database of all the data. But depending on your use case, that might not be the best or easiest option.

For python/script only solutions, here are some ideas:

Depending your your memory, you can load 5 - 30 million comments into memory (if needed, memory usage can be optimized by only storing strings instead deserialized json). If your dataset is small enough, this will be the easiest option.
If memory is a constraint, split comments into individual files. The file name can be the link id. And then you keep appending to it. In second pass you can iterate over all files and process them individually. This approach can also be used for other problems where you have to organize or sort large amounts of data.

hanyidaxia commented 5 months ago

Arthur, amazing work! I am decompressing the data with zstandard package, the zst file after 2023 seems to be ok, but when I use zstandard to decompress the .zst file in a stream manner there is always an error jumped out: "zstd decompress error: Frame requires too much memory for decoding" no matter how I modify the stream chunk size, is there any specific reason you think could cause this issue?

stas00 commented 5 months ago

Thank you for the suggestions, Arthur. Let me see if I can make it work.

Closing for now as you answered my question.

@hanyidaxia, please open a new Issue at https://github.com/ArthurHeitmann/arctic_shift/issues

stas00 commented 5 months ago

OK, I wrote a few quick-n-dirty scripts to accomplish this - for posterity added it here if it's useful to others: https://github.com/stas00/reddit-to-threads - it's not the most efficient code/approach but it works fine for a few Reddit subs.

ArthurHeitmann / arctic_shift

reconstructing original threads from the dumps #18