ArthurHeitmann / arctic_shift

Making Reddit data accessible to researchers, moderators and everyone else. Interact with the data through large dumps, an API or web interface.
https://arctic-shift.photon-reddit.com

reconstructing original threads from the dumps #18

Closed · stas00 closed this issue 5 months ago

stas00 commented 5 months ago

Arthur, do you by any chance have some already-written scripts that could take a subreddit dump and reconstruct it into threads as they are presented on Reddit?

For example, with praw I can do:

    import praw

    r = praw.Reddit(...)  # credentials elided
    submissions = r.subreddit(sub_name).new(limit=None)

    for submission in submissions:
        data = [submission.title, submission.selftext]
        submission.comments.replace_more(limit=None)  # resolve MoreComments placeholders
        for comment in submission.comments.list():  # .list() gives a flattened iterator
            data.append(comment.body)

and voilà, I have the original thread. Unfortunately, praw is constrained by Reddit's API limits, so only recent data can be reconstructed this way.

In the dumps you kindly provided, I'd have to reconstruct threads manually: probably by loading the data into some sort of database, querying it for all items sharing the same link_id, and then building the comment dependency tree.
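Something like this is what I have in mind for the tree-construction step (a rough sketch, assuming each dump record is a dict carrying Reddit's usual `id`, `parent_id`, and `created_utc` fields):

    from collections import defaultdict

    def build_thread(submission, comments):
        # Group comments by their parent fullname. Reddit fullnames are
        # type-prefixed ids: "t3_<id>" for submissions, "t1_<id>" for comments.
        children = defaultdict(list)
        for c in comments:
            children[c["parent_id"]].append(c)

        def walk(parent, depth=0):
            # Depth-first traversal, oldest replies first
            for c in sorted(children[parent], key=lambda c: c["created_utc"]):
                yield depth, c
                yield from walk("t1_" + c["id"], depth + 1)

        # Top-level comments point at the submission's fullname
        return list(walk("t3_" + submission["id"]))

Flattening that result would give roughly what `submission.comments.list()` gives in praw, except each comment also comes with its nesting depth.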

But surely you have already thought of that, so I'm asking whether you already have something that takes your amazing dumps one step further.

Thank you so much!

ArthurHeitmann commented 5 months ago

I don't have a solution that is ready to be used. Personally, I have a database of all the data. But depending on your use case, that might not be the best or easiest option.

For Python/script-only solutions, here are some ideas:
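One such idea, as a minimal sketch: stream-decode a dump once and bucket comments by `link_id`, then rebuild each thread. This assumes the dumps are zstd-compressed, newline-delimited JSON; the file name is made up:

    import io
    import json
    from collections import defaultdict

    import zstandard

    def read_dump(path):
        # Stream one JSON object per line without decompressing the
        # whole file into memory; large dumps need a raised window cap.
        with open(path, "rb") as fh:
            dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
            reader = io.TextIOWrapper(dctx.stream_reader(fh), encoding="utf-8")
            for line in reader:
                yield json.loads(line)

    # Group every comment under its submission
    comments_by_link = defaultdict(list)
    for record in read_dump("some_subreddit_comments.zst"):
        comments_by_link[record["link_id"]].append(record)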

hanyidaxia commented 5 months ago

Arthur, amazing work! I am decompressing the data with the zstandard package. The .zst files from after 2023 seem to be fine, but when I decompress the older .zst files in a streaming manner, I always get this error: "zstd decompress error: Frame requires too much memory for decoding", no matter how I adjust the stream chunk size. Is there any specific reason you can think of that could cause this?
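(Note for anyone who hits this: the error usually means the frame was written with a larger back-reference window than the decoder permits by default, which is common for large dumps; no chunk size will work around it. With the `zstandard` package, the likely fix is raising `max_window_size`. A minimal sketch, with a made-up file name:)

    import zstandard

    # Allow frames with windows up to 2 GiB instead of the default cap
    dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
    with open("dump.zst", "rb") as fh:
        with dctx.stream_reader(fh) as reader:
            while chunk := reader.read(2**20):  # read 1 MiB at a time
                pass  # process the chunk here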

stas00 commented 5 months ago

Thank you for the suggestions, Arthur. Let me see if I can make it work.

Closing for now as you answered my question.

@hanyidaxia, please open a new Issue at https://github.com/ArthurHeitmann/arctic_shift/issues

stas00 commented 5 months ago

OK, I wrote a few quick-n-dirty scripts to accomplish this. For posterity, I'm adding the link here in case it's useful to others: https://github.com/stas00/reddit-to-threads. It's not the most efficient code or approach, but it works fine for a few Reddit subs.