Closed: stas00 closed this issue 5 months ago.
I don't have a solution that is ready to be used. Personally, I have a database of all the data. But depending on your use case, that might not be the best or easiest option.
For Python/script-only solutions, here are some ideas:
Arthur, amazing work! I am decompressing the data with the zstandard package. The .zst files from after 2023 seem to be OK, but when I decompress a .zst file in streaming mode there is always an error: "zstd decompress error: Frame requires too much memory for decoding", no matter how I change the stream chunk size. Is there any specific reason you think could cause this issue?
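(For what it's worth, a common cause of this particular error is that such dumps are compressed with a long zstd window, which the decompressor rejects unless its window limit is raised explicitly. A minimal sketch, assuming the Python zstandard package and newline-delimited JSON inside the archive; the filename below is a placeholder:)

```python
# Minimal sketch, assuming the python "zstandard" package. Dumps like these
# are often compressed with a long zstd window, so the decompressor needs an
# explicitly raised max_window_size; otherwise it fails with
# "Frame requires too much memory for decoding".
import io
import json
import zstandard

def read_dump(path):
    """Stream newline-delimited JSON records out of a .zst dump."""
    with open(path, "rb") as fh:
        dctx = zstandard.ZstdDecompressor(max_window_size=2**31)
        with dctx.stream_reader(fh) as reader:
            for line in io.TextIOWrapper(reader, encoding="utf-8"):
                yield json.loads(line)

# Hypothetical usage (placeholder filename):
# for record in read_dump("RC_2023-01.zst"):
#     print(record.get("id"))
```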
Thank you for the suggestions, Arthur. Let me see if I can make it work.
Closing for now as you answered my question.
@hanyidaxia, please open a new Issue at https://github.com/ArthurHeitmann/arctic_shift/issues
OK, I wrote a few quick-n-dirty scripts to accomplish this - adding them here for posterity in case they are useful to others: https://github.com/stas00/reddit-to-threads - it's not the most efficient code/approach, but it works fine for a few Reddit subs.
Arthur, do you by chance have any already-written scripts that could take a sub-dump and reconstruct it into threads as they are presented originally?
For example, with praw I can do something like the sketch below, and voila, I have the original thread. Unfortunately praw is limited by Reddit's API limits, so only recent data can be reconstructed this way.
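(Rough sketch rather than my exact code; the client credentials and the submission id are placeholders, the rest is praw's standard submission/comments API:)

```python
# Rough sketch using praw's standard API; the credentials and the
# submission id below are placeholders, not real values.
import praw

reddit = praw.Reddit(
    client_id="CLIENT_ID",
    client_secret="CLIENT_SECRET",
    user_agent="thread-reconstruction-example",
)

submission = reddit.submission(id="abc123")   # hypothetical submission id
submission.comments.replace_more(limit=None)  # expand all "load more comments" stubs

print(submission.title)
for comment in submission.comments.list():    # flattened comment forest
    print(comment.parent_id, comment.id, comment.body[:80])
```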
With the dumps you kindly provided, I'd have to reconstruct this somehow manually - probably by loading them into some sort of database, querying for all the items with the same link_id, and then building the comment dependency tree (see the sketch after this paragraph). But surely you have already thought of that, so I'm asking if perhaps you already have something to take your amazing dumps one step further.
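(A minimal sketch of what I mean - not Arthur's code. It assumes an already-decompressed comments dump in newline-delimited JSON, with the usual Reddit/Pushshift fields id, link_id, parent_id, and body; the file name and submission id are placeholders:)

```python
# Minimal sketch: rebuild one thread from an already-decompressed comments
# dump in NDJSON format. Field names (id, link_id, parent_id, body) are
# assumed to follow the usual Reddit/Pushshift schema.
import json
from collections import defaultdict

def build_thread(ndjson_path, link_id):
    """Group the comments of one submission and nest them by parent_id."""
    children = defaultdict(list)
    with open(ndjson_path, encoding="utf-8") as f:
        for line in f:
            c = json.loads(line)
            if c.get("link_id") == link_id:          # e.g. "t3_abc123"
                children[c["parent_id"]].append(c)   # parent is "t3_..." or "t1_..."

    def render(parent_fullname, depth=0):
        for c in children.get(parent_fullname, []):
            print("  " * depth + c.get("body", "")[:80])
            render("t1_" + c["id"], depth + 1)       # child comments point at t1_<id>

    render(link_id)  # top-level comments have parent_id == link_id

# Hypothetical usage (placeholder file and submission id):
# build_thread("comments.ndjson", "t3_abc123")
```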
Thank you so much!