Serene-Arc / bulk-downloader-for-reddit

Downloads and archives content from reddit
https://pypi.org/project/bdfr
GNU General Public License v3.0

[SITE] How to download "posts" which are links to comments? #892

Open germyparker opened 1 year ago

germyparker commented 1 year ago

First of all, I'm not sure how to submit this as a question, because I don't think it's a bug, and "[SITE]" seemed like the best option...?

This might be a weird question - but -

I'm trying to download an entire subreddit which consists only of links to comments in other subreddits. I'm hoping to get the single comment the link goes to (ideally actually the entire thread that follows, but beggars can't be choosers).

However, instead of getting an md file with the comment, I'm getting the content of the OP of the linked thread. Does that make sense?

Alternatively: the subreddit reposts comments by one specific user, so another option is to just download everything that user has ever said. This is sub-optimal for two reasons: (1) not every comment is useful or interesting, and the subreddit collects just the good ones; (2) after about 30 posts, I get the following error:

praw.exceptions.ClientException: This comment does not appear to be in the comment tree 

Here's the command I'm using:

bdfr archive --user PoppinKREAM --all-comments --file-scheme '{REDDITOR}_{SUBREDDIT}_{TITLE}_{POSTID}' ./output

and the full error:

Traceback (most recent call last):
  File "/usr/local/bin/bdfr", line 8, in <module>
    sys.exit(cli())
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/local/lib/python3.10/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/click/decorators.py", line 26, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/bdfr/__main__.py", line 139, in cli_archive
    reddit_archiver.download()
  File "/usr/local/lib/python3.10/site-packages/bdfr/archiver.py", line 49, in download
    self.write_entry(submission)
  File "/usr/local/lib/python3.10/site-packages/bdfr/archiver.py", line 92, in write_entry
    self._write_entry_json(archive_entry)
  File "/usr/local/lib/python3.10/site-packages/bdfr/archiver.py", line 103, in _write_entry_json
    content = json.dumps(entry.compile())
  File "/usr/local/lib/python3.10/site-packages/bdfr/archive_entry/comment_archive_entry.py", line 18, in compile
    self.source.refresh()
  File "/usr/local/lib/python3.10/site-packages/praw/models/reddit/comment.py", line 309, in refresh
    raise ClientException(self.MISSING_COMMENT_MESSAGE)
praw.exceptions.ClientException: This comment does not appear to be in the comment tree

Finally, I think this is the post it's failing on:

https://old.reddit.com/r/reddevils/comments/146eg1s/brandon_williams_rant_roudup/jnqnprn/

I'm using the latest version via pip, updated last week.

To reiterate: I would much prefer a solution to the initial problem, if there is one: how to download posts that are links to comments.

Fakeaccount12312 commented 1 year ago

What is the subreddit you originally tried to download, and which command did you use? I would like to try this myself. If you are talking about r/ShitPoppinKreamSays, bdfr fails to download anything there because the links are np.reddit.com links, and bdfr has no proper downloading module for those.

You could try scraping the log bdfr generates for these links, collecting them in a file, and downloading that with bdfr archive --include-id-file comments.txt --comment-context. See #835 for some inspiration on how I tried that method. Some kind of hacking is probably required. Also note that #851 could cause some issues here.

I check GitHub very infrequently, so a reply might take some time, but I hope my tips help somewhat!
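
As a rough illustration of the log-scraping idea, here is a minimal Python sketch. It assumes the log contains the full comment permalinks as plain text, which I have not verified against bdfr's actual log format. The function names (extract_comment_links, format_id_file) are made up for this example and are not part of bdfr. I'm also not sure whether --include-id-file expects submission IDs or comment IDs in this workflow, so the sketch keeps both and lets you pick.

```python
import re

# Matches reddit comment permalinks of the form
#   https://np.reddit.com/r/<subreddit>/comments/<submission_id>/<slug>/<comment_id>
# Plain submission links (no trailing comment ID) are deliberately not matched.
PERMALINK_RE = re.compile(
    r"https?://(?:www\.|old\.|np\.)?reddit\.com"
    r"/r/[^/\s]+/comments/([a-z0-9]+)/[^/\s]*/([a-z0-9]+)"
)


def extract_comment_links(text):
    """Return unique (submission_id, comment_id) pairs in order of appearance."""
    seen = set()
    pairs = []
    for match in PERMALINK_RE.finditer(text):
        pair = (match.group(1), match.group(2))
        if pair not in seen:
            seen.add(pair)
            pairs.append(pair)
    return pairs


def format_id_file(pairs, use_comment_ids=True):
    """Build the text of an ID file, one ID per line.

    Which kind of ID bdfr wants here is an assumption; flip
    use_comment_ids if the submission IDs turn out to be needed.
    """
    index = 1 if use_comment_ids else 0
    return "".join(pair[index] + "\n" for pair in pairs)


# Hypothetical usage (the log path is a placeholder, not a real bdfr default):
# with open("bdfr_log.txt", encoding="utf-8") as log:
#     pairs = extract_comment_links(log.read())
# with open("comments.txt", "w", encoding="utf-8") as out:
#     out.write(format_id_file(pairs))
```

The regex only looks at the URL shape, so it works the same for np., old. and www. links, and duplicates in the log are dropped before the file is written.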