aliparlakci / bulk-downloader-for-reddit

Downloads and archives content from reddit
https://pypi.org/project/bdfr
GNU General Public License v3.0

[QUESTION] Questions regarding concurrency #894

Closed mournfully closed 1 year ago

mournfully commented 1 year ago

Having one bdfr instance running with --comment-context was too slow, so I'm now running 4 instances concurrently. I'm wondering if that's too much, and if it's not, how much further could I scale this without getting IP-banned from Reddit?

The easiest way I could think of to get this working, while ensuring there wouldn't be any conflicts or inefficiencies caused by multiple instances downloading the same (large) post at the same time, was to remove already-attempted posts and duplicate entries, and then manually split the output into separate files.

It seems to be working well right now, but if someone could look through the script I made and let me know if there's a better way, that would be lovely.

I'm not sure how helpful this is, but here's the command I'm using and my directory structure.

$ python3 -m bdfr clone /data/bdfr --log ./LOGFILE --comment-context --include-id-file ./LINKS --no-dupes --search-existing -v

$ tree .
.
├── worker1
│   ├── LINKS
│   └── LOGFILE
├── worker2
│   ├── LINKS
│   └── LOGFILE
├── worker3
│   ├── LINKS
│   └── LOGFILE
└── worker4
    ├── LINKS
    └── LOGFILE
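
Roughly, the dedupe-and-split step boils down to something like this (a simplified sketch, not my actual script; the input file name all_ids.txt and the round-robin split are just placeholders, and removing already-attempted IDs would additionally need the old logs):

# Sketch: deduplicate a combined list of submission IDs and split it
# round-robin into the per-worker LINKS files shown in the tree above.
# "all_ids.txt" and the worker count are placeholders for illustration.
from pathlib import Path

WORKERS = 4

# Read IDs, dropping blank lines and duplicates while preserving order.
ids = list(dict.fromkeys(
    line.strip()
    for line in Path("all_ids.txt").read_text().splitlines()
    if line.strip()
))

# Distribute IDs round-robin across worker1..worker4 LINKS files.
for n in range(WORKERS):
    worker_dir = Path(f"worker{n + 1}")
    worker_dir.mkdir(exist_ok=True)
    (worker_dir / "LINKS").write_text("\n".join(ids[n::WORKERS]) + "\n")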
Serene-Arc commented 1 year ago

We can't answer that question. The BDFR doesn't use any concurrency, and that is partly by design. There are already cases where a long-running instance of the BDFR gets rate-limited or temporarily banned from sites such as Imgur when downloading a lot of files. Reddit doesn't have a specific limit right now, as far as I'm aware, but that will definitely be changing in a week or so, and right now I expect that they have measures in place to prevent DoS and other similar attacks, which the BDFR can appear superficially similar to.

Ultimately the BDFR is meant to be run as a single instance, and we don't explicitly support other ways of doing it. Part of the ethos is 'be a good internet citizen', which includes not slamming servers. My recommendation is to use the BDFR as a background tool and just leave it running while you go about your day. Unless there's a reason that you absolutely need that data ASAP?

mournfully commented 1 year ago

Yeah, that's fair. I was just curious whether anyone else had similar experiences. In any case, I won't be going over 4 concurrent instances. I suppose I'd like to get it over with before the API changes, but I mostly just wanted to ensure that even if one instance had a problem, I'd still get something by the time I checked up on it again.

Speaking of which, I actually did have one instance spend nearly an hour on a single submission. The post in question had nearly 13k comments, which makes it understandable. I was about to ask about it, but it seems you've already dealt with it in #791. Although I wonder if that could be expanded with an option to skip the comment section only once the comment count passes a high enough threshold.
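
To sketch what I mean, such a threshold could hypothetically even be checked outside the BDFR with PRAW, since a submission's comment count is available without fetching the whole tree (the site name and cutoff below are just placeholders, not an existing BDFR option):

# Sketch of the idea: check the comment count up front and only queue a
# submission for a full comment fetch when it's under a threshold.
# "bot" is a placeholder praw.ini site name; the cutoff is arbitrary.
import praw

MAX_COMMENTS = 5_000  # arbitrary cutoff

reddit = praw.Reddit("bot")

def worth_archiving(submission_id: str) -> bool:
    """Return True when the submission is under the comment-count cutoff."""
    return reddit.submission(id=submission_id).num_comments <= MAX_COMMENTS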

And apologies for reopening this issue. I just wanted to add some more details in case anyone else stumbles upon this through Google. Feel free to close this... again.

Serene-Arc commented 1 year ago

There is a PR or issue somewhere that suggests using a better method to get the comments, combining multiple API calls, but I haven't looked into it much and it definitely hasn't been implemented yet.
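
For context, the expensive part of fetching a big comment tree through PRAW (which the BDFR uses) is resolving the MoreComments placeholders, since each one costs at least one extra API call. A rough illustration, with placeholder names and a made-up submission ID:

# Illustration: why posts with ~13k comments are slow. replace_more()
# resolves MoreComments placeholders one batch per API call; limit=None
# resolves all of them, which can mean hundreds of sequential requests
# on a very large thread.
import praw

reddit = praw.Reddit("bot")                  # placeholder praw.ini site name
submission = reddit.submission(id="abc123")  # made-up submission ID
submission.comments.replace_more(limit=32)   # cap the number of extra calls
print(f"fetched {len(submission.comments.list())} comments")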