ArthurHeitmann / arctic_shift

Making Reddit data accessible to researchers, moderators and everyone else. Interact with the data through large dumps, an API or web interface.
https://arctic-shift.photon-reddit.com
234 stars 16 forks

Automating the download of a list of subreddits #13

Closed blas-ko closed 4 months ago

blas-ko commented 4 months ago

Dear Arthur,

First of all, thanks for developing this amazing tool. It has made academic research on Reddit way more straightforward.

I'm manually using your download tool to gather specific subreddits for an academic project. However, I have a list of ~50 subreddits for which I want to download all posts and comments from early 2022 to the current date. Is there a way to automate this through your API?

Many thanks in advance

ArthurHeitmann commented 4 months ago

You would have to write your own script (which will probably also be more reliable than keeping 50 tabs open). The API is documented here. The code that the website uses is here. You could adapt it to your needs. Though depending on the subreddit sizes, the API might not be the best choice, since it only returns up to 100 items per request. Some of the biggest subreddits have tens of millions of comments or more. Maybe take a look at this torrent. It goes up to 2023-12.
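A minimal Python sketch of such a script. This assumes a `/api/posts/search` endpoint taking `subreddit`, `after`, `before`, `limit`, and `sort` parameters and returning a JSON body of the form `{"data": [...]}` with `created_utc` on each item; check the linked API docs for the exact names before using it:

```python
import json
import time
import urllib.parse
import urllib.request

API = "https://arctic-shift.photon-reddit.com/api"

def next_after(items):
    """Advance the pagination cursor to the newest created_utc in this page,
    or return None when the page was empty (no more data). Note: if more
    than one page of items shares a timestamp, this sketch could skip some."""
    if not items:
        return None
    return max(int(item["created_utc"]) for item in items)

def fetch_subreddit_posts(subreddit, start_utc, end_utc):
    """Yield all archived posts for one subreddit, up to 100 per request."""
    after = start_utc
    while after is not None and after < end_utc:
        params = urllib.parse.urlencode({
            "subreddit": subreddit,
            "after": after,
            "before": end_utc,
            "limit": 100,
            "sort": "asc",
        })
        with urllib.request.urlopen(f"{API}/posts/search?{params}") as resp:
            items = json.load(resp)["data"]
        yield from items
        after = next_after(items)
        time.sleep(1)  # stay well below any aggressive request rate
```

Looping `fetch_subreddit_posts` over the list of ~50 subreddit names, and repeating with a `/api/comments/search` path for comments, would cover the whole list in one run.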

blas-ko commented 4 months ago

Thanks a ton, Arthur! The torrent you pointed me to is great, as some of the subreddits on my list are quite huge and have millions of members.

I've noticed, though, that when using your download tool, I'm able to download data up to the present. Does this mean it scrapes Reddit directly? I think I'll use the torrent for data up to late 2023 and then use your download tool to fill in the missing months.
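Combining the two sources means some items near the cutoff may appear in both the torrent dump and the API download. A small sketch of deduplicating the merged data by item id (field names `id` and `created_utc` are the usual Reddit ones; the first copy of each id is kept):

```python
def merge_by_id(*sources):
    """Merge iterables of Reddit items (dicts), keeping the first copy of
    each id, and return them sorted by creation time."""
    seen = {}
    for source in sources:
        for item in source:
            seen.setdefault(item["id"], item)
    return sorted(seen.values(), key=lambda i: int(i["created_utc"]))
```

For example, `merge_by_id(torrent_items, api_items)` prefers the torrent's copy wherever the two overlap.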

Again, thanks a lot!

blas-ko commented 4 months ago

Sorry for reopening the issue, but I was left wondering: is the download tool subject to the API's rate limits? Does that mean that for a fairly big subreddit, say, r/geopolitics with 600k members, it would only download a sample of the data in batches of 100 entries per request?

I ask because it's actually downloading data within the time range I'm indicating, but I'm not sure whether it's downloading everything.

ArthurHeitmann commented 4 months ago

The API has access to new posts and comments in real time, as they are archived. The only limitation is that fields like "score", "num_comments" and others are not up to date. The API does not have any official rate limits. Sequentially requesting a subreddit's data is relatively cheap. As long as you aren't making 100 requests per second, you should be fine.
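Since there are no official rate limits, a simple fixed delay between sequential requests is enough to keep the load modest. A minimal throttle helper (the one-second interval is an arbitrary polite choice, not a documented limit):

```python
import time

class Throttle:
    """Ensure at least `interval` seconds elapse between successive calls."""

    def __init__(self, interval=1.0):
        self.interval = interval
        self._last = 0.0  # monotonic time of the previous call

    def wait(self):
        # Sleep only for whatever part of the interval hasn't passed yet.
        remaining = self._last + self.interval - time.monotonic()
        if remaining > 0:
            time.sleep(remaining)
        self._last = time.monotonic()
```

Calling `throttle.wait()` immediately before each HTTP request keeps a long sequential download well under one request per second without slowing it down further when requests themselves are slow.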

blas-ko commented 4 months ago

Got it. Thanks for the swift responses. Closing now.

Many thanks again, Arthur!

ArthurHeitmann commented 4 months ago

Happy to help