UChicago-CCA-2021 / Frequently-Asked-Questions

Repository for course questions: please use the issues page to ask your questions.

Is there a way to crawl Reddit data for a longer time? #25

Open jinfei1125 opened 3 years ago

jinfei1125 commented 3 years ago

Hi! I plan to crawl Reddit data for my final project. But when I use the PRAW package, which wraps the Reddit API, I can only get ~1000 posts. For example, the sample size of the hot submissions is only 926 if I use the following code (even though I set limit=10000):

import praw

# Authenticate against the Reddit API.
reddit = praw.Reddit(client_id=client_id,
                     client_secret=client_secret,
                     user_agent=user_agent,
                     username=username,
                     password=password)

subred = reddit.subreddit("personalfinance")

# Each listing call returns a lazy generator; materialize it to count results.
# Reddit caps every listing at ~1000 items regardless of the limit argument.
hot = list(subred.hot(limit=10000))
new = list(subred.new(limit=10000))
controv = list(subred.controversial(limit=10000))
top = list(subred.top(limit=10000))

There are only 926 hot or new submissions, covering the period 2021-01-03 to 2021-01-28. I guess the number of articles under each category in the subreddit (hot, new, top, etc.) is capped at 1000.

P.S. I also tried to crawl the website directly using requests, but I get a 502 error. I guess Reddit refuses direct crawling and only accepts API access. But how can I use the API to get more data? Thanks in advance!
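For what it's worth, the 502 may come from Reddit rejecting the default python-requests User-Agent. A minimal sketch of hitting Reddit's public JSON listing with a descriptive User-Agent (the UA string here is a hypothetical placeholder, and this listing is still subject to the same ~1000-item cap as PRAW):

import requests

# A hedged sketch: Reddit often rejects the default python-requests
# User-Agent, so a descriptive one may avoid the 502/429 responses.
headers = {"User-Agent": "macs-final-project/0.1 (by u/your_username)"}  # hypothetical UA
resp = requests.get(
    "https://www.reddit.com/r/personalfinance/new.json",
    params={"limit": 100},  # Reddit serves at most 100 items per request
    headers=headers,
)
resp.raise_for_status()
posts = [child["data"] for child in resp.json()["data"]["children"]]
print(len(posts), posts[0]["title"])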

bhargavvader commented 3 years ago

Hey @jinfei1125 , we already spoke during office hours; here's a quick recap: Reddit only returns up to 1000 results per listing, so if you want more data, specifically archived data, you might want to check out https://www.reddit.com/r/pushshift/.

Here is another dev reddit thread which goes into this: https://www.reddit.com/r/redditdev/comments/8zhcmr/how_to_crawl_more_than_1000_posts_through_reddit/
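A minimal sketch of querying the Pushshift search endpoint directly with requests (the endpoint and the subreddit/size parameters follow the public Pushshift API; the service caps the number of rows returned per request):

import requests

# A hedged sketch of a single Pushshift query. Pushshift caps how many
# rows come back per request, so one call only gets the newest batch.
resp = requests.get(
    "https://api.pushshift.io/reddit/search/submission/",
    params={"subreddit": "personalfinance", "size": 100},
)
resp.raise_for_status()
rows = resp.json()["data"]
print(len(rows), rows[0]["created_utc"], rows[0]["title"])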

jinfei1125 commented 3 years ago

Hi @bhargavvader , thank you so much for your help this morning! I tried Pushshift just now, but it also returns at most 100 submissions/comments per request. For example, the URL "https://api.pushshift.io/reddit/search/submission/?subreddit=personalfinance" returns 25 rows by default, and even when I set the parameter size=10000, as its GitHub page suggests, it still only returns 100 rows. This is the result page: https://api.pushshift.io/reddit/search/submission/?subreddit=personalfinance&size=10000
Any suggestion to solve this? I also tried to set the date parameters before and after, but it still returns at most 100 results...

Thanks in advance!
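One common workaround, sketched below under the assumption that the cap is per request rather than per query: since size maxes out at 100, page backwards through time by passing the oldest created_utc of each batch as the next request's before parameter.

import time
import requests

# A hedged sketch of paging past the per-request cap: repeatedly pass
# the oldest created_utc seen so far as `before`, so each request
# returns the next-older batch of submissions.
def fetch_all(subreddit, max_rows=5000):
    url = "https://api.pushshift.io/reddit/search/submission/"
    rows, before = [], None
    while len(rows) < max_rows:
        params = {"subreddit": subreddit, "size": 100}
        if before is not None:
            params["before"] = before
        batch = requests.get(url, params=params).json()["data"]
        if not batch:
            break  # no older submissions left
        rows.extend(batch)
        before = min(r["created_utc"] for r in batch)
        time.sleep(1)  # be polite to the API
    return rows

posts = fetch_all("personalfinance")
print(len(posts))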

[screenshot: pushshift]
bhargavvader commented 3 years ago

Hey @jinfei1125 , I'm afraid I don't know off the top of my head... One possible suggestion: does Pushshift allow you to download JSON files or some kind of data dump between certain dates? That way you could use the downloaded data instead of making requests.
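Pushshift did publish monthly dump files (under https://files.pushshift.io/reddit/submissions/). A sketch of streaming one, assuming the RS_YYYY-MM.zst naming scheme and newline-delimited JSON contents:

import io
import json

import requests
import zstandard  # pip install zstandard

# A hedged sketch: stream a monthly Pushshift submissions dump and keep
# only r/personalfinance posts. The URL/naming scheme and the
# newline-delimited-JSON format are assumptions about the dump files;
# the files are tens of GB, so expect a long run.
url = "https://files.pushshift.io/reddit/submissions/RS_2019-08.zst"
dctx = zstandard.ZstdDecompressor(max_window_size=2**31)

kept = []
with requests.get(url, stream=True) as resp:
    reader = io.TextIOWrapper(dctx.stream_reader(resp.raw), encoding="utf-8")
    for line in reader:
        post = json.loads(line)
        if post.get("subreddit") == "personalfinance":
            kept.append(post)
print(len(kept))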

jinfei1125 commented 3 years ago

@bhargavvader Hi Bhargav, thank you so much! Sorry I didn't reply in time. I'm still trying to work with Pushshift, but today I found that Google Cloud BigQuery seems to have Reddit data before 2020, as this Reddit post mentions:

Try this link: https://console.cloud.google.com/bigquery?utm_source=bqui&utm_medium=link&utm_campaign=classic&p=fh-bigquery&page=project&pli=1 There are several Reddit datasets under fh-bigquery. FH is /u/fhoffa, who used to be the BigQuery advocate here, but I think he switched jobs.

I am also looking at this post: https://pushshift.io/using-bigquery-with-reddit-data/ but I'm still working through it!

jinfei1125 commented 3 years ago

It turns out I solved it! The data isn't very up to date (the latest month is 2019-08), but I can use SQL to extract data from 2015 to 2019. I feel good about it!
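For reference, a sketch of the kind of query this involves, using the google-cloud-bigquery Python client; the fh-bigquery.reddit_posts.2019_08 table name is an assumption based on that dataset's one-table-per-month layout:

from google.cloud import bigquery  # pip install google-cloud-bigquery

# A hedged sketch: pull r/personalfinance submissions from one of the
# fh-bigquery Reddit tables. The monthly table naming (e.g.
# reddit_posts.2019_08) is an assumption about that public dataset.
client = bigquery.Client()
query = """
    SELECT created_utc, title, selftext, score
    FROM `fh-bigquery.reddit_posts.2019_08`
    WHERE subreddit = 'personalfinance'
    ORDER BY score DESC
    LIMIT 1000
"""
df = client.query(query).to_dataframe()
print(df.shape)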

[screenshot: query]
bhargavvader commented 3 years ago

Yay! I know this has worked with all the PF and WSB datasets you've been making. :) @jinfei1125