NSFWUTILS / RedditScrape

Quick and dirty script to suck down the pr0n from Reddit before it's too late

RedditScrape

With the news that Reddit will soon be charging for API access, and that Imgur is no longer hosting NSFW media, I wanted to create something to help download things while the getting is good. I've tested this on Mac and Ubuntu with Python 3.9.7.

The code is designed to run concurrently (a pool of worker threads) to help with performance.
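If you're curious what that looks like, here's a minimal sketch of the thread-pool pattern the scripts lean on (the function and URLs are placeholders, not the scripts' actual internals):

```python
# Sketch of the worker-pool pattern; download_one() is a placeholder.
from concurrent.futures import ThreadPoolExecutor, as_completed

def download_one(url):
    # fetch and save the file here
    return url

urls = ["https://example.com/a.jpg", "https://example.com/b.jpg"]

with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(download_one, u) for u in urls]
    for fut in as_completed(futures):
        print("finished", fut.result())
```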

And a big thanks to the creator(s) of gallery-dl, as it makes it easy to support imgur.com, redgifs.com, and gfycat.com.

Major Changes

Thanks to the contributions of a fellow anonymous redditor, we now pull data from Pushshift. This has a lot of advantages, mainly no need for Reddit credentials, but also being able to suck down as much as you want for a given subreddit. Seriously, I have over 3.3 MILLION entries in my JSON for gonewild and it's still going.
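For reference, here's roughly the kind of query Pushshift supports. This is only a sketch of the public endpoint as I understand it, not the script's actual request code:

```python
# Rough sketch of asking Pushshift for a subreddit's submissions.
# Endpoint and parameters are my assumption of the public API, not
# the script's exact code; service availability may vary.
import requests

def fetch_submissions(subreddit, before=None, size=100):
    params = {"subreddit": subreddit, "size": size, "sort": "desc"}
    if before is not None:
        params["before"] = before  # paginate backwards by created_utc
    resp = requests.get(
        "https://api.pushshift.io/reddit/search/submission",
        params=params,
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json().get("data", [])

posts = fetch_submissions("eyebleach")
print(len(posts), "posts fetched")
```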

Getting Started

The following sections should help get you started.

Setting up the script

Running the script

Running this script is now a two-step process.

Download data from Pushshift

To begin this process, run `python threaded-acquire.py`. If you have your subs file set up properly, it will begin to query Pushshift for all of the posts ever made to each sub. These will be stored in MEDIA_FOLDER/json.

Warning: Some of the more popular subs, like gonewild, are absolutely huge. I've been working on downloading it for 2 days now. The JSON text alone is currently 9 gigs and climbing. I recommend you skip this one for now and come back to it later.

This script is designed to run with 4 threads, so it should start trying to download data from 4 subs at once. When it's done, you can move on to the next script.
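As a rough picture of what this stage does, here's a hypothetical sketch. The subs file format, file names, and Pushshift call are assumptions, not threaded-acquire.py's actual internals:

```python
# Hypothetical sketch of the acquisition stage: read a list of subreddits
# and fetch each one's submissions on 4 worker threads, saving the results
# under MEDIA_FOLDER/json. Names and file layout are assumptions.
import json
import os
from concurrent.futures import ThreadPoolExecutor

MEDIA_FOLDER = "media"  # assumed config value

def fetch_all_posts(subreddit):
    # Placeholder: in practice this is the paginated Pushshift query
    # (see the earlier sketch).
    return []

def acquire_sub(subreddit):
    posts = fetch_all_posts(subreddit)
    out_dir = os.path.join(MEDIA_FOLDER, "json")
    os.makedirs(out_dir, exist_ok=True)
    with open(os.path.join(out_dir, subreddit + ".json"), "w") as fh:
        json.dump(posts, fh)

# Assuming one subreddit name per line in the subs file.
with open("subs.txt") as fh:
    subs = [line.strip() for line in fh if line.strip()]

# 4 workers, matching the "4 subs at once" behaviour described above.
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(acquire_sub, subs))
```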

Downloading the Media

Now that you've downloaded the JSON data from Pushshift, you can start the actual download process by running `python json-crawler.py`. This will iterate through all of your JSON files to build one massive list of things to download. It will then thread this process (based on what you have for MAX_WORKERS in your config file) to start downloading as fast as it can.
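Roughly speaking, the download stage looks something like this. It's a sketch only; the real json-crawler.py is more involved, and the JSON field assumed here ("url" on each post) may not match exactly:

```python
# Hypothetical sketch of the download stage: collect URLs from the saved
# Pushshift JSON and hand each one to gallery-dl on a pool of worker threads.
# MEDIA_FOLDER and MAX_WORKERS mirror the config values mentioned above.
import glob
import json
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor

MEDIA_FOLDER = "media"  # assumed config value
MAX_WORKERS = 8         # assumed config value

def collect_urls():
    urls = []
    for path in glob.glob(os.path.join(MEDIA_FOLDER, "json", "*.json")):
        with open(path) as fh:
            for post in json.load(fh):
                if post.get("url"):
                    urls.append(post["url"])
    return urls

def download(url):
    # gallery-dl handles imgur/redgifs/gfycat links; a plain invocation here,
    # the actual script may pass extra options.
    subprocess.run(["gallery-dl", url], check=False)

with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    list(pool.map(download, collect_urls()))
```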

Running in the background

If you want to make sure this keeps running in case you get disconnected or something, you can run it in the background with something like `nohup python json-crawler.py > nohup.log &`. You can monitor its status using `jobs`, and use `kill %1` if you need to cancel it.

Stopping the script

IMPORTANT: This script runs in a threaded manner, so you can't just ctrl+c your way out. If you want to cancel it, you'll have to use ctrl+z first; then, once it has stopped, you can run `kill %1`.

Examining and Troubleshooting

The script will download files into 3 folders underneath MEDIA_FOLDER:

Troubleshooting

There's not a whole lot of logging for threaded-acquire.py - the original author did a tremendous job making it robust (seriously, very well done).

For json-crawler.py, I output everything into output_log.txt. If things aren't working, that is the first place you should look.

Steps to take

If things aren't working, please verify the following:

Downloading JSON data (original instructions from an anonymous contributor)

Information about submissions (including links, upvotes, author, etc.) can be saved for further local processing. To save local subreddit data from the Pushshift archives, run a command like this:

`python ./acquire_sub_posts_json.py --subreddit eyebleach`

This will acquire submission data about every archived post in the "eyebleach" subreddit. The script will hang for a while every now and again, as the Pushshift server is unreliable. I recommend doing an initial test on a smaller subreddit, like "girlskissing".
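The hang-and-recover behaviour is basically retries with a backoff when the server times out. Here's a minimal sketch of that pattern (the real script's retry logic may differ):

```python
# Minimal retry-with-backoff sketch for an unreliable endpoint. This only
# illustrates why the process appears to hang and then resume; the actual
# script's error handling may differ.
import time
import requests

def get_with_retries(url, params, max_tries=5):
    for attempt in range(1, max_tries + 1):
        try:
            resp = requests.get(url, params=params, timeout=30)
            resp.raise_for_status()
            return resp.json()
        except (requests.RequestException, ValueError):
            if attempt == max_tries:
                raise
            time.sleep(2 ** attempt)  # back off: 2s, 4s, 8s, ...
```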

Scripts for doing further processing on this data, and downloading images, should be added later...

To Do

A few things I'd like to do if time permits...