NSchrading / redditDataExtractor

The reddit Data Extractor is a cross-platform GUI tool for downloading almost any content posted to reddit. Downloads from specific users, specific subreddits, users by subreddit, and with filters on the content is supported. Some intelligence is built in to attempt to avoid downloading duplicate external content.
GNU General Public License v3.0
229 stars 29 forks source link

Note: This project is no longer supported. The frozen executable releases may continue to work; however, changes to the reddit API have rendered the current version built from source obsolete. You are welcome to fork and make improvements if you wish.

Requirements

You can download a standalone executable (for Windows or Linux 64-bit) if you don't want to compile from source here: https://github.com/NSchrading/redditDataExtractor/releases

The reddit Data Extractor

Main application

The reddit Data Extractor is a cross-platform GUI tool for downloading almost any content posted to reddit. Downloads from specific users, specific subreddits, users by subreddit, and with filters on the content is supported. Some intelligence is built in to attempt to avoid downloading duplicate external content.

For data scientists and curious redditors, JSON-encoded text files can be extracted for analysis. These contain all the characteristics of reddit submissions, including the title, a hierarchy of comments, the score, selftext, creation date, and more. Below is a snippet of some of the attributes that are retrieved.

JSON encoded submission data

For redditors who want to easily retrieve submitted images, gifs, webms, and videos, or for people interested in training machine learning applications that link semantic content (submission data) with images, the reddit data extractor supports you as well. Imgur*, Gfycat, Vidble, and Minus are specifically supported, and any direct link to an image should work as well. In addition, any video site supported by the youtube-dl project is also supported (http://rg3.github.io/youtube-dl/supportedsites.html).

* Imgur requires a client-id for non-direct links - see Site specific notes for more details.

External content can be downloaded from submission links, comments, and selftext.

Filters can be set to download only those submissions, or those submission's external content, passing the filter criteria.

Here are all the settings available:

Settings

Right clicking on a user or subreddit in the user or subreddit list that has already had content downloaded for them and selecting 'view downloaded posts' will display a preview of the downloaded content, broken up by JSON-encoded submission content, external submission links, comment links, and selftext links.

Downloaded content gui

Clicking on the display images will take you to the reddit submission from which the content was downloaded. Right clicking the submission (to the right of the image) will allow you to delete the content. Deselecting the 'restrict retrieved submissions to creation dates after the last downloaded submission' checkbox in the settings will then allow you to redownload the content. Or if you never want to download that content again, you can select 'delete content and never download again' which will blacklist it.

Add/Remove Users/Subreddits by clicking the +/- buttons or selecting File->New. Edit their names by double clicking on the "new list item" entry in the lists. Right click on the user/subreddit list dropdown or hit File->New to make new user or subreddit lists to download from. In this way you can organize users or subreddits to download from so that your downloads go faster if you only want to download from a specific subset.

The reddit data extractor will remember your settings and the users and subreddits you downloaded from, as long as you save before exiting. If you wish to delete a save, remove the 3 files under RedditDataExtractor/saves called settings.db.bak, settings.db.dat, and settings.db.dir.

If you remove a user or subreddit from the list, the program will lose all memory of having downloaded content from them, allowing you to redownload everything again. Removing them will not delete their already downloaded content from your computer, however.

Hit View->Remaining Imgur Requests (if you have set an Imgur client-id) to see how many requests you have left to make this day. Do not exceed this limit or you may be rate-limited or banned by Imgur.

The following file types are supported:
jpg, png, gif, webm
Any video file type supported by the youtube-dl project
Site specific notes:
Imgur page, gallery, and album links will only be downloadable if you obtain an Imgur API client-id and enter it into the reddit data extractor. You can obtain a client-id here: https://api.imgur.com/oauth2/addclient.
Suggested data to enter:
Application name: reddit Data Extractor
Authorization type: Anonymous usage without user authorization
Authorization callback URL: https://www.google.com (or any valid URL -- it doesn't matter for anonymous usage)
Application website: https://github.com/NSchrading/redditDataExtractor
Email: Your email address to send the client-id to.
Minus galleries are not currently supported - only page and direct links.

Running the reddit Data Extractor

If you downloaded the executable files (found here: https://github.com/NSchrading/redditDataExtractor/releases), open the folder that was downloaded. Inside you will see a bunch of .dll files or .so files. Also included in the mess of files is the executable:

Executable

On windows it is called redditDataExtractor.exe. On Linux it is called redditDataExtractor. On linux, run it by typing in the console:

    ./redditDataExtractor

On windows, simply double click the .exe.

Installing the reddit Data Extractor from Source

The reddit Data Extractor has been tested and is working for 64-bit versions of both Windows 8 and Linux Mint 16. Precompiled versions of the program are available for download if you don't want to go through the arduous process of installing PyQt. These are found under the releases section: https://github.com/NSchrading/redditDataExtractor/releases

If you do wish to run the program from source, below are the steps I took to run them on Windows and Linux. A slightly different process may be required for versions of linux not based off of Ubuntu. In general, Python 3.4, PyQt4, PRAW, BeautifulSoup4, Requests, and youtube-dl are required.

Linux
[sudo] apt-get install libxext-dev python-qt4 qt4-dev-tools libqt4-dev libqt4-core libqt4-gui build-essential

Download SIP: http://sourceforge.net/projects/pyqt/files/sip/sip-4.16.2/sip-4.16.2.tar.gz/

Download PyQt: http://sourceforge.net/projects/pyqt/files/PyQt4/PyQt-4.11.1/PyQt-x11-gpl-4.11.1.tar.gz/

Extract both tarballs

cd sip-4.16.2
python configure.py
make
[sudo] make install

cd PyQt-x11-gpl-4.11.1
python configure.py
make
[sudo] make install
Warning: Compiling / linking PyQt will take a while.

When that's done:

[sudo] pip install -r requirements.txt

~OR~

[sudo] pip install praw
[sudo] pip install requests
[sudo] pip install beautifulsoup4
[sudo] pip install youtube-dl

And now you should be ready to run the reddit Data Extractor! main.py is where all the magic starts.

python main.py
Windows

Download Python 3.4 for 64-bit or 32-bit systems: https://www.python.org/downloads/release/python-340/

Download the PyQt4 64-bit installer built for python 3.4: http://sourceforge.net/projects/pyqt/files/PyQt4/PyQt-4.11.1/PyQt4-4.11.1-gpl-Py3.4-Qt4.8.6-x64.exe

Or the PyQt4 32-bit installer built for python 3.4: http://sourceforge.net/projects/pyqt/files/PyQt4/PyQt-4.11.1/PyQt4-4.11.1-gpl-Py3.4-Qt4.8.6-x32.exe

Run the installer.

pip install -r requirements.txt

~OR~

pip install praw
pip install requests
pip install beautifulsoup4
pip install youtube-dl

And now you should be ready to run the reddit Data Extractor! main.py is where all the magic starts.

python main.py