fake-name / xA-Scraper

xA-Scraper

This is an automated tool for scraping content from a number of art sites:

To Add:

Decrepit:

Checked so far:

Todo:

It has also grown a lot of other functions over time, including a fairly complex, interactive web interface for browsing the local gallery mirrors.

Dependencies:

The backend can either use a local sqlite database (which has poor performance, particularly when cold, but is very easy to set up), or a full postgresql instance.

Configuration is done via a file named settings.py which must be placed in the repository root. settings.base.py is an example config to work from. In general, you will probably want to copy settings.base.py to settings.py, and then add your various usernames/password/database-config.

DB Backend is selected via the USE_POSTGRESQL parameter in settings.py.

If using PostgreSQL, DB setup is left to the user. xA-Scraper requires its own database, and the ability to make IP-based connections to the hosting PG instance. The connection information, DB name, and client name must be set in settings.py.

When using sqlite, you just have to specify the path to where you want the sqlite db to be located (or you can use the default, which is ./sqlite_db.db).
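As a rough illustration of the backend switch described above, here is a minimal sketch of what the relevant part of a settings.py might look like. Only USE_POSTGRESQL and the default path ./sqlite_db.db come from this README; every other name (SQLITE_DB_PATH, PG_HOST, PG_DB_NAME, PG_USER, PG_PASSWORD, database_url) is a hypothetical placeholder, so check settings.base.py for the real keys.

```python
# Hypothetical settings.py fragment. Only USE_POSTGRESQL is named in this
# README; all other names are assumptions made for illustration.
USE_POSTGRESQL = False

# Assumed key name. The default path is the one the README mentions.
SQLITE_DB_PATH = "./sqlite_db.db"

# Assumed key names, only relevant when USE_POSTGRESQL is True.
PG_HOST = "127.0.0.1"      # PG instance must accept IP-based connections
PG_DB_NAME = "xascraper"   # a database dedicated to the scraper
PG_USER = "xascraper"
PG_PASSWORD = "change-me"


def database_url(use_postgresql=USE_POSTGRESQL):
    """Illustrative helper: turn the values above into a URL-style string."""
    if use_postgresql:
        return f"postgresql://{PG_USER}:{PG_PASSWORD}@{PG_HOST}/{PG_DB_NAME}"
    return f"sqlite:///{SQLITE_DB_PATH}"
```

The point of the sketch is only that a single boolean picks the backend, and that the sqlite case needs nothing beyond a file path.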

settings.py is also where the login information for the various plugins goes.

Disabling of select plugins can be accomplished by commenting out the appropriate line in main.py. The JOBS list dictates the various scheduled scraper tasks that are placed into the scheduling system.
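To make the "comment out a line" approach concrete, here is a small sketch. The actual layout of JOBS in main.py is not shown in this README, so the tuple shape, the scrape_site_* functions, and the scheduled_job_names helper are all hypothetical stand-ins.

```python
# Illustrative stand-ins. The real plugin entry points and the actual shape
# of JOBS in main.py are not shown in this README and will differ.
def scrape_site_a():
    return "site_a done"


def scrape_site_b():
    return "site_b done"


def scrape_site_c():
    return "site_c done"


# Assumed layout for each entry: (callable, job name, interval in minutes).
JOBS = [
    (scrape_site_a, "site_a", 60),
    # (scrape_site_b, "site_b", 60),  # commented out, so it is never scheduled
    (scrape_site_c, "site_c", 120),
]


def scheduled_job_names(jobs):
    """Return the names of the jobs that would be handed to the scheduler."""
    return [name for _func, name, _interval in jobs]
```

With the middle entry commented out, only site_a and site_c remain in the list that the scheduler would see.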

The preferred bootstrap method is to use run_scraper.sh from the repository root. It will ensure the required packages are available (build-essential, libxml2, libxslt1-dev, python3-dev, libz-dev), and then install all the required python modules in a local virtualenv. Additionally, it checks if the virtualenv is present, so once it's created, ./run_scraper.sh will just source the venv and run the scraper without any reinstallation.

To run the web UI (which handles adding names to scrape, viewing fetched files, etc...), run run_web.sh. The expected use is to have both run_scraper.sh and run_web.sh executed as daemons.

Some aspects still need work. The artist selection system is currently a bit broken: you can add or modify artists, but there isn't a clean way to remove them from the scrape list.

Notes:


Anyways, Pictures!

These are a few DeviantArt Artists culled from the Reddit ImaginaryLandscapes subreddit.

The web-interface has a lot of fancy mouseover preview stuff. Since this is primarily intended to run off a local network, bandwidth concerns are not too relevant, and I went a bit nuts with jQuery.

Basic Popups

There is also a somewhat experimental "gallery slice" viewing system, where horizontal mouse movement seeks through a spaced sub-set of each artist's images. The artist is determined by the row, and each horizontal 10 pixels is a different image.
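The mapping from mouse position to image can be sketched as follows. This is not the project's code (the real interface is built with jQuery); it is a Python illustration of the idea, assuming the spaced sub-set is chosen by even sampling. The function names, the row_width parameter, and the sampling strategy are all assumptions.

```python
def spaced_subset(images, slots):
    """Pick up to `slots` evenly spaced items from `images` (assumed strategy)."""
    if slots <= 0 or not images:
        return []
    if len(images) <= slots:
        return list(images)
    step = len(images) / slots
    return [images[int(i * step)] for i in range(slots)]


def image_for_mouse_x(images, mouse_x, row_width, px_per_image=10):
    """Return the image shown when the cursor is `mouse_x` pixels into a row.

    Each run of `px_per_image` horizontal pixels maps to one image from the
    spaced sub-set; positions outside the row are clamped to its ends.
    """
    subset = spaced_subset(images, row_width // px_per_image)
    if not subset:
        return None
    index = max(mouse_x, 0) // px_per_image
    return subset[min(index, len(subset) - 1)]
```

For example, with 100 images and a 200 pixel wide row, there are 20 slots, so every fifth image is used and moving the mouse 10 pixels to the right advances the preview by one slot.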

Fancy Popups

Lastly, there is also a basic, chronological view of each artist's work, though it does support infinite-scrolling for their entire gallery. The scraper also preserves the description that accompanies each item, and it is presented with the corresponding image.