CivicActions / edscrapers

US Department of Education Data Scraping Kit; see https://us-ed-scraping.ckan.io/dataset
GNU Affero General Public License v3.0

U.S. Department of Education scraping kit

NOTE: More specific documentation is available "on the spot", in the package and subpackage directories (e.g. edscrapers/scrapers or edscrapers/transformers).

Running the tool

Clone this repo using git clone.

Change directory into the directory created/cloned for this repo.

From within the repo directory, run pip install -r requirements.txt to install all package dependencies required to run the toolkit.

The ED_OUTPUT_PATH environment variable must be set before running; if it is not set, the tool exits with a fatal error. ED_OUTPUT_PATH sets the path to the directory where all output generated by this kit will be stored. The path specified must already exist.
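The requirement above can be checked before a run with a short pre-flight script. This is a minimal sketch, not the toolkit's own code: the function name resolve_output_path and its error messages are hypothetical, and only the ED_OUTPUT_PATH variable name comes from this README.

```python
import os
from pathlib import Path


def resolve_output_path(env=None):
    """Return the directory named by ED_OUTPUT_PATH, or raise RuntimeError.

    Hypothetical helper for illustration; the toolkit performs its own check.
    """
    env = os.environ if env is None else env
    value = env.get("ED_OUTPUT_PATH")
    if not value:
        # Mirrors the README: a missing variable is a fatal error.
        raise RuntimeError("ED_OUTPUT_PATH is not set")
    path = Path(value)
    if not path.is_dir():
        # Mirrors the README: the path specified must exist.
        raise RuntimeError(f"ED_OUTPUT_PATH is not an existing directory: {path}")
    return path


if __name__ == "__main__":
    try:
        print(f"Output will be written to {resolve_output_path()}")
    except RuntimeError as error:
        raise SystemExit(f"fatal: {error}")
```

Running the script with the variable unset, or pointing at a directory that does not exist, stops with a "fatal:" message instead of starting a run.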

If GNU Make is available in your environment, you can run the command make install. Alternatively, run python setup.py install.

After installing, run the eds command in a command line prompt.

Containerization of Scraping Toolkit - Docker Image

If you would like to run this toolkit in a container environment, we have packaged it as a Docker image. Run docker build . in the root directory of this cloned repo. This builds an image of the scraping toolkit from the Dockerfile.

ED Scrapers Command Line Interface

To get more info on the usage of the ED Scrapers Command Line Interface (eds), read the eds cli docs.

Architectural Design

To get more info on the architectural design/approach for the scraping toolkit, read the architectural design doc.

Terminology

Scrapers

Scrapers are Scrapy-powered scripts that crawl through links and parse HTML pages. The proposed structure is:

Transformers

Transformers are independent scripts that take an input and return it filtered and/or restructured. They are meant to complement the work done by scrapers by taking their output and making it usable for various applications (e.g. the CKAN harvester).
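To make the filter-and-restructure idea concrete, here is a minimal sketch of what a transformer might look like. It is not one of the kit's transformers: the function name transform and every field name (title, source_url, resources, url, datasets, resource_urls) are hypothetical and chosen only for illustration.

```python
def transform(records):
    """Filter and restructure scraped records.

    Hypothetical input: a list of dicts as a scraper might emit them.
    Hypothetical output: a dict holding only records with downloadable files.
    """
    datasets = []
    for record in records:
        # Keep only resources that actually carry a URL.
        urls = [r["url"] for r in record.get("resources", []) if r.get("url")]
        if not urls:
            continue  # filter step: drop records with nothing to download
        # Restructure step: flatten into the shape a consumer expects.
        datasets.append({
            "title": record.get("title", "").strip(),
            "source_url": record.get("source_url"),
            "resource_urls": urls,
        })
    return {"datasets": datasets}


if __name__ == "__main__":
    scraped = [
        {
            "title": " Example dataset ",
            "source_url": "https://example.org/page",
            "resources": [{"url": "https://example.org/data.csv"}],
        },
        {"title": "Page without files", "resources": []},
    ]
    print(transform(scraped))
```

Keeping the transformer a pure function of the scraper output means it can be rerun on stored output without crawling again.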

We currently have 7 transformers in place:

License

GNU AFFERO GENERAL PUBLIC LICENSE