
Web Crawler for the Human Trafficking Project

This is the core web crawler that will be used for the Human Trafficking Project.

Building

Get the code

Clone or Fork

  # clone
  git clone git@gitlab.com:atl-ads/palantiri.git     # ssh
  # or
  git clone https://gitlab.com/atl-ads/palantiri.git # https
  # build
  cd palantiri

  # Make sure you are using Python 3, then use pip to install the dependencies.
  # The Anaconda package and environment manager is the easiest way to set this up:
  # https://www.continuum.io/downloads
  pip install -e .

  # test
  python setup.py test

Running

Start a MongoDB or PostgreSQL Server

Install MongoDB or PostgreSQL and use the PostgreSQLDump or MongoDBDump class to store the collected data in a database.
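
The dump classes come from this project, but the snippet below is only a minimal sketch of how one might be wired up: the import path, constructor arguments, and save method are assumptions for illustration, not the project's confirmed API (check the source for the real interface).

  # Hypothetical sketch -- the import path, constructor arguments, and
  # method names here are assumptions, not palantiri's confirmed API.
  from palantiri import MongoDBDump  # or PostgreSQLDump

  # Assumed: the dump object is configured with connection details
  # for the database server started above.
  dump = MongoDBDump(host="localhost", port=27017, database="palantiri")

  # Assumed: scraped records are passed to the dump object, which
  # writes them to the configured database.
  dump.save({"site": "example", "data": "..."})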

Scrape

  python search.py -[cgb] <site> <optional arguments>

A more detailed list of arguments may be obtained by running python search.py --help. example.py is an example of what we currently run. A full run takes around 30 minutes.
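
For instance, a single run in one of the modes above might look like the following; the -c flag and the <site> placeholder are taken from the usage string, not a confirmed option list, so consult python search.py --help for the real one.

  # pick one of the -c, -g, or -b modes and name the target site
  python search.py -c <site>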

More Documentation

Dependencies

Contributing

Please see CONTRIBUTING.md for more information about contributing to this project.

Questions

Please check out our Slack if you are already part of the project, or contact @danlrobertson if you have any questions.