This is the core web crawler used for the human trafficking project.
```sh
# clone
git clone git@gitlab.com:atl-ads/palantiri.git  # ssh
# or
git clone https://gitlab.com/atl-ads/palantiri.git  # https

# build
cd palantiri
# Make sure you are using Python 3, then use pip to install dependencies.
# The Anaconda package and environment manager is the easiest way to do this:
# https://www.continuum.io/downloads
pip install -e .

# test
python setup.py test
```
Install MongoDB or PostgreSQL and use the `PostgreSQLDump` or `MongoDBDump` class to store the collected data in a database.
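As a rough illustration of how a dump class might be used to persist crawled records, here is a minimal self-contained sketch. The class name `PostgreSQLDump` comes from this README, but its constructor parameters and `dump` method below are assumptions for illustration, not palantiri's actual API; the stand-in simply collects rows in memory where the real class would write to the database.

```python
class PostgreSQLDump:
    """Illustrative stand-in for palantiri's PostgreSQLDump (not the real API)."""

    def __init__(self, dsn):
        # dsn: connection string for the target database (hypothetical parameter)
        self.dsn = dsn
        self.rows = []

    def dump(self, record):
        # A real implementation would INSERT the record into PostgreSQL here;
        # this sketch just buffers it so the flow is easy to follow.
        self.rows.append(record)


dump = PostgreSQLDump("dbname=palantiri user=crawler")
dump.dump({"url": "https://example.com/ad/1", "title": "sample ad"})
print(len(dump.rows))  # 1
```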
```sh
python search.py -[cgb] <site> <optional arguments>
```

`-[cgb]` defines the domain name, e.g. `-b` for .backpage.com. `<site>` takes a comma-separated list which defines the subdirectories to search, e.g. `BusinessServices,ComputerServices`. Optional arguments take the form `--<argument> value`. A more detailed list may be obtained by running `python search.py --help`. `example.py` is an example of what we currently run. The run time for the program is around 30 minutes.
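The command-line shape described above can be sketched with `argparse`. This is not palantiri's actual code, only a minimal model of the interface as this README describes it: exactly one of `-c`/`-g`/`-b` selects the domain, and the site argument is a comma-separated list of subdirectories.

```python
import argparse

# Sketch of the search.py interface described above (illustrative, not the real parser).
parser = argparse.ArgumentParser(prog="search.py")
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("-b", action="store_true", help="crawl .backpage.com")
group.add_argument("-c", action="store_true", help="domain flag (meaning per --help)")
group.add_argument("-g", action="store_true", help="domain flag (meaning per --help)")
parser.add_argument("site", help="comma-separated list of subdirectories to search")

args = parser.parse_args(["-b", "BusinessServices,ComputerServices"])
print(args.site.split(","))  # ['BusinessServices', 'ComputerServices']
```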
Please see CONTRIBUTING.md for more information about contributing to this project.
Please check out our Slack if you are already a part of the project, or contact @danlrobertson if you have any questions.