ahmia / ahmia-crawler

Collection of crawlers used by the ahmia search engine
BSD 3-Clause "New" or "Revised" License

https://ahmia.fi/

Ahmia is the search engine for .onion domains on the Tor anonymity network. It is led by Juha Nurmi and is based in Finland. This repository contains the crawlers used by the Ahmia search engine.

Prerequisites

Ahmia-index should be installed and running

Installation guide

Install requirements in a virtual environment

python3 -m virtualenv venv3
source venv3/bin/activate
pip install -r requirements.txt

Prefer your own Python HTTP proxy

See the fleet installation instructions.

Configuration

ahmia/ahmia/example.env contains some default values that should work out of the box. Copy this to .env to create your own instance of environment settings:

cp ahmia/ahmia/example.env ahmia/ahmia/.env
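
After copying the file, it can be useful to check that your local .env still defines every key that example.env does, for instance after pulling a newer version of the repository. The following is a minimal sketch, assuming both files use plain KEY=VALUE lines with optional # comments. The helper names read_env_file and missing_keys are hypothetical and are not part of this repository, and the crawler itself may load its settings differently.

```python
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of optional quotes.
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def missing_keys(example_path, env_path):
    """Return keys present in the example file but absent from the local copy."""
    example = read_env_file(example_path)
    local = read_env_file(env_path)
    return sorted(set(example) - set(local))
```

Calling missing_keys("ahmia/ahmia/example.env", "ahmia/ahmia/.env") right after the cp command above should return an empty list, since the two files are identical at that point.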

Usage

To keep the crawler running permanently:

source venv3/bin/activate
./run.sh &> crawler.log

Specific run examples

scrapy crawl ahmia-tor -s DEPTH_LIMIT=1 -s LOG_LEVEL=DEBUG
or
scrapy crawl ahmia-tor -s DEPTH_LIMIT=1 -O items.json:json
or
scrapy crawl ahmia-tor -s DEPTH_LIMIT=3
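
The second command writes the scraped items to items.json. A short script can give a quick overview of that file after a test crawl. This is a rough sketch, assuming the export is a single JSON array of objects, which is what Scrapy's json feed format produces. The function name summarize_items is hypothetical, and the sketch makes no assumption about which fields the Ahmia items contain; it only counts whatever keys it finds.

```python
import json
from collections import Counter


def summarize_items(path):
    """Return (item_count, Counter of field names) for a Scrapy JSON export."""
    with open(path, encoding="utf-8") as handle:
        items = json.load(handle)  # the json feed format is one JSON array
    field_counts = Counter()
    for item in items:
        # Count how many items carry each field name.
        field_counts.update(item.keys())
    return len(items), field_counts
```

For example, count, fields = summarize_items("items.json") followed by print(count, fields.most_common()) shows how many items were exported and how often each field appears.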

Crontabs

# Every day
PATH=/usr/local/bin:/usr/bin:/bin:/home/juha/.local/bin
30 06 * * * cd /home/juha/ahmia-crawler/ && bash run_daily.sh > ./daily.log 2>&1
# First day of each month
PATH=/usr/local/bin:/usr/bin:/bin:/home/juha/.local/bin
30 01 01 * * cd /home/juha/ahmia-crawler/ && bash run.sh > ./monthly.log 2>&1
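
Each crontab entry above starts with five schedule fields (minute, hour, day of month, month, day of week) followed by the command. The daily entry therefore runs at 06:30 every day, and the monthly entry runs at 01:30 on the first day of each month. As an illustration only, the sketch below splits an entry into those named fields; split_cron_line is a hypothetical helper and does not validate or expand cron syntax.

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def split_cron_line(line):
    """Split a crontab entry into its five schedule fields and the command."""
    # Split on whitespace at most five times so the command stays intact.
    parts = line.split(None, 5)
    if len(parts) < 6:
        raise ValueError("expected five schedule fields followed by a command")
    schedule = dict(zip(CRON_FIELDS, parts[:5]))
    return schedule, parts[5]
```

Passing the monthly entry to split_cron_line gives a schedule whose day_of_month is "01" and whose month and day_of_week are "*", which matches the "first day of each month" comment.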