AIL crawlers use a Splash crawler to fetch and render a domain.
This Flask server simplifies installing and managing these Splash crawlers:
git clone https://github.com/ail-project/ail-splash-manager.git
cd ail-splash-manager
./install.sh
./LAUNCH.sh -l
./LAUNCH.sh -k
./LAUNCH.sh -t
The tor proxy from the Ubuntu package is installed by default.
This package is outdated: some v3 onion addresses are not resolved.
/!\ Install the tor proxy provided by the Tor Project to solve this issue. /!\
Note: for an Ubuntu install, add the Tor Project repository to the apt sources:
sudo sh -c 'echo "deb https://deb.torproject.org/torproject.org $(lsb_release -sc) main" >> /etc/apt/sources.list.d/tor-project.list'
Once tor is installed, all Splash Docker containers must be allowed to reach this proxy. You can use the
configure_tor.sh script or configure it yourself.
Install Script
cd ail-splash-manager
./configure_tor.sh
Manual configuration:
In /etc/tor/torrc, add
SocksPort 0.0.0.0:9050
or
SocksPort 172.17.0.1:9050
then add
SocksPolicy accept 172.17.0.0/16
(for Docker on Linux, the host is reachable from the containers at 172.17.0.1; adapt this for other platforms)
Restart tor:
sudo service tor restart
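To confirm that the proxy port accepts connections after the restart, a plain TCP connection test can help. The following is a minimal sketch and not part of ail-splash-manager: `port_is_open` is a hypothetical helper, and it only checks that something is listening on the address, not that it speaks SOCKS5.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        # create_connection raises OSError on refusal, timeout or routing errors
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the torrc settings above, `port_is_open("172.17.0.1", 9050)` would be expected to return True once tor is listening on the Docker bridge address.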
Edit config/proxies_profiles.cfg:
[section_name]: proxy name; each section describes a proxy.
host: proxy host
port: proxy port
type: proxy type, SOCKS5 or HTTP
description: proxy description
crawler_type: crawler type (tor, i2p or web)

[default_tor] # section name: proxy name
host=172.17.0.1
port=9050
type=SOCKS5
description=tor default proxy
crawler_type=tor
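As an illustration of the fields described above, here is a minimal sketch that reads a proxy profile with Python's standard `configparser` and checks the documented values. It is not the manager's own loader: `load_proxies`, `ALLOWED_TYPES` and `ALLOWED_CRAWLERS` are hypothetical names, and the strictness of the checks is an assumption.

```python
import configparser

# Same content as the default_tor example, without the trailing comment.
EXAMPLE = """
[default_tor]
host=172.17.0.1
port=9050
type=SOCKS5
description=tor default proxy
crawler_type=tor
"""

ALLOWED_TYPES = {"SOCKS5", "HTTP"}
ALLOWED_CRAWLERS = {"tor", "i2p", "web"}


def load_proxies(text: str) -> dict:
    """Parse proxy profiles and validate the documented fields."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    proxies = {}
    for name in parser.sections():
        section = parser[name]
        proxy = {
            "host": section["host"],
            "port": section.getint("port"),
            "type": section["type"],
            "description": section.get("description", fallback=""),
            "crawler_type": section["crawler_type"],
        }
        if proxy["type"] not in ALLOWED_TYPES:
            raise ValueError(f"{name}: unsupported proxy type {proxy['type']!r}")
        if proxy["crawler_type"] not in ALLOWED_CRAWLERS:
            raise ValueError(
                f"{name}: unsupported crawler type {proxy['crawler_type']!r}"
            )
        proxies[name] = proxy
    return proxies


proxies = load_proxies(EXAMPLE)
print(proxies["default_tor"])
```

A missing `host`, `port`, `type` or `crawler_type` raises `KeyError` or `TypeError` in this sketch, which is usually enough to spot a typo before starting the containers.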
Edit config/containers.cfg:
[section_name]: splash name; each section describes a splash container.
proxy_name: proxy name (defined in proxies_profiles.cfg)
port: single port or port range (ex: 8050 or 8050-8052)
cpu: max number of CPUs allocated
memory: max RAM (GB) allocated
description: Splash description
net: network type (bridge, host...)

[default_splash_tor] # section name: splash name
proxy_name=default_tor
port=8050-8052
cpu=1
memory=1
maxrss=2000
description= default splash tor
net=bridge
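The `port` field accepts either a single port or a range. The sketch below shows one way such a value could be interpreted. `expand_ports` is a hypothetical helper, and treating the range as inclusive (so `8050-8052` yields three ports) is an assumption rather than something taken from the manager's code.

```python
from typing import List


def expand_ports(value: str) -> List[int]:
    """Expand '8050' or '8050-8052' into a list of port numbers."""
    first, sep, last = value.strip().partition("-")
    start = int(first)
    # Without a dash the value is a single port
    end = int(last) if sep else start
    if end < start:
        raise ValueError(f"invalid port range: {value!r}")
    return list(range(start, end + 1))


print(expand_ports("8050-8052"))
print(expand_ports("8050"))
```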
Go to the I2P website and follow the installation instructions.
config/containers.cfg:
net: needs to be host to work

[default_splash_i2p] # section name: splash name
proxy_name=default_i2p
port=8053-8055
cpu=1
memory=1
maxrss=2000
description=default splash i2p
net=host
config/proxies_profiles.cfg:
host: needs to be 127.0.0.1 to work

[default_i2p]
host=127.0.0.1
port=4444
type=HTTP
description=i2p default proxy
crawler_type=i2p
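The two I2P constraints (net must be host, proxy host must be 127.0.0.1) are easy to get wrong when editing both files by hand. The following sketch cross-checks them. It is an illustration only: `check_i2p` is a hypothetical function, and the two strings reproduce the example sections rather than reading the real files.

```python
import configparser

CONTAINERS = """
[default_splash_i2p]
proxy_name=default_i2p
port=8053-8055
cpu=1
memory=1
maxrss=2000
description=default splash i2p
net=host
"""

PROXIES = """
[default_i2p]
host=127.0.0.1
port=4444
type=HTTP
description=i2p default proxy
crawler_type=i2p
"""


def check_i2p(containers_text: str, proxies_text: str) -> list:
    """Return a list of problems found; the list is empty if none are found."""
    containers = configparser.ConfigParser(interpolation=None)
    containers.read_string(containers_text)
    proxies = configparser.ConfigParser(interpolation=None)
    proxies.read_string(proxies_text)

    problems = []
    for name in containers.sections():
        proxy_name = containers[name]["proxy_name"]
        if not proxies.has_section(proxy_name):
            problems.append(f"{name}: unknown proxy {proxy_name!r}")
            continue
        proxy = proxies[proxy_name]
        if proxy.get("crawler_type") != "i2p":
            continue  # only the I2P rules are checked in this sketch
        if containers[name].get("net") != "host":
            problems.append(f"{name}: net must be host for I2P")
        if proxy.get("host") != "127.0.0.1":
            problems.append(f"{name}: proxy host must be 127.0.0.1 for I2P")
    return problems


print(check_i2p(CONTAINERS, PROXIES))
```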
Edit /etc/squid/squid.conf:
acl localnet src 172.17.0.0/16 # Docker IP range
http_access allow localnet
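The acl above admits any client whose source address falls inside 172.17.0.0/16. To check whether a given container address is covered, the standard `ipaddress` module is enough. This is a small sketch: `in_docker_range` is a hypothetical helper, and the range should be adjusted if your Docker network uses a different subnet.

```python
import ipaddress

# Same range as the squid acl above
DOCKER_NET = ipaddress.ip_network("172.17.0.0/16")


def in_docker_range(address: str) -> bool:
    """Return True if the address belongs to the Docker range of the acl."""
    return ipaddress.ip_address(address) in DOCKER_NET


print(in_docker_range("172.17.0.2"))
print(in_docker_range("172.18.0.2"))
```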
Add a new proxy in config/proxies_profiles.cfg:
[squid_proxy]
host=172.17.0.1
port=3128
type=HTTP
description=squid web proxy
crawler_type=web
Bind this proxy to a Splash docker in config/containers.cfg
api/v1/ping
api/v1/version
api/v1/get/session_uuid
api/v1/get/proxies/all
api/v1/get/splash/all
api/v1/splash/restart
api/v1/splash/kill
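When scripting against these endpoints, it can help to build the URLs in one place. The sketch below only composes URLs. The base URL `https://manager.example:7001` is a placeholder, `endpoint_url` is a hypothetical helper, and the HTTP method and any authentication each endpoint requires are not specified here and must be taken from the server itself.

```python
from urllib.parse import urljoin

# Endpoint paths as listed above
ENDPOINTS = (
    "api/v1/ping",
    "api/v1/version",
    "api/v1/get/session_uuid",
    "api/v1/get/proxies/all",
    "api/v1/get/splash/all",
    "api/v1/splash/restart",
    "api/v1/splash/kill",
)


def endpoint_url(base_url: str, endpoint: str) -> str:
    """Join the base URL and a known endpoint path."""
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint!r}")
    # Force exactly one trailing slash so urljoin appends the path
    return urljoin(base_url.rstrip("/") + "/", endpoint)


# Placeholder host and port, replace with your own deployment
print(endpoint_url("https://manager.example:7001", "api/v1/ping"))
```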