A sett or set is a badger's den which usually consists of a network of tunnels and numerous entrances. Setts incorporate larger chambers used for sleeping or rearing young.
This script is designed to raise young Privacy Badgers by teaching them about the trackers on popular sites. Every day, crawler.py visits thousands of the top sites from the Tranco List with the latest version of Privacy Badger, and saves its findings in results.json.
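If you'd like a quick peek at the scan data, results.json is ordinary JSON, so (assuming you have Python 3 handy) something like the following will list its top-level keys:

$ python3 -c "import json; print(list(json.load(open('results.json'))))"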
See the following EFF.org blog post for more information: Giving Privacy Badger a Jump Start.
Install Python 3.8+
Create and activate a Python virtual environment:
python3 -m venv venv
source ./venv/bin/activate
pip install -U pip
For more, read this blog post.
Install Python dependencies with pip install -r requirements.txt
Run static analysis with prospector
Run unit tests with pytest
Take a look at Badger Sett's command-line flags with ./crawler.py --help
Git clone the Privacy Badger repository somewhere
Try running a tiny scan:
./crawler.py firefox 5 --no-xvfb --log-stdout --pb-dir /path/to/privacybadger
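For example, assuming you clone Privacy Badger into a sibling directory (the path below is just an illustration; adjust it to wherever you put your checkout):

git clone https://github.com/EFForg/privacybadger.git ../privacybadger
./crawler.py firefox 5 --no-xvfb --log-stdout --pb-dir ../privacybadger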
Docker takes care of all dependencies, including setting up the latest browser version.
However, Docker brings its own complexity. Problems from improper file ownership and permissions are a particular pain point.
Prerequisites: have Docker installed.
Make sure your user is part of the docker group so that you can build and run Docker images without sudo. You can add yourself to the group with
$ sudo usermod -aG docker $USER
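Note that group membership is only picked up at login, so log out and back in, or start a shell with the new group applied:

$ newgrp docker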
Clone the repository
$ git clone https://github.com/efforg/badger-sett
Run a scan
$ BROWSER=firefox ./runscan.sh 500
This will scan the top 500 sites on the Tranco list in Firefox with the latest version of Privacy Badger's master branch.
To run the script with a different branch of Privacy Badger, set the PB_BRANCH environment variable, e.g.
$ PB_BRANCH=my-feature-branch BROWSER=firefox ./runscan.sh 500
You can also pass arguments to crawler.py, the Python script that does the actual crawl. Any arguments passed to runscan.sh will be forwarded to crawler.py. For example, to exclude all websites ending with .gov and .mil from your website visit list:
$ BROWSER=edge ./runscan.sh 500 --exclude .gov,.mil
Monitor the scan
To have the scan print verbose output about which sites it's visiting, use the --log-stdout argument. If you don't use that argument, all output will still be logged to docker-out/log.txt, beginning after the script outputs "Running scan in Docker..."
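To follow the log while a scan is in progress, you can tail it from another terminal:

$ tail -f docker-out/log.txt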
To set up the script to run periodically and automatically update the repository with its results:
Create a new ssh key with ssh-keygen. Give it a name unique to the repository.
$ ssh-keygen
Generating public/private rsa key pair.
Enter file in which to save the key (/home/USER/.ssh/id_rsa): /home/USER/.ssh/id_rsa_badger_sett
Add the new key as a deploy key with read/write access to the repo on GitHub: https://developer.github.com/v3/guides/managing-deploy-keys/
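The public half of the key pair is what goes into the deploy key form:

$ cat ~/.ssh/id_rsa_badger_sett.pub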
Add an SSH host alias for GitHub that uses the new key pair. Create or open ~/.ssh/config and add the following:
Host github-badger-sett
HostName github.com
User git
IdentityFile /home/USER/.ssh/id_rsa_badger_sett
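You can verify the alias and key work before touching any git configuration; GitHub should greet you by name and note that it does not provide shell access:

$ ssh -T github-badger-sett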
Configure git to connect to the remote over SSH. Edit .git/config:
[remote "origin"]
url = ssh://git@github-badger-sett:/efforg/badger-sett
This will have git connect to the remote using the new SSH keys by default.
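Alternatively, you can make the same change without editing the file by hand:

$ git remote set-url origin ssh://git@github-badger-sett:/efforg/badger-sett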
Create a cron job to call runscan.sh once a day. Set the environment variable RUN_BY_CRON=1 to turn off TTY forwarding to docker run (which would break the script in cron), and set GIT_PUSH=1 to have the script automatically commit and push results.json when the scan finishes. Here's an example crontab entry:
0 0 * * * RUN_BY_CRON=1 GIT_PUSH=1 BROWSER=chrome /home/USER/badger-sett/runscan.sh 6000 --exclude=.mil,.mil.??,.gov,.gov.??,.edu,.edu.??
If everything has been set up correctly, the script should push a new version of results.json after each scan.
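To check your setup without waiting for cron, you can run the same command by hand first, perhaps with a smaller site count:

$ RUN_BY_CRON=1 GIT_PUSH=1 BROWSER=chrome /home/USER/badger-sett/runscan.sh 500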