OpenWPM is a web privacy measurement framework which makes it easy to collect data for privacy studies on a scale of thousands to millions of websites. OpenWPM is built on top of Firefox, with automation provided by Selenium. It includes several hooks for data collection. Check out the instrumentation section below for more details.
OpenWPM has been developed and tested on Ubuntu 14.04/16.04. An installation
script, install.sh, is included to install both the system and python
dependencies automatically. A few of the python dependencies require specific
versions, so you should install the dependencies in a virtual environment if
you're installing on a shared machine. If you plan to develop OpenWPM's
instrumentation extension or run tests you will also need to install the
development dependencies included in install-dev.sh.
It is likely that OpenWPM will work on platforms other than Ubuntu; however, we do not officially support anything else. For pointers on alternative platform support see the wiki.
Once installed, it is very easy to run a quick test of OpenWPM. Check out
demo.py for an example. This will use the default settings specified in
automation/default_manager_params.json and
automation/default_browser_params.json, with the exception of the changes
specified in demo.py.
More information on the instrumentation and configuration parameters is given below.
The wiki provides a more in-depth tutorial, including a platform demo and a description of the additional commands available. You can also take a look at two of our past studies, which use the infrastructure:
OpenWPM provides several instrumentation modules which can be enabled
independently of each other for each crawl. With the exception of
response body content, all instrumentation saves to a SQLite database specified
by manager_params['database_name'] in the main output directory. Response
bodies are saved to content.ldb. The SQLite schema is specified by
automation/schema.sql; instrumentation may specify additional tables necessary
for its measurement data (see extension tables).

HTTP requests, responses, and redirects:
Enable with browser_params['http_instrument'] = True.
Data is saved to the http_requests, http_responses, and http_redirects tables.
See the http_requests schema documentation.
channel_id can be used to link a request saved in the http_requests table to its corresponding response in the http_responses table.
channel_id can also be used to link a request to the subsequent request that results after an HTTP redirect (3XX response). Use the http_redirects table, which includes a mapping between old_channel_id, the channel_id of the HTTP request that resulted in a 3XX response, and new_channel_id, the channel_id of the HTTP request that resulted from that redirect.

Javascript calls:
Instrumented APIs include plugin access (via navigator.plugins), MIME type access (via navigator.mimeTypes), window.Storage, window.localStorage, window.sessionStorage, and window.name access, navigator properties (e.g. appCodeName, oscpu, userAgent, ...), and window properties (via window.screen).
Enable with browser_params['js_instrument'] = True.
Data is saved to the javascript table.

Response body content:
Content is saved to a LevelDB database de-duplicated by the md5 hash of the content.
Enable with browser_params['save_all_content'] = True.
The content_hash column of the http_responses table contains the md5 hash for each script, and can be used to do content lookups in the LevelDB content database.
Set browser_params['save_javascript'] = True to save only Javascript files. This will lessen the performance impact of this instrumentation when a large number of browsers are used in parallel.

Flash cookies:
Call the CommandSequence::dump_flash_cookies command after a page visit. Note that calling this command will close the current tab before recording the cookie changes.
Data is saved to the flash_cookies table.

Cookie instrumentation:
Enable with browser_params['cookie_instrument'] = True.
Data is saved to the javascript_cookies table.

Content policy instrumentation:
Enable with browser_params['cp_instrument'] = True.
Data is saved to the content_policy table.

Profile cookies:
Read from the cookies.sqlite database in the Firefox profile directory.
Call the CommandSequence::dump_profile_cookies command after a page visit. Note that calling this command will close the current tab before recording the cookie changes.
Data is saved to the profile_cookies table.

Log files:
See manager_params['data_directory'] and manager_params['log_file'].

Browser profile:
Saved automatically when browser_params['profile_archive_dir'] is set.
Saved on demand with the CommandSequence::dump_profile command.

Rendered page source:
Save with the CommandSequence::dump_page_source command.
Save recursively, including nested iframes, with the CommandSequence::recursive_dump_page_source command. The result has the following nested structure:
{
    'document_url': "http://example.com",
    'source': "<html> ... </html>",
    'iframes': {
        'frame_1': {'document_url': ...,
                    'source': ...,
                    'iframes': { ... }},
        'frame_2': {'document_url': ...,
                    'source': ...,
                    'iframes': { ... }},
        'frame_3': { ... }
    }
}
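
As a rough sketch of how this nested structure might be consumed, the following walks a page source dump and yields every frame's URL and source. The function name iter_frames and the sample data are illustrative assumptions and not part of OpenWPM; only the document_url, source, and iframes keys come from the structure above.

```python
def iter_frames(page, depth=0):
    """Yield (depth, document_url, source) for a frame and all nested iframes."""
    yield depth, page['document_url'], page['source']
    # 'iframes' maps a frame identifier to a dict with the same shape.
    for child in page.get('iframes', {}).values():
        for item in iter_frames(child, depth + 1):
            yield item


# Hypothetical sample data in the shape shown above.
sample = {
    'document_url': 'http://example.com',
    'source': '<html> ... </html>',
    'iframes': {
        'frame_1': {
            'document_url': 'http://example.com/ad',
            'source': '<html>ad</html>',
            'iframes': {},
        },
    },
}

frames = list(iter_frames(sample))
for depth, url, _source in frames:
    # Indent each URL by its nesting depth.
    print('  ' * depth + url)
```

The recursion mirrors the nesting of the dump, so frames at any depth are visited without knowing the depth in advance.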

Screenshots:
Save a screenshot with the CommandSequence::save_screenshot command.
Save a full-page screenshot with the CommandSequence::screenshot_full_page command.
screenshot_path.
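
The channel_id linkage described for the HTTP instrumentation can be sketched with Python's built-in sqlite3 module. The tables below are a simplified stand-in built in memory: apart from channel_id, old_channel_id, and new_channel_id, the column names (url, response_status) and the sample rows are assumptions for illustration. Consult automation/schema.sql for the real schema.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript("""
    CREATE TABLE http_requests (channel_id TEXT, url TEXT);
    CREATE TABLE http_responses (channel_id TEXT, response_status INTEGER);
    CREATE TABLE http_redirects (old_channel_id TEXT, new_channel_id TEXT);
    INSERT INTO http_requests VALUES ('a', 'http://example.com/');
    INSERT INTO http_requests VALUES ('b', 'https://example.com/');
    INSERT INTO http_responses VALUES ('a', 301);
    INSERT INTO http_responses VALUES ('b', 200);
    INSERT INTO http_redirects VALUES ('a', 'b');
""")

# Link each request to its response through the shared channel_id.
pairs = conn.execute("""
    SELECT req.url, resp.response_status
    FROM http_requests AS req
    JOIN http_responses AS resp ON req.channel_id = resp.channel_id
    ORDER BY req.channel_id
""").fetchall()

# Link a redirected request to the request that followed it.
hops = conn.execute("""
    SELECT src.url, dst.url
    FROM http_redirects AS r
    JOIN http_requests AS src ON src.channel_id = r.old_channel_id
    JOIN http_requests AS dst ON dst.channel_id = r.new_channel_id
""").fetchall()

print(pairs)
print(hops)
conn.close()
```

Against a real crawl database, the same joins would be run on the file named by manager_params['database_name'] instead of an in-memory database.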

The browser and platform can be configured by two separate dictionaries. The
platform configuration options can be set in manager_params, while the
browser configuration options can be set in browser_params. The default
settings are given in automation/default_manager_params.json and
automation/default_browser_params.json.
To load the default configuration parameter dictionaries we provide a helper
function TaskManager::load_default_params. For example:

from automation import TaskManager
manager_params, browser_params = TaskManager.load_default_params(num_browsers=5)

where manager_params is a dictionary and browser_params is a length 5 list
of configuration dictionaries.
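
To illustrate how the two return values are typically adjusted, the sketch below uses plain Python stand-ins with the same shapes (one dictionary and a list of per-browser dictionaries) so that it runs without OpenWPM installed. In a real crawl the values would come from TaskManager.load_default_params; the keys set here are taken from the options named in this document, and the output path is a made-up example.

```python
NUM_BROWSERS = 5

# Stand-ins for the values returned by
# TaskManager.load_default_params(num_browsers=NUM_BROWSERS).
manager_params = {}
browser_params = [{} for _ in range(NUM_BROWSERS)]

# Platform-wide options live in the single manager dictionary.
manager_params['data_directory'] = '~/openwpm-demo/'  # hypothetical path
manager_params['log_directory'] = '~/openwpm-demo/'   # hypothetical path

# Browser options are set per browser, so each dictionary is updated in turn.
for params in browser_params:
    params['http_instrument'] = True
    params['headless'] = True

# Individual browsers can differ, e.g. only the first sends Do Not Track.
browser_params[0]['donottrack'] = True
```

Because each browser has its own dictionary, a single crawl can mix configurations, for example to compare measurements with and without Do Not Track.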

data_directory
log_directory
log_file: log_directory.
database_name: data_directory
failure_limit: CommandExecutionError exception. Otherwise the default is set to 2 x the number of browsers plus 10.
testing
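
The default failure limit described above is simple arithmetic. A minimal sketch, where default_failure_limit is a hypothetical helper name and not an OpenWPM function:

```python
def default_failure_limit(num_browsers):
    """Return 2 x the number of browsers plus 10, per the description above."""
    return 2 * num_browsers + 10


# With 5 browsers the default works out to 2 * 5 + 10.
print(default_failure_limit(5))
```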
Note: Instrumentation configuration options are described in the Instrumentation and Data Access section and profile configuration options are described in the Browser Profile Support section. As such, these options are left out of this section.
bot_mitigation
disable_flash: Set to False to re-enable Flash. Note that flash cookies are shared between browsers.
headless
browser: firefox is supported.
tp_cookies:
    always: Accept all third-party cookies.
    never: Never accept any third-party cookies.
    from_visited: Only accept third-party cookies from sites that have been visited as a first party.
donottrack: Set to True to enable Do Not Track in the browser.
disconnect: Set to True to enable Disconnect with all blocking enabled.
ghostery: Set to True to enable Ghostery with all blocking enabled.
https-everywhere: Set to True to enable HTTPS Everywhere in the browser.
ublock-origin: Set to True to enable uBlock Origin in the browser.
tracking-protection: Set to True to enable Firefox's built-in Tracking Protection.

By default OpenWPM performs a "stateful" crawl, in that it keeps a consistent browser profile between page visits in the same browser. If the browser freezes or crashes during the crawl, the profile is saved to disk and restored before the next page visit.
It's also possible to run "stateless" crawls, in which each new page visit uses
a fresh browser profile. To perform a stateless crawl you can restart the
browser after each command sequence by setting the reset initialization
argument to True when creating the command sequence. As an example:
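
from automation import CommandSequence, TaskManager

# manager_params, browser_params, and sites are assumed to be defined already
manager = TaskManager.TaskManager(manager_params, browser_params)

for site in sites:
    # reset=True restarts the browser after this command sequence
    command_sequence = CommandSequence.CommandSequence(site, reset=True)
    command_sequence.get(sleep=30, timeout=60)
    command_sequence.dump_profile_cookies(120)
    manager.execute_command_sequence(command_sequence)
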
In this example, the browser will get the requested site, sleep for 30
seconds, dump the profile cookies to the crawl database, and then restart the
browser before visiting the next site in sites.
It's possible to load and save profiles during stateful crawls. Profile dumps currently consist of the following browser storage items:
Other browser state, such as the browser cache, is not saved. In Issue #62 we plan to expand profiles to include all browser storage.
A browser's profile can be saved to disk for use in later crawls. This can be done using a browser command or by setting a browser configuration parameter. For long running crawls we recommend saving the profile using the browser configuration parameter as the platform will take steps to save the profile in the event of a platform-level crash, whereas there is no guarantee the browser command will run before a crash.
Browser configuration parameter: Set the profile_archive_dir browser
parameter to a directory where the browser profile should be saved. The profile
will be automatically saved when TaskManager::close is called or when a
platform-level crash occurs.
Browser command: See the command definition wiki page for more information.
To load a profile, specify the profile_tar browser parameter in the browser
configuration dictionary. This should point to the location of the
profile.tar (or profile.tar.gz if compressed) file produced by OpenWPM.
The profile will be automatically extracted and loaded into the browser
instance for which the configuration parameter was set.
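
Before pointing profile_tar at an archive it can be useful to check what the archive contains. The sketch below uses Python's standard tarfile module, whose 'r:*' mode reads both plain and gzip-compressed tar files. The helper name list_profile_archive and the demo archive it builds are assumptions for illustration; a real archive would be one produced by OpenWPM.

```python
import io
import os
import tarfile
import tempfile


def list_profile_archive(path):
    """Return the member names of a profile.tar or profile.tar.gz file."""
    # 'r:*' lets tarfile detect the compression transparently.
    with tarfile.open(path, 'r:*') as archive:
        return sorted(archive.getnames())


# Build a tiny stand-in archive so the example runs on its own.
tmp_dir = tempfile.mkdtemp()
demo_path = os.path.join(tmp_dir, 'profile.tar.gz')
with tarfile.open(demo_path, 'w:gz') as archive:
    payload = b'placeholder'
    info = tarfile.TarInfo(name='cookies.sqlite')
    info.size = len(payload)
    archive.addfile(info, io.BytesIO(payload))

members = list_profile_archive(demo_path)
print(members)
```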
Much of OpenWPM's instrumentation is included in a Firefox add-on SDK extension.
Thus, in order to add or change instrumentation you will need a few additional
dependencies, which can be installed with install-dev.sh.
The extension instrumentation is included in /automation/Extension/firefox/.
Any edits within this directory will require the extension to be re-built with
jpm to produce a new openwpm.xpi with your updates. For more information on
developing a Firefox extension, we recommend reading this
MDN introductory tutorial, as well as the jpm reference page.
Manual debugging with OpenWPM can be difficult. By design the platform runs all browsers in separate processes and swallows all exceptions (with the intent of continuing the crawl). We recommend using manual_test.py.
This utility allows manual debugging of the extension instrumentation with or without Selenium enabled, and makes it easy to launch a Selenium instance (without any instrumentation):
python -m test.manual_test uses jpm to build the current extension directory
and launch a Firefox instance with it.
python -m test.manual_test --selenium launches a Firefox Selenium instance
after using jpm to automatically rebuild openwpm.xpi. The script then
drops into an ipython shell where the webdriver instance is available
through the variable driver.
python -m test.manual_test --selenium --no_extension launches a Firefox Selenium
instance with no instrumentation. The script then drops into an ipython
shell where the webdriver instance is available through the variable driver.

OpenWPM's tests are built on py.test. To run the tests you will need a few
additional dependencies, which can be installed by running install-dev.sh.
Once installed, execute py.test -vv in the test directory to run all tests.
WebDriverException: Message: The browser appears to have exited before we could connect...
This error indicates that Firefox exited during startup (or was prevented from starting). There are many possible causes of this error:
Check that both selenium and Firefox are the appropriate versions. Run the
following commands and check that the versions output match the required
versions in install.sh and requirements.txt. If not, re-run the install script.
cd firefox-bin/
firefox --version
and
pip show selenium
If you are running without a display, check that the headless browser
parameter is set to True before launching.

Note that OpenWPM is under active development, and should be considered experimental software. The repository may contain experimental features that aren't fully tested. We recommend using a tagged release.
Although OpenWPM is actively used by our group for research studies and we regularly use the data collected, it is still possible there are unknown bugs in the infrastructure. We are in the process of writing comprehensive tests to verify the integrity of all included instrumentation. Prior to using OpenWPM for your own research we encourage you to write tests (and submit pull requests!) for any instrumentation that isn't currently included in our test scripts.
If you use OpenWPM in your research, please cite our CCS 2016 publication on the infrastructure. You can use the following BibTeX.
@inproceedings{englehardt2016census,
author = "Steven Englehardt and Arvind Narayanan",
title = "{Online tracking: A 1-million-site measurement and analysis}",
booktitle = {Proceedings of ACM CCS 2016},
year = "2016",
}
As of September 2017 OpenWPM has been used in 20 studies.
OpenWPM is licensed under GNU GPLv3. Additional code has been included from FourthParty and Privacy Badger, both of which are licensed GPLv3+.