This capability, still under development, finds matches across the internet for content submitted via a form, flagging known disinformation sites.
The Disinformation Laundromat - Domain Fingerprinting uses a set of indicators extracted from a webpage to provide evidence about who administers a website and how the website was made.
These indicators point towards a reasonable likelihood that a collection of sites is owned by the same entity.
These indicators can be circumstantial correlations and should be substantiated with indicators of higher certainty.
Included with this tool is a small database of indicators for known sites. For more on creating your own corpus see 'Creating your own corpus' below.
This tool requires Python 3 and pip to run, and can be obtained by downloading this repository or cloning it using git:
git clone https://github.com/pbenzoni/disinfo-laundromat.git
Once the code is downloaded and you've navigated to the project directory, install the necessary packages:
pip install -r requirements.txt
For the simpler use case, where you're seeking indicators about a few URLs rather than building up a corpus of data, simply run the included Flask app from the main directory:
python app.py
This should deploy a simple web app at http://127.0.0.1:8000 or your equivalent location. To make this web app accessible to external users, consider following a tutorial for deploying Flask apps, such as https://pythonbasics.org/deploy-flask-app/
To analyze many URLs at once, or to run a suspect URL against those URLs, you'll need to generate a list of indicators and then choose your comparison method.
To generate a new indicator corpus (a list of indicators associated with each site), run the following command:
py crawler.py <input-filename>.csv <output-filename>.csv
By default, <input-filename>.csv must contain at least one column of URLs with the header 'domain_name', but may contain any number of other columns. Entries in the 'domain_name' column must be formatted as 'https://subdomain.domain.TLD' with no trailing slashes. The subdomain is optional, and each unique subdomain will be treated as a separate site. The TLD may be any widely supported TLD (e.g. .com, .co.uk, .social).
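As a sketch of the expected input format (the exact validation inside crawler.py may differ; the sample rows and the regex below are illustrative assumptions, not taken from the crawler's code):

```python
import csv
import io
import re

# Hypothetical sample input; only the 'domain_name' header is required.
SAMPLE_CSV = """domain_name,notes
https://example.com,seed site
https://news.example.co.uk,each unique subdomain is a separate site
"""

# 'https://subdomain.domain.TLD', subdomain optional, no trailing slash.
DOMAIN_RE = re.compile(r"^https://([a-z0-9-]+\.)*[a-z0-9-]+\.[a-z.]{2,}$",
                       re.IGNORECASE)

def valid_rows(text):
    """Return the 'domain_name' values that match the expected format."""
    reader = csv.DictReader(io.StringIO(text))
    return [row["domain_name"] for row in reader
            if DOMAIN_RE.match(row["domain_name"])]

print(valid_rows(SAMPLE_CSV))
```

Entries with a trailing slash (e.g. 'https://example.com/') fail this check, matching the formatting rule above.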
To check matches within the existing corpus (e.g. with {a.com, b.com, and c.com}, comparisons will be conducted between a.com and b.com, b.com and c.com, and a.com and c.com), use the following command:
py match.py
To check a given url against the corpus, run the following command:
py match.py -domain <domain-to-be-searched>
While a GUI is provided, any function of the Laundromat is also available via an API, as described below:
[GET] /api/metadata
[GET] /api/indicators
Required request fields:
request.args.get('type', '')
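Since the 'type' field is read from the query string, a request to this endpoint can be built as below (the host/port follow the local Flask app above; the 'domain' value is an illustrative assumption):

```python
from urllib.parse import urlencode

# Build the query string for GET /api/indicators.
# 'type' is the required query parameter; its value here is assumed.
query = urlencode({"type": "domain"})
url = f"http://127.0.0.1:8000/api/indicators?{query}"
print(url)
```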
[POST] /api/content
Required request fields:
request.form.get('titleQuery')
request.form.get('contentQuery')
request.form.get('combineOperator')
request.form.get('language')
request.form.get('country')
request.form.getlist('search_engines')
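A sketch of a form payload for POST /api/content, using the field names listed above. All values are illustrative assumptions; the commented-out call assumes the local Flask app is running:

```python
# Single-valued form fields for /api/content (values are assumptions).
payload = {
    "titleQuery": "example headline",
    "contentQuery": "distinctive sentence from the page",
    "combineOperator": "AND",
    "language": "en",
    "country": "US",
}

# 'search_engines' is read with getlist(), so the key is repeated
# once per engine in the form data.
data = [("search_engines", e) for e in ("google", "bing")] + list(payload.items())

# To actually send the request (requires the app running locally):
# import requests
# resp = requests.post("http://127.0.0.1:8000/api/content", data=data)
```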
[POST] /api/parse-url
Required request fields:
request.form['url']
request.form.getlist('search_engines')
request.form.get('combineOperator')
request.form.get('language')
request.form.get('country')
[POST] /api/content-csv
Required request fields:
request.files['file']
request.form.get('email')
[POST] /api/fingerprint-csv
Required request fields:
request.files['file']
request.form.get('email')
request.form.get('internal_only')
request.form.get('run_urlscan')
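A sketch of a multipart upload to POST /api/fingerprint-csv, using the field names listed above. The filename, CSV contents, and flag values are illustrative assumptions:

```python
# The CSV goes in the 'file' part; the remaining fields are ordinary
# form fields. Boolean-as-string flag values below are assumptions.
files = {"file": ("sites.csv", "domain_name\nhttps://example.com\n", "text/csv")}
form = {
    "email": "you@example.org",   # where to send results
    "internal_only": "true",
    "run_urlscan": "false",
}

# To actually send the request (requires the app running locally):
# import requests
# resp = requests.post("http://127.0.0.1:8000/api/fingerprint-csv",
#                      files=files, data=form)
```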
[POST] /api/download_csv
Required request fields:
request.form.get('csv_data', '')
See matches.csv
Add additional indicators: