Lookyloo / lookyloo

Lookyloo is a web interface that allows users to capture a website page and then display a tree of domains that call each other.
https://www.lookyloo.eu

[Feature] Improve gathering contact details for takedown requests #900

Closed Rafiot closed 2 months ago

Rafiot commented 3 months ago

To properly get rid of a malicious page, we need to submit takedown request(s) to all the relevant parties for all the URLs involved in the case. It will often be a chain of redirects involving multiple domains, but also IPs and ASNs.

The contact information is mainly gathered in lookyloo.takedown_details:

https://github.com/Lookyloo/lookyloo/blob/9ffe9f23e9aeebc1686ec12d21e0911f38d9c906/lookyloo/lookyloo.py#L704

Redirect

Some redirects are pointing to legitimate domains (google.com, the legitimate domain of the bank, ... ). We do not want to send takedown requests to these legitimate websites, even when the redirects before the landing page are malicious.

We will also sometimes see a legitimate domain in the middle of a chain of redirects. We (generally) don't want to send a notification in this case.

Solution: domain blocklist. If one of the URLs in the chain contains a blocklisted domain, skip all the contacts gathered for it.
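
A minimal sketch of that filter, assuming the blocklist is a plain set of domains and that a URL matches when its hostname is a blocklisted domain or one of its subdomains. `DOMAIN_BLOCKLIST`, `is_blocklisted` and `nodes_to_notify` are hypothetical names, not existing Lookyloo code:

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; the real entries would live in Lookyloo's config.
DOMAIN_BLOCKLIST = {"google.com", "example-bank.com"}


def is_blocklisted(url: str, blocklist: set[str] = DOMAIN_BLOCKLIST) -> bool:
    """True if the URL's hostname is a blocklisted domain or one of its subdomains."""
    hostname = (urlsplit(url).hostname or "").lower().rstrip(".")
    return any(hostname == d or hostname.endswith("." + d) for d in blocklist)


def nodes_to_notify(redirect_chain: list[str]) -> list[str]:
    """Keep only the URLs of the chain for which contacts should be gathered."""
    return [url for url in redirect_chain if not is_blocklisted(url)]
```

Matching on the parsed hostname rather than on the raw URL string avoids skipping a malicious URL that merely mentions a legitimate domain in its path or query string.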

Whois

The whois records are the main source of contact information, but they're often a bit crap. They will often contain outdated, incorrect, or no email at all, or the whois server is simply non-functional.

For some whois records, we have entries like abuse-c: AR13706-RIPE. In this case, we need to trigger a follow-up whois query on AR13706-RIPE to get the proper abuse contact point, and this one should take precedence over the other ones.
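
As a rough sketch of that follow-up, assuming the whois client is passed in as a callable (`whois_query` is hypothetical, as is the function name) and that the follow-up record exposes the address in an `abuse-mailbox:` attribute, as RIPE role objects do:

```python
import re
from typing import Callable

ABUSE_C_RE = re.compile(r"^abuse-c:[ \t]*(\S+)", re.IGNORECASE | re.MULTILINE)
ABUSE_MAILBOX_RE = re.compile(r"^abuse-mailbox:[ \t]*(\S+@\S+)", re.IGNORECASE | re.MULTILINE)
# Deliberately loose email pattern, good enough for a sketch.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def abuse_contacts(whois_text: str, whois_query: Callable[[str], str]) -> list[str]:
    """Return the emails of a whois record, abuse-c follow-up results first."""
    preferred: list[str] = []
    for handle in ABUSE_C_RE.findall(whois_text):
        # Second whois query on the handle itself (e.g. AR13706-RIPE).
        preferred += ABUSE_MAILBOX_RE.findall(whois_query(handle))
    others = EMAIL_RE.findall(whois_text)
    # dict.fromkeys removes duplicates while keeping the order of first appearance.
    return list(dict.fromkeys(preferred + others))
```

Passing the query function as a parameter keeps the parsing testable without a live whois server.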

Domains

Generally (there will be exceptions, welcome on the internetz), for domains, there is one whois server per TLD. This server will generally work, unless the TLD owner decides to be clever and do fun things such as rate limiting (.lu), or just be completely non-existent (.es).

Solution: fallbacks in such cases. Generally, that means hardcoding a relevant abuse email for the TLD and letting them figure out the actual abuse email for that domain in their database.

There are a few examples in the uwhois config (https://github.com/Lookyloo/uwhoisd/blob/main/extra/uwhoisd.ini#L64), but we should move that to lookyloo so we don't have to maintain contact lists in too many places.
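
One possible shape for such a fallback table once it lives in lookyloo, as a sketch only: the mapping name, the function name and the addresses are placeholders, not the real registry contacts.

```python
# Hypothetical per-TLD fallback contacts; the addresses are placeholders.
TLD_FALLBACKS = {
    "lu": "abuse@registry-lu.example",
    "es": "abuse@registry-es.example",
}


def contacts_with_fallback(hostname: str, whois_emails: list[str]) -> list[str]:
    """Use the whois emails when there are any, else the TLD fallback (if known)."""
    if whois_emails:
        return whois_emails
    tld = hostname.lower().rstrip(".").rpartition(".")[2]
    fallback = TLD_FALLBACKS.get(tld)
    return [fallback] if fallback else []
```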

IPs and ASNs

We currently don't seem to have blocklists for IPs or ASNs. This surprises me, but let's roll with that for now. Reminder: they will be discarded anyway if the domain of that IP is in the blocklist.

Emails gathered in the whois entry

As we said above, the whois records are not the best, so we need a few things:

  1. straight blocklist (regexes and full match): to discard all the emails ending with ripe.net for example, or the well-known broken email addresses that we know will bounce.
  2. replace lists: some email addresses are known to be outdated, but we know the proper new one, so when we match this specific email address, we replace it with the new one.
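
The two lists could be applied along these lines. This is a sketch with made-up entries and hypothetical names; applying the replacement before the blocklist is a design assumption, not something decided above.

```python
import re

# Hypothetical entries, for illustration only.
EMAIL_BLOCKLIST_EXACT = {"bounce@broken.example"}
EMAIL_BLOCKLIST_REGEXES = [re.compile(r"@(?:.+\.)?ripe\.net$", re.IGNORECASE)]
EMAIL_REPLACEMENTS = {"old-abuse@hoster.example": "abuse@hoster.example"}


def sanitize_emails(emails: list[str]) -> list[str]:
    """Replace known-outdated addresses, drop blocklisted ones, remove duplicates."""
    cleaned: list[str] = []
    for email in emails:
        email = email.strip().lower()
        # Replace first, so a replacement target is still checked against the blocklist.
        email = EMAIL_REPLACEMENTS.get(email, email)
        if email in EMAIL_BLOCKLIST_EXACT:
            continue
        if any(regex.search(email) for regex in EMAIL_BLOCKLIST_REGEXES):
            continue
        if email not in cleaned:
            cleaned.append(email)
    return cleaned
```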

SecurityTXT

If it exists, we're good, this entry will most probably be correct - https://securitytxt.org/
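
For reference, a security.txt file lists its contact points in `Contact:` fields, which may be `mailto:` URIs or web URLs. A small sketch that keeps only the email contacts from an already-fetched file (the function name is hypothetical, and fetching is left out):

```python
def securitytxt_emails(content: str) -> list[str]:
    """Extract the mailto: contacts from the text of a security.txt file."""
    emails: list[str] = []
    for line in content.splitlines():
        name, sep, value = line.partition(":")
        # Field names are case-insensitive; comment lines and other fields are skipped.
        if not sep or name.strip().lower() != "contact":
            continue
        value = value.strip()
        if value.lower().startswith("mailto:"):
            # Drop the scheme and any ?subject=... style parameters.
            emails.append(value[len("mailto:"):].split("?", 1)[0])
    return emails
```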

IPFS

When we find an IPFS HTTP header for a node, there is an abuse email address we need to contact (currently hardcoded in lookyloo: https://github.com/Lookyloo/lookyloo/blob/9ffe9f23e9aeebc1686ec12d21e0911f38d9c906/lookyloo/lookyloo.py#L739)

Those are all the sources of contact information for now, but we need to consider adding more in the future.


On the Lookyloo side, we need an endpoint that will:

  1. Iterate over all the redirects up to the landing page, filter out nodes with domains in the domains blocklist
  2. Gather all the relevant contact information from whois as described above, and discard/replace the email addresses accordingly. It will also add the SecurityTXT and IPFS abuse emails if needed.
  3. Once we have the sanitized list of all the email addresses for each of the nodes, we put them all together in a set (make them unique), and return that as a list for the takedown request.
  4. The endpoint should probably have a detailed switch that returns a dict with the current format, so we can still get the specific contact points for specific nodes if needed.
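
Steps 3 and 4 could look roughly like this, assuming the earlier steps have already produced a dict of sanitized emails keyed by node. The function name and the dict layout are hypothetical, and the real detailed output would keep the format currently used by lookyloo.takedown_details:

```python
from typing import Union


def takedown_contacts(per_node: dict[str, list[str]],
                      detailed: bool = False) -> Union[dict[str, list[str]], list[str]]:
    """Merge the sanitized emails of every node, or return them per node."""
    if detailed:
        # Detailed switch: keep the per-node view.
        return per_node
    # Put everything in a set to make the emails unique, then return a sorted list.
    return sorted({email for emails in per_node.values() for email in emails})
```

Sorting the merged set is an arbitrary choice that simply makes the output stable between calls.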