Right now we are checking for the presence of subdomains in a directory entry's landing page URL with this function:
```python
def validate_subdomain(url):
    """Is the landing page on a subdomain"""
    if len(url.split('.')) > 2 and url.split('.')[0] != 'https://www':
        return True
    else:
        return False
```
In my view, detecting a subdomain by counting `.` characters leads to false positives when the registered domain has a multi-part suffix, like `bbc.co.uk`. Also, as implemented, it splits the whole URL rather than just the hostname, so any dot elsewhere in the URL triggers it too, e.g. `https://nytimes.com/tips.html`.
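To make those failure modes concrete, here is the current check run against a few illustrative URLs (these examples are mine, not from our data):

```python
def validate_subdomain(url):
    """Is the landing page on a subdomain"""
    if len(url.split('.')) > 2 and url.split('.')[0] != 'https://www':
        return True
    else:
        return False

# False positive: co.uk suffix makes three dot-separated parts
print(validate_subdomain('https://bbc.co.uk'))            # True
# False positive: the dot in the path counts too
print(validate_subdomain('https://nytimes.com/tips.html'))  # True
# Intended behavior: bare www is not treated as a subdomain
print(validate_subdomain('https://www.example.com'))      # False
# True positive
print(validate_subdomain('https://blog.example.com'))     # True
```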
This concern is potentially blocking #494. We want good subdomain detection to ensure the redirects remain on the same overall domain.
I've done a little looking into the general problem of finding subdomains, and it seems like there's not a way to determine this with perfect accuracy just by looking at the domain string. Each country's domain registrar operates by different rules. It may be possible to cover most of the cases we expect to encounter using a regex with the most common second-level domains listed explicitly.
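As a rough illustration of the regex approach, here is a sketch. The suffix list is deliberately incomplete, and `has_subdomain` is a hypothetical name, not something in our codebase:

```python
import re
from urllib.parse import urlparse

# Deliberately incomplete list of common second-level domains;
# covering "most cases we expect" means extending this explicitly.
COMMON_SLDS = re.compile(r'\.(?:co|com|org|net|ac|gov|edu)\.[a-z]{2}$')

def has_subdomain(url):
    host = urlparse(url).hostname or ''
    labels = host.split('.')
    # If the host ends in a known multi-part suffix (e.g. co.uk),
    # the registered domain uses three labels instead of two.
    expected = 3 if COMMON_SLDS.search(host) else 2
    # Keep the existing convention of not counting a bare "www".
    return len(labels) > expected and labels[0] != 'www'
```

Note that parsing out the hostname with `urlparse` first also fixes the path-dot false positive, independently of how we handle suffixes.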
A step up from that would be to use a library that syncs with such a list, e.g. the Public Suffix List, which Mozilla started for use in browsers. Some Python projects use it for URL parsing, e.g. tldextract. That would likely be more accurate than maintaining a limited version ourselves, but would add another dependency.
Note that this issue is just about detecting subdomains. Whether or not the presence of one causes a warning on a given directory entry page has been addressed as part of #514.