MC874 commented 2 years ago

💡 Summary

Provide alternate techniques for finding the CDN behind target domains and/or simplify the use of the CDN Engine. The latest engine doesn't quite cut it in terms of speed: scanning takes anywhere from 1 to 15 s per domain, which is quite slow; the ideal target would be around 0.5-1 s per domain. Using threading with the maximum cpu_count doesn't help much either.

Motivation and context

This is a problem when scanning a list of subdomains, because some subdomains can have a different CDN than the parent domain, which results in a giant number of subdomains to scan.

For example:
Parent Domain: line.me
Parent CDN: Amazon CloudFront
Subdomain: blog.clova.line.me
Subdomain CDN: CloudFlare

Why does this work belong in this project?

This would be useful for scanning giant lists of subdomains and would enhance the CDN Engine. For example, I have a target list of approximately 166,000 lines, and scanning takes more than 1 s per domain, which adds up to thousands of minutes before the scan finishes.

Scan Times

Implementation notes

An alternative could be to use a public database such as HackerTarget or DnsDumpster to return the CDN value. Another option is to provide a parameter to skip specified steps, such as --skip whois; this could also cut scan times but is not recommended. Or perhaps enhance the current CDN Engine?

Acceptance criteria

[ ] Target time: 0-1 s per domain

S4lt5 commented 1 year ago

I can confirm that running a single target line.me took a whopping 51s

S4lt5 commented 1 year ago

I spent a good amount of time with this today, and came to the following rough conclusions:

Switching to asyncio inside of the executor doesn't really help much, there's a lot of long blocking IO going on.
Adjusting the threads and timeout helps a lot, in all modes.

I was able to get a roughly 100% speed increase by using 30 threads and timeout of 30s instead of 60s.

I think the timeout is far and away the most important factor. Above about 30 threads I would get WHOIS lookup failures, so I assume I was getting rate limited by that service.
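
A minimal sketch of the tuning described above, assuming the per-domain lookup is available as a callable that accepts its own timeout. The names scan_domains and lookup are hypothetical and are not part of findcdn; the defaults of 30 threads and 30 s simply mirror the numbers reported in this comment.

```python
from concurrent.futures import ThreadPoolExecutor


def scan_domains(domains, lookup, threads=30, timeout=30):
    """Run lookup(domain, timeout) for every domain on a bounded thread pool.

    lookup is a hypothetical callable standing in for the real per-domain
    check. Exceptions are captured per domain so one failure (for example a
    throttled WHOIS query) does not abort the whole run.
    """

    def safe_lookup(domain):
        try:
            return domain, lookup(domain, timeout)
        except Exception as exc:  # keep the error next to the domain it belongs to
            return domain, exc

    # max_workers caps concurrency; the timeout is enforced inside lookup itself
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return dict(pool.map(safe_lookup, domains))
```

Keeping the thread count as a parameter makes it easy to stay just under the point where the WHOIS service starts rejecting queries.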

S4lt5 commented 1 year ago

For my use case, anything above approximately 30 threads causes me to run into a WHOIS throttling issue, so there's no real reason for me to boost performance beyond 100 domains/min unless I can get unrestricted WHOIS queries going.

Similarly, with that as an upper limit for my own performance, even when swapping out the threading or reorganizing the workflow, I don't really beat the original time because I still have to wait for all the WHOIS queries.

If I need to go further, I'd probably break it across several containers/hosts/pods/whatever and add some launcher to split the input list into multiple chunks and reconstitute the output at the end, as that's almost certainly guaranteed to work and would be a cheap/easy solution.
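
The split-and-reconstitute idea could be sketched roughly as follows. Both helper names are hypothetical, and the sketch assumes each worker returns a dict keyed by domain; how the chunks reach the containers or hosts is left out.

```python
def split_chunks(items, n_chunks):
    """Split items into n_chunks contiguous chunks whose sizes differ by at most one."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # the first `extra` chunks take one additional item each
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def merge_results(partials):
    """Combine per-chunk result dicts (domain -> result) into a single dict."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged
```

Because every domain lands in exactly one chunk, merging the partial outputs is a plain dictionary update with no conflict handling.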

Pascal-0x90 commented 1 year ago

Thoughts

I am looking to add something to the scanning engine in the near future but the concept would be:

Have address pools of top X CDNs (cloudflare, cloudfront, etc)
Resolve given domain to IP or IPs via DNS query
Cross check IP addresses discovered using known IP pools.
If nothing found, fallback to using other methods.

Then we can avoid using the other methods since we already know what the CDN is. Then if we want information from those sources, we can maybe add a force flag to make it still run those checks.
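
A rough sketch of that control flow, with the force flag included. Everything here is hypothetical naming: ip_pool_check stands for the IP-range lookup and fallback_checks for the slower existing methods, each taking a domain and returning a list of CDN names.

```python
def identify_cdn(domain, ip_pool_check, fallback_checks, force=False):
    """Try the fast IP-pool check first and fall back only when needed.

    With force=True every fallback check still runs, even if a CDN has
    already been found, and all distinct names are collected.
    """
    found = list(ip_pool_check(domain))
    if found and not force:
        return found  # fast path: the IP ranges already identified the CDN
    for check in fallback_checks:
        for cdn in check(domain):
            if cdn not in found:
                found.append(cdn)
        if found and not force:
            break  # stop at the first fallback that yields a result
    return found
```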

Results

I did some testing with this and got the IP blocks for the following CDNs:

Incapsula
Cloudflare
Akamai
Cloudfront
CacheFly
Airee
Edgecast
MaxCDN
Beluga
Limelight
Fastly
Myracloud
Azure
Clever-cloud
FastCDN

Did some preliminary testing and got the following results (note these do not include the fallback):

FINDCDN: 12.271649267000612
RESULTS:
{'date': '11/23/2022, 16:54:17', 'cdn_count': '4', 'domains': {'line.me': {'IP': "'147.92.243.206', '147.92.146.166'", 'cdns': "'.cloudfront.net'", 'cdns_by_names': "'Cloudfront'"}, 'asu.edu': {'IP': "'151.101.66.133', '151.101.194.133', '151.101.2.133', '151.101.130.133', '151.101.190.133'", 'cdns': "'.fastly.net', '.nocookie.net'", 'cdns_by_names': "'Fastly', 'Fastly'"}, 'cisa.gov': {'IP': "'104.73.243.204'", 'cdns': "'.edgekey.net', '.akamaitechnologies.fr'", 'cdns_by_names': "'Akamai', 'Akamai'"}, 'www.dmv.ca.gov': {'IP': "'18.155.202.108', '18.155.202.118', '18.155.202.62', '18.155.202.60'", 'cdns': "'.cloudfront.net'", 'cdns_by_names': "'Cloudfront'"}}}
CUSTOM: 0.005282060999888927
RESULTS
[{'line.me': [None]}, {'asu.edu': ['Fastly']}, {'cisa.gov': ['Akamai']}, {'www.dmv.ca.gov': ['Cloudfront']}]

Findcdn was run with all default settings. Here is the test code:

from dns.resolver import NXDOMAIN, NoAnswer, NoNameservers, Resolver, Timeout, resolve
from ipaddress import ip_network, ip_address
from timeit import default_timer as timer
from findcdn import main

CDN_RANGES = {
    "Incapsula": [
        ip_network("199.83.128.0/21"),
        ip_network("198.143.32.0/19"),
        ip_network("149.126.72.0/21"),
        ip_network("103.28.248.0/22"),
        ip_network("45.64.64.0/22"),
        ip_network("192.230.64.0/18"),
        ip_network("107.154.0.0/16"),
        ip_network("45.223.0.0/16"),
    ],
    # ... (more for each CDN mentioned)
}

def check_cdn(domain):
    # Resolve the domain and de-duplicate the returned addresses
    resp = resolve(domain)
    ips = []
    for record in resp:
        ip = ip_address(str(record.address))
        if ip not in ips:
            ips.append(ip)
    # For each IP, record the first CDN whose ranges contain it (None if no match)
    cdns = []
    for ip in ips:
        found = None
        for cdn, ranges in CDN_RANGES.items():
            if any(ip in rng for rng in ranges):
                found = cdn
                break
        cdns.append(found)
    return list(set(cdns))

DOMAINS = [
    "line.me",
    "asu.edu",
    "cisa.gov",
    "www.dmv.ca.gov"
]

# Test using Findcdn 
import json

start = timer()

result = main(DOMAINS)

stop = timer()

print(f"FINDCDN: {stop - start}")
print("RESULTS:")
print(json.loads(result))

# Test using custom
start = timer()
res = []
for dom in DOMAINS:
    res.append({dom : check_cdn(dom)})
stop = timer()

print(f"CUSTOM: {stop - start}")
print("RESULTS")
print(res)

Pascal-0x90 commented 1 year ago

To follow up on the above comment:

For Cloudfront blocks I used: https://d7uri8nf7uskq.cloudfront.net/tools/list-cloudfront-ips
For Fastly blocks I used: https://api.fastly.com/public-ip-list
For Azure I used https://management.azure.com/providers/Microsoft.Cdn/edgenodes?api-version=2021-06-01 (requires auth token though)
Then for everything else I either used the ASN for the specific CDN via ipinfo.io or publicly available information on their CDN blocks.
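
As a sketch of how the published lists could be turned into ip_network pools: the JSON key names below ("addresses" and "ipv6_addresses" for Fastly, "CLOUDFRONT_GLOBAL_IP_LIST" and "CLOUDFRONT_REGIONAL_EDGE_IP_LIST" for Cloudfront) are my assumption about the response shape of the two URLs above and should be checked against the live responses. The function names are hypothetical.

```python
import json
from ipaddress import ip_network
from urllib.request import urlopen


def fetch_json(url, timeout=10):
    """Download and decode a JSON document (network access required)."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def ranges_from_fastly(payload):
    # Assumed shape: {"addresses": [...], "ipv6_addresses": [...]}
    cidrs = payload.get("addresses", []) + payload.get("ipv6_addresses", [])
    return [ip_network(cidr, strict=False) for cidr in cidrs]


def ranges_from_cloudfront(payload):
    # Assumed shape: two lists of CIDR strings under the keys below
    cidrs = payload.get("CLOUDFRONT_GLOBAL_IP_LIST", []) + payload.get(
        "CLOUDFRONT_REGIONAL_EDGE_IP_LIST", []
    )
    return [ip_network(cidr, strict=False) for cidr in cidrs]
```

For example, ranges_from_fastly(fetch_json("https://api.fastly.com/public-ip-list")) would produce a list suitable for a "Fastly" entry in CDN_RANGES, provided the assumed keys are right.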

This still needs some work. Notice the failure to identify line.me as being part of Cloudfront. That is because my PoC only found 147.92.243.206, which is the main IP address of Line in Tokyo with no CDN in the way. This is a case where a fallback method would be useful. For line.me, the CDN is discovered via the headers for the website.

Pascal-0x90 commented 1 year ago

CDNE v2 Notes

Some status updates. Went through and changed a bunch in how the v1 engine worked. Now:

All analyses are modularized
If one engine finds the CDN first, it skips the rest and moves to the next domain (single threaded)

Note this is still in the works, nothing here is finalized. I still need to get tests working for this.

├── analyzers
│   ├── analyzers.yml
│   ├── base.py
│   ├── __cdn_config__.py
│   ├── cnamelyzer.py
│   ├── httplyzer.py
│   ├── __init__.py
│   ├── iplyzer.py
│   └── whoislyzer.py
└── cdnEngine.py

With the cdnEngine being:

from findcdn.cev2.analyzers import ANALYZERS

def analyze_domain(domain: str):
    results = []  # stays empty if no analyzer finds a CDN
    for name, analyzer in ANALYZERS.items():
        results, error_code = analyzer['class'].run(domain)
        # print(f"{name} ==> {results} {error_code}")
        if len(results) > 0:
            break # CDN has been found
    return results

And the __init__.py for analyzers does some dynamic import magic:

# Standard Python Libraries
from importlib import util
from os import path

# Third-Party Libraries
from yaml import safe_load

# Get path where the modules should be
PWD = path.dirname(path.realpath(__file__))

# Load in analyzers config file
with open(f"{PWD}/analyzers.yml", "r") as fp:
    analyzers = safe_load(fp)

ANALYZERS = {}
for analyzer, attribs in analyzers['analyzers'].items():
    spec = util.spec_from_file_location(attribs['classname'], f"{PWD}/{attribs['filename']}")
    module = util.module_from_spec(spec)
    spec.loader.exec_module(module)
    ANALYZERS[attribs['classname']] = {
        "class": getattr(module, attribs['classname'])(), # instantiate the class here
        "arg": attribs['argument']
    }

Not sure how well liked this type of importing is. I would say I like it because all someone needs to do to extend findcdn with another module is add another file to the analyzers folder and add an entry to the analyzers.yml file, without having to worry about the code imports too much. I'm not biased either way between this method and just explicitly defining the imports in __init__.py.
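
To illustrate what a drop-in module might look like under this scheme, here is a hypothetical analyzer. The only parts taken from the code above are the no-argument constructor and run(domain) returning (results, error_code); the class name, the pattern table, the injected lookup, and the meaning of the error code (0 for success, 1 for a failed lookup) are all assumptions.

```python
import re

# Hypothetical pattern table in the style of the cdn_config regex entries
CNAME_PATTERNS = {
    r".*\.cloudfront\.net": "Cloudfront",
    r".*\.fastly\.net": "Fastly",
}


def _dns_cnames(domain):
    # Default lookup via dnspython, imported lazily so the class itself loads without it
    from dns.resolver import resolve

    return [str(record.target) for record in resolve(domain, "CNAME")]


class CnameAnalyzer:
    def __init__(self, lookup_cnames=None):
        # The loader instantiates analyzers with no arguments; tests can inject a fake lookup
        self.lookup_cnames = lookup_cnames or _dns_cnames

    def run(self, domain):
        """Return (list of CDN names, error_code)."""
        try:
            cnames = self.lookup_cnames(domain)
        except Exception:
            return [], 1  # lookup failed; the engine can move on to the next analyzer
        found = []
        for cname in cnames:
            for pattern, cdn in CNAME_PATTERNS.items():
                # strip the trailing dot of fully qualified names before matching
                if re.fullmatch(pattern, cname.rstrip(".")) and cdn not in found:
                    found.append(cdn)
        return found, 0
```

Injecting the lookup keeps the matching logic testable without network access, which may help with getting tests working for the v2 engine.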

V1 vs V2 Results

TL;DR

V2: 12.519790108985035 s (single "thread")
V1: 34.34032432202366 s (4 co-routine workers)

Detailed

For this I was curious what the runtime of single threaded V2 would be against the co-routine V1 with 4 workers. These are the results of running the two against each other for the following set of domains:

www.asu.edu
www.cisa.gov
www.netflix.com
www.leagueoflegends.com
github.com
www.cdnplanet.com
pascal-0x90.github.io
line.me
blog.clova.line.me
www.achp.gov
twitter.com
platform.twitter.com

Starting CDN Engine v2
================================
www.asu.edu ==> ['Fastly']
www.cisa.gov ==> ['Akamai']
www.netflix.com ==> []
www.leagueoflegends.com ==> ['Akamai']
github.com ==> []
www.cdnplanet.com ==> ['Cloudflare']
pascal-0x90.github.io ==> ['Fastly']
line.me ==> ['Cloudfront']
blog.clova.line.me ==> ['Cloudflare']
www.achp.gov ==> []
twitter.com ==> ['Twitter']
platform.twitter.com ==> ['Edgecast']

v2 Finished in: 12.519790108985035 s

================================
Starting CDN Engine v1
================================
{
    "date": "11/25/2022, 21:19:38",
    "cdn_count": "10",
    "domains": {
        "www.asu.edu": {
            "IP": "'151.101.42.133'",
            "cdns": "'.fastly.net', '.nocookie.net'",
            "cdns_by_names": "'Fastly', 'Fastly'"
        },
        "www.cisa.gov": {
            "IP": "'104.73.243.204'",
            "cdns": "'.edgekey.net', '.akamaitechnologies.fr'",
            "cdns_by_names": "'Akamai', 'Akamai'"
        },
        "www.leagueoflegends.com": {
            "IP": "'104.124.142.217'",
            "cdns": "'.edgekey.net', '.akamaitechnologies.fr'",
            "cdns_by_names": "'Akamai', 'Akamai'"
        },
        "www.cdnplanet.com": {
            "IP": "'172.67.69.93', '104.26.8.12', '104.26.9.12'",
            "cdns": "'.cloudflare.com'",
            "cdns_by_names": "'Cloudflare'"
        },
        "pascal-0x90.github.io": {
            "IP": "'185.199.111.153', '185.199.110.153', '185.199.109.153', '185.199.108.153'",
            "cdns": "'.nocookie.net'",
            "cdns_by_names": "'Fastly'"
        },
        "line.me": {
            "IP": "'147.92.243.206', '147.92.146.166'",
            "cdns": "'.cloudfront.net'",
            "cdns_by_names": "'Cloudfront'"
        },
        "blog.clova.line.me": {
            "IP": "'199.60.103.28', '199.60.103.228'",
            "cdns": "'.cloudflare.com'",
            "cdns_by_names": "'Cloudflare'"
        },
        "www.achp.gov": {
            "IP": "'52.222.85.79', '3.30.138.53'",
            "cdns": "'.amazonaws.com'",
            "cdns_by_names": "'Amazon AWS'"
        },
        "twitter.com": {
            "IP": "'104.244.42.193', '104.244.42.129', '104.244.42.65', '104.244.42.1'",
            "cdns": "'.twimg.com'",
            "cdns_by_names": "'Twitter'"
        },
        "platform.twitter.com": {
            "IP": "'192.229.163.25'",
            "cdns": "'.wac.', 'edgecastcdn.net', '.v5cdn.net'",
            "cdns_by_names": "'EdgeCast', 'EdgeCast', 'EdgeCast'"
        }
    }
}

v1 Finished in: 34.34032432202366 s

MC874 commented 1 year ago

Why does www.netflix.com result in an empty CDN?

> www.netflix.com
> 18.200.8.190
> EC2-18-200-8-190.EU-WEST-1.COMPUTE.AMAZONAWS.COM

This is based on https://subdomainfinder.c99.nl/ and https://bgpview.io/ip/18.200.8.190

Pascal-0x90 commented 1 year ago

At least according to CDN Planet's CDN Finder, Netflix does not use a CDN. Another tool I use to validate, Wappalyzer, does not identify any CDN being used by www.netflix.com either.

On that note though, yes, the site itself may be hosted in AWS, but that is not indicative of a CDN. You would need to use Cloudfront and put it in front of your EC2 instance by setting up an ALB.

If this logic is flawed and folks think that AWS identifiers indicate a CDN, I can restore that identifier; otherwise, I took out the line that says

".*\.amazonaws\.com": "Amazon AWS"

in cdn_config since that had too many false positives.

I would also like to note, it does not mean they are not using a load balancer, it just is not a CDN. So it is possible to see X-CACHE type tags but it does not mean it is part of a CDN.

Edit 1

I would also like to note that if you look at subdomain finder, you will notice some of the subdomains do actually use CDNs.

cdn.netflix.com               ==>              Akamai
jsapi.netflix.com             ==>              Akamai
image.netflix.com             ==>              Akamai
mcdn.netflix.com              ==>              Akamai
ncds.netflix.com              ==>              Akamai
partnertools.nrd.netflix.com  ==>          Cloudfront
top10.netflix.com             ==>          Cloudfront
creativeservices.netflix.com  ==>          Cloudflare
updates.netflix.com           ==>          Cloudflare
openconnect.netflix.com       ==>          Cloudflare
devices.netflix.com           ==>          Cloudflare
roomeo.netflix.com            ==>          Cloudfront
lacounty.netflix.com          ==>          Cloudflare
cache.netflix.com             ==>          Cloudflare

S4lt5 commented 1 year ago

Nice! Seems like a very positive change, let me know if I can help somehow!

cisagov / findcdn

[Enhancement] Fast Scanning #43
