Open MC874 opened 2 years ago
I can confirm that running a single target, line.me, took a whopping 51s.
I spent a good amount of time with this today, and came to the following rough conclusions:
I was able to get a roughly 100% speed increase by using 30 threads and a timeout of 30s instead of 60s.
I think the timeout is far and away the most important factor. Above about 30 threads I would get WHOIS lookup failures, so I assume I was being rate limited by that service.
For my use case, anything above approximately 30 threads causes me to run into a WHOIS throttling issue, so there's no real reason for me to push performance beyond 100 domains/min unless I can get unrestricted WHOIS queries going.
Similarly, with that as an upper limit for my own performance, even when swapping out the threading or reorganizing the workflow, I don't really beat the original time because I still have to wait for all the WHOIS queries.
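For anyone who wants to reproduce the "cap the threads" setup, here is a minimal sketch of a bounded worker pool. It is not findcdn's own code: `scan_all` and the `scan_one` callable are hypothetical names, and the 30s timeout is assumed to be enforced inside `scan_one` (for example by the DNS/WHOIS client), not by the pool.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Above roughly this many threads, WHOIS lookups started failing in the tests above.
MAX_WORKERS = 30


def scan_all(domains, scan_one, max_workers=MAX_WORKERS):
    """Run scan_one(domain) on a bounded thread pool and collect results by domain.

    scan_one is a hypothetical callable; it is assumed to apply its own
    per-lookup timeout and to raise an exception on failure.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(scan_one, domain): domain for domain in domains}
        for future in as_completed(futures):
            domain = futures[future]
            try:
                results[domain] = future.result()
            except Exception as exc:  # e.g. a throttled WHOIS lookup
                results[domain] = f"error: {exc}"
    return results
```

Keeping the failures in the result dict makes it easy to see how many lookups were throttled at a given thread count.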
If I need to go further, I'd probably break it across several containers/hosts/pods/whatever and add some launcher to split the input list into multiple chunks and reconstitute the output at the end, as that is almost certain to work and would be a cheap, easy solution.
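The split-and-reconstitute part of that launcher could look roughly like the sketch below. Both helper names are made up for illustration, and it assumes each worker returns a dict keyed by domain; how the chunks actually reach the containers or hosts is left out.

```python
def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous pieces of near-equal size."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional item each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def merge_results(partials):
    """Combine per-chunk result dicts (domain -> result) into one dict."""
    merged = {}
    for partial in partials:
        merged.update(partial)
    return merged
```

Each chunk would be written to its own input file, scanned independently, and the per-chunk outputs passed to `merge_results` at the end.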
I am looking to add something to the scanning engine in the near future but the concept would be:
Then we can avoid using the other methods since we already know what the CDN is. Then if we want information from those sources, we can maybe add a force flag to make it still run those checks.
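A rough sketch of that short-circuit idea with a force flag, under the assumption that every check is a callable taking a domain and returning a list of CDN names. `detect_cdn`, `ip_check`, and `fallback_checks` are hypothetical names, not part of findcdn.

```python
def detect_cdn(domain, ip_check, fallback_checks, force=False):
    """Try the cheap IP-range check first; only run the slower checks if needed.

    ip_check and each entry of fallback_checks are hypothetical callables that
    take a domain and return a list of CDN names (empty if nothing was found).
    With force=True the slower checks run even when the IP check already hit.
    """
    found = list(ip_check(domain))
    if found and not force:
        return found  # CDN already known, skip the expensive lookups
    for check in fallback_checks:
        for cdn in check(domain):
            if cdn not in found:
                found.append(cdn)
    return found
```

With `force=False` the WHOIS-style checks are skipped entirely for any domain the IP ranges already identify, which is where the time savings would come from.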
I did some testing with this and got the IP blocks for the following CDNs:
Did some preliminary testing and got the following results (note these do not include the fallback):
FINDCDN: 12.271649267000612
RESULTS:
{'date': '11/23/2022, 16:54:17', 'cdn_count': '4', 'domains': {'line.me': {'IP': "'147.92.243.206', '147.92.146.166'", 'cdns': "'.cloudfront.net'", 'cdns_by_names': "'Cloudfront'"}, 'asu.edu': {'IP': "'151.101.66.133', '151.101.194.133', '151.101.2.133', '151.101.130.133', '151.101.190.133'", 'cdns': "'.fastly.net', '.nocookie.net'", 'cdns_by_names': "'Fastly', 'Fastly'"}, 'cisa.gov': {'IP': "'104.73.243.204'", 'cdns': "'.edgekey.net', '.akamaitechnologies.fr'", 'cdns_by_names': "'Akamai', 'Akamai'"}, 'www.dmv.ca.gov': {'IP': "'18.155.202.108', '18.155.202.118', '18.155.202.62', '18.155.202.60'", 'cdns': "'.cloudfront.net'", 'cdns_by_names': "'Cloudfront'"}}}
CUSTOM: 0.005282060999888927
RESULTS
[{'line.me': [None]}, {'asu.edu': ['Fastly']}, {'cisa.gov': ['Akamai']}, {'www.dmv.ca.gov': ['Cloudfront']}]
So everything is left at its defaults on the findcdn side. Here is the test code:
from dns.resolver import NXDOMAIN, NoAnswer, NoNameservers, Resolver, Timeout, resolve
from ipaddress import ip_network, ip_address
from timeit import default_timer as timer
from findcdn import main
CDN_RANGES = {
    "Incapsula": [
        ip_network("199.83.128.0/21"),
        ip_network("198.143.32.0/19"),
        ip_network("149.126.72.0/21"),
        ip_network("103.28.248.0/22"),
        ip_network("45.64.64.0/22"),
        ip_network("192.230.64.0/18"),
        ip_network("107.154.0.0/16"),
        ip_network("45.223.0.0/16"),
    ],
    # ... (more for each cdn mentioned)
}
def check_cdn(domain):
    # Resolve the domain's IP addresses
    resp = resolve(domain)
    ips = []
    for record in resp:
        ip = ip_address(str(record.address))
        # Compare ip_address objects so duplicates are actually skipped
        if ip not in ips:
            ips.append(ip)
    # For each IP, record the first CDN whose ranges contain it (None if no match)
    cdns = []
    for ip in ips:
        found = None
        for cdn, ranges in CDN_RANGES.items():
            if any(ip in rng for rng in ranges):
                found = cdn
                break
        cdns.append(found)
    return list(set(cdns))
DOMAINS = [
    "line.me",
    "asu.edu",
    "cisa.gov",
    "www.dmv.ca.gov",
]
# Test using Findcdn
import json
start = timer()
result = main(DOMAINS)
stop = timer()
print(f"FINDCDN: {stop - start}")
print("RESULTS:")
print(json.loads(result))
# Test using custom
start = timer()
res = []
for dom in DOMAINS:
    res.append({dom: check_cdn(dom)})
stop = timer()
print(f"CUSTOM: {stop - start}")
print("RESULTS")
print(res)
To follow up on the above comment:
This still needs some work. Notice the failure to identify line.me as being behind Cloudfront. That is because my PoC only found 147.92.243.206, which is the main IP address of Line in Tokyo with no CDN in the way. This is a case where a fallback method would be useful. For line.me, the CDN is discovered via the website's headers.
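As a sketch of what a header-based fallback could look like: the function below takes an already-fetched dict of response headers and matches it against a small signature table. The table is an illustrative assumption, not findcdn's actual list; it only uses headers I am fairly confident these two providers add (`X-Amz-Cf-Id` and a `Via` value mentioning CloudFront, `CF-RAY` and `Server: cloudflare`).

```python
# Illustrative signatures only (an assumption, not findcdn's real table).
# Each entry is (header name, substring the value must contain or None for "any value").
HEADER_SIGNATURES = {
    "Cloudfront": [("x-amz-cf-id", None), ("via", "cloudfront")],
    "Cloudflare": [("cf-ray", None), ("server", "cloudflare")],
}


def cdns_from_headers(headers):
    """Return the CDN names whose signatures appear in a dict of response headers."""
    lowered = {name.lower(): str(value).lower() for name, value in headers.items()}
    found = []
    for cdn, signatures in HEADER_SIGNATURES.items():
        for name, needle in signatures:
            value = lowered.get(name)
            if value is not None and (needle is None or needle in value):
                found.append(cdn)
                break  # one matching signature per CDN is enough
    return found
```

The headers themselves would come from whatever HTTP client the analyzer uses; that request is deliberately left out here.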
Some status updates. Went through and changed a bunch in how the v1 engine worked. Now:
Note this is still in the works, nothing here is finalized. I still need to get tests working for this.
├── analyzers
│  ├── analyzers.yml
│  ├── base.py
│  ├── __cdn_config__.py
│  ├── cnamelyzer.py
│  ├── httplyzer.py
│  ├── __init__.py
│  ├── iplyzer.py
│  └── whoislyzer.py
└── cdnEngine.py
With the cdnEngine being:
from findcdn.cev2.analyzers import ANALYZERS


def analyze_domain(domain: str):
    results = []
    # Run each analyzer in order and stop at the first one that finds a CDN
    for analyzer, entry in ANALYZERS.items():
        results, error_code = entry['class'].run(domain)
        # print(f"{analyzer} ==> {results} {error_code}")
        if len(results) > 0:
            break  # CDN has been found
    return results
And the __init__.py for analyzers does some dynamic import magic:
# Standard Python Libraries
from importlib import util
from os import path

# Third-Party Libraries
from yaml import safe_load

# Internal Libraries
# Get path where the modules should be
PWD = path.dirname(path.realpath(__file__))

# Load in analyzers config file
with open(f"{PWD}/analyzers.yml", "r") as fp:
    analyzers = safe_load(fp)

ANALYZERS = {}
for analyzer, attribs in analyzers['analyzers'].items():
    # Import each analyzer module directly from its file path
    spec = util.spec_from_file_location(attribs['classname'], f"{PWD}/{attribs['filename']}")
    module = util.module_from_spec(spec)
    spec.loader.exec_module(module)
    ANALYZERS[attribs['classname']] = {
        "class": getattr(module, attribs['classname'])(),  # instantiate the class here
        "arg": attribs['argument']
    }
Not sure how well liked this type of importing is. I would say I like it, because all someone needs to do to extend findcdn with another module is add another file to the analyzers folder and add an entry to the analyzers.yml file, without having to worry about the code imports too much. I'm not biased either way between this method and just explicitly defining the imports in __init__.py.
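For readers less familiar with `importlib`, here is a self-contained demo of the file-based loading technique on its own. `load_class` and the `DemoLyzer` plugin are hypothetical; the plugin is written to a temporary folder standing in for the analyzers folder, and it only mimics the `run(domain)` returning `(results, error_code)` shape used above.

```python
import tempfile
from importlib import util
from pathlib import Path


def load_class(file_path, class_name):
    """Import file_path as a module and return the attribute named class_name."""
    spec = util.spec_from_file_location(class_name, file_path)
    module = util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return getattr(module, class_name)


# Hypothetical analyzer file, created on the fly for the demo.
with tempfile.TemporaryDirectory() as folder:
    plugin = Path(folder) / "demolyzer.py"
    plugin.write_text(
        "class DemoLyzer:\n"
        "    def run(self, domain):\n"
        "        return [], 0\n"
    )
    analyzer = load_class(str(plugin), "DemoLyzer")()
    demo_result = analyzer.run("example.com")
```

The trade-off versus explicit imports is that a typo in the YAML only shows up at load time, so a clear error message around `getattr` would probably be worth adding.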
V2: 12.519790108985035 s (single "thread")
V1: 34.34032432202366 s (4 coroutine workers)
For this I was curious how the runtime of single-threaded V2 would compare against the coroutine-based V1 with 4 workers. These are the results of running the two against each other for the following set of domains:
Starting CDN Engine v2
================================
www.asu.edu ==> ['Fastly']
www.cisa.gov ==> ['Akamai']
www.netflix.com ==> []
www.leagueoflegends.com ==> ['Akamai']
github.com ==> []
www.cdnplanet.com ==> ['Cloudflare']
pascal-0x90.github.io ==> ['Fastly']
line.me ==> ['Cloudfront']
blog.clova.line.me ==> ['Cloudflare']
www.achp.gov ==> []
twitter.com ==> ['Twitter']
platform.twitter.com ==> ['Edgecast']
v2 Finished in: 12.519790108985035 s
================================
Starting CDN Engine v1
================================
{
"date": "11/25/2022, 21:19:38",
"cdn_count": "10",
"domains": {
"www.asu.edu": {
"IP": "'151.101.42.133'",
"cdns": "'.fastly.net', '.nocookie.net'",
"cdns_by_names": "'Fastly', 'Fastly'"
},
"www.cisa.gov": {
"IP": "'104.73.243.204'",
"cdns": "'.edgekey.net', '.akamaitechnologies.fr'",
"cdns_by_names": "'Akamai', 'Akamai'"
},
"www.leagueoflegends.com": {
"IP": "'104.124.142.217'",
"cdns": "'.edgekey.net', '.akamaitechnologies.fr'",
"cdns_by_names": "'Akamai', 'Akamai'"
},
"www.cdnplanet.com": {
"IP": "'172.67.69.93', '104.26.8.12', '104.26.9.12'",
"cdns": "'.cloudflare.com'",
"cdns_by_names": "'Cloudflare'"
},
"pascal-0x90.github.io": {
"IP": "'185.199.111.153', '185.199.110.153', '185.199.109.153', '185.199.108.153'",
"cdns": "'.nocookie.net'",
"cdns_by_names": "'Fastly'"
},
"line.me": {
"IP": "'147.92.243.206', '147.92.146.166'",
"cdns": "'.cloudfront.net'",
"cdns_by_names": "'Cloudfront'"
},
"blog.clova.line.me": {
"IP": "'199.60.103.28', '199.60.103.228'",
"cdns": "'.cloudflare.com'",
"cdns_by_names": "'Cloudflare'"
},
"www.achp.gov": {
"IP": "'52.222.85.79', '3.30.138.53'",
"cdns": "'.amazonaws.com'",
"cdns_by_names": "'Amazon AWS'"
},
"twitter.com": {
"IP": "'104.244.42.193', '104.244.42.129', '104.244.42.65', '104.244.42.1'",
"cdns": "'.twimg.com'",
"cdns_by_names": "'Twitter'"
},
"platform.twitter.com": {
"IP": "'192.229.163.25'",
"cdns": "'.wac.', 'edgecastcdn.net', '.v5cdn.net'",
"cdns_by_names": "'EdgeCast', 'EdgeCast', 'EdgeCast'"
}
}
}
v1 Finished in: 34.34032432202366 s
Why does www.netflix.com result in an empty CDN?
> www.netflix.com
> 18.200.8.190
> EC2-18-200-8-190.EU-WEST-1.COMPUTE.AMAZONAWS.COM
This is based on https://subdomainfinder.c99.nl/ and https://bgpview.io/ip/18.200.8.190
At least according to CDN Planet with their CDN Finder, here, Netflix does not use a CDN. Another tool I use to validate, Wappalyzer, does not identify any CDN being used by www.netflix.com.
On that note, though: yes, the site itself may be hosted in AWS, but that is not indicative of a CDN. You would need to use Cloudfront and put it in front of your EC2 instance by setting up an ALB as described here.
If this logic is flawed and folks think that AWS identifiers do mean a CDN, then I can restore that identifier. Otherwise, I took out the line that says
".*\.amazonaws\.com": "Amazon AWS"
in cdn_config since that had too many false positives.
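To make the false-positive problem concrete, here is a small sketch of regex-to-name matching in the same `"pattern": "name"` shape as the line above. The table and `match_cname` are illustrative stand-ins, not the real cdn_config contents.

```python
import re

# Illustrative pattern table in the "regex": "name" shape quoted above.
CNAME_PATTERNS = {
    r".*\.cloudfront\.net": "Cloudfront",
    r".*\.fastly\.net": "Fastly",
    r".*\.amazonaws\.com": "Amazon AWS",  # the entry that was removed
}


def match_cname(cname, patterns=CNAME_PATTERNS):
    """Return the names of all patterns that fully match the hostname."""
    host = cname.rstrip(".").lower()
    return [name for pattern, name in patterns.items() if re.fullmatch(pattern, host)]


# The same table without the AWS entry, as in the change described above.
WITHOUT_AWS = {p: n for p, n in CNAME_PATTERNS.items() if n != "Amazon AWS"}
```

With the AWS entry present, a plain EC2 reverse name such as the one shown for www.netflix.com is reported as a CDN; with `WITHOUT_AWS` it comes back empty, while real Cloudfront hostnames still match.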
I would also like to note that this does not mean they are not using a load balancer; it just is not a CDN. So it is possible to see X-CACHE type tags without the site being part of a CDN.
Also, if you look at subdomain finder, you will notice that some of the subdomains do actually use CDNs.
cdn.netflix.com ==> Akamai
jsapi.netflix.com ==> Akamai
image.netflix.com ==> Akamai
mcdn.netflix.com ==> Akamai
ncds.netflix.com ==> Akamai
partnertools.nrd.netflix.com ==> Cloudfront
top10.netflix.com ==> Cloudfront
creativeservices.netflix.com ==> Cloudflare
updates.netflix.com ==> Cloudflare
openconnect.netflix.com ==> Cloudflare
devices.netflix.com ==> Cloudflare
roomeo.netflix.com ==> Cloudfront
lacounty.netflix.com ==> Cloudflare
cache.netflix.com ==> Cloudflare
Nice! Seems like a very positive change, let me know if I can help somehow!
💡 Summary
Provide alternate techniques to find the CDN behind target domains and/or simplify the use of the CDN Engine. The latest engine doesn't quite cut it in terms of speed: scanning can range from 1-15/s, which is quite slow; the ideal target should be at least 0.5-1/s. Using threading with max cpu_count doesn't help much either.
Motivation and context
This is a problem when scanning a list of subdomains, because a subdomain can have a different CDN than its parent domain, which results in a giant number of subdomains to scan. For example:
Parent Domain: line.me
Parent CDN: Amazon CloudFront
Subdomain: blog.clova.line.me
Subdomain CDN: CloudFlare
Why does this work belong in this project?
This would be useful for scanning giant lists of subdomains and for enhancing the CDN Engine. For example, I have a target list of approximately 166000 lines, and scanning takes >1/s per domain, resulting in thousands of minutes until it finishes.
Scan Times
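A quick back-of-the-envelope check of those numbers, assuming "1/s" is read as one second per domain and the scan runs strictly one domain at a time. The 0.6 s figure corresponds to the roughly 100 domains/min ceiling mentioned earlier in the thread.

```python
TARGETS = 166_000  # approximate size of the target list


def total_minutes(n_targets, seconds_per_domain):
    """Total wall-clock minutes for a strictly sequential scan."""
    return n_targets * seconds_per_domain / 60


at_one_second = total_minutes(TARGETS, 1.0)    # about 2767 minutes, roughly 46 hours
at_half_second = total_minutes(TARGETS, 0.5)   # about 1383 minutes, roughly 23 hours
at_100_per_min = total_minutes(TARGETS, 0.6)   # about 1660 minutes, roughly 28 hours
```

So even the requested 0.5 s per domain leaves a full list taking most of a day, which is why splitting the list across workers keeps coming up.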
Implementation notes
The alternate way could be to use a public database such as HackerTarget or DnsDumpster to return the CDN value, or to provide a parameter that skips specified steps, such as --skip whois. Skipping would also cut scan times, but is not recommended. Or perhaps enhance the current CDN Engine?
Acceptance criteria
0-1/s