carlbordum / common-crawl-subdomains

subdomain list based on Common Crawl data, sorted by popularity

Scripts to generate this data #1

Open jayvdb opened 4 years ago

jayvdb commented 4 years ago

Are the tools used to create this data also available somewhere?

carlbordum commented 4 years ago

I wrote a small Python script to do this for me. I no longer have the script, but we can reconstruct it:

It used the Public Suffix List (PSL) via the publicsuffixlist package to parse the domains. The subdomains were deduplicated, counted, and sorted with collections.Counter, and the shorter lists are generated with the standard Unix tool head, as shown in the readme. Without the PSL parsing of the domains, the script was pretty much:

# remove duplicates and sort by occurrence
from collections import Counter

subdomains = ...  # an iterable of subdomain labels extracted from the Common Crawl data

c = Counter()

for subdomain in subdomains:
    c[subdomain] += 1

# most_common() yields (subdomain, count) pairs, most frequent first
with open("common-crawl-subdomains", "w") as f:
    for s, _ in c.most_common():
        f.write(s)
        f.write("\n")