carlbordum / common-crawl-subdomains

subdomain list based on Common Crawl data, sorted by popularity

Scripts to generate this data #1

Open jayvdb opened 4 years ago

jayvdb commented 4 years ago

Are the tools used to create this data also available somewhere?

carlbordum commented 4 years ago

I wrote a small Python script to do this for me. I no longer have the script, but we can reconstruct it:

It used the Public Suffix List (PSL) via the publicsuffixlist package to parse the domains. The subdomains were deduplicated, counted, and sorted with collections.Counter, and the shorter lists are generated with the standard Unix tool head, as shown in the readme. Without the PSL parsing of the domains, the script was pretty much:

# remove duplicates and sort by occurrence
from collections import Counter

subdomains = ...  # an iterable of subdomain labels extracted from the Common Crawl data

c = Counter()

for subdomain in subdomains:
    c[subdomain] += 1

# most_common() yields (subdomain, count) pairs, most frequent first
with open("common-crawl-subdomains", "w") as f:
    for s, _ in c.most_common():
        f.write(s)
        f.write("\n")