jayvdb opened 4 years ago
I wrote a small Python script to do this for me. I no longer have the script, but it is easy to reconstruct: it used the Public Suffix List (PSL) via the `publicsuffixlist` package, counted and deduplicated the subdomains with `collections.Counter`, and generated the shorter lists with the standard Unix tool `head`, as shown in the readme. Without the PSL parsing of the domains, the script was pretty much:
```python
# remove duplicates and sort by occurrence
from collections import Counter

subdomains = ...
c = Counter()
for subdomain in subdomains:
    c[subdomain] += 1

with open("common-crawl-subdomains", "w+") as f:
    for s, _ in c.most_common():
        f.write(s)
        f.write("\n")
```
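For illustration, the PSL step can be approximated with a stdlib-only sketch. This is not the original script: it swaps the `publicsuffixlist` package for a hard-coded toy suffix set and a hypothetical helper `subdomain_label`, just to show how the subdomain left of the registrable domain would be extracted before counting.

```python
from collections import Counter

# Toy stand-in for the Public Suffix List; the real script used the
# publicsuffixlist package instead of this hard-coded set.
PUBLIC_SUFFIXES = {"com", "co.uk", "org"}

def subdomain_label(host):
    """Return the label left of the registrable domain,
    e.g. 'www' for 'www.example.co.uk', or None if there is none."""
    labels = host.lower().strip(".").split(".")
    # Scan left to right so the longest matching public suffix wins.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in PUBLIC_SUFFIXES:
            # labels[i-1] is the registrable label; labels[i-2] the subdomain.
            if i >= 2:
                return labels[i - 2]
            return None
    return None

hosts = ["www.example.com", "mail.example.com", "www.example.co.uk", "example.org"]
c = Counter(filter(None, (subdomain_label(h) for h in hosts)))
# c.most_common() -> [('www', 2), ('mail', 1)]
```

With the real PSL the lookup handles thousands of suffix rules (including wildcards and exceptions), but the counting and sorting logic stays the same.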
Are the tools used to create this data also available somewhere?