archivesunleashed / aut

The Archives Unleashed Toolkit is an open-source toolkit for analyzing web archives.
https://aut.docs.archivesunleashed.org/
Apache License 2.0

Replace scala-uri library from ExtractDomain and just parse public_suffix_list.dat #521

ruebot closed this issue 2 years ago

ruebot commented 2 years ago

Since we only use the scala-uri library in ExtractDomain to get the apex domain, we can drop the dependency and roll our own implementation here.

Pull down public_suffix_list.dat, hold its rules in memory as a set, and find the apex domain by matching a host's suffixes against that set.

val publicSuffixes = scala.io.Source
  .fromURL("https://publicsuffix.org/list/public_suffix_list.dat", "utf-8")
  .getLines()
  .map(_.trim)
  .filter(_.nonEmpty)
  .filter(!_.startsWith("//")) // skip comment lines in the list
  .toSet
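The look-up step could then work roughly like the sketch below. This is only an illustration, not the code that ended up in ExtractDomain: `ApexDomain`, `apex`, and `sampleRules` are hypothetical names, and `sampleRules` is a tiny hand-picked set in the list's rule syntax (plain, `*.` wildcard, and `!` exception rules) so the example runs without a network call. It assumes hosts are already ASCII/punycode and does no IDN handling.

```scala
object ApexDomain {
  // Hypothetical sample of rules in public_suffix_list.dat syntax.
  // In practice this would be the set loaded from the real list.
  val sampleRules: Set[String] =
    Set("com", "uk", "co.uk", "ck", "*.ck", "!www.ck")

  /** Returns the apex (registrable) domain of `host`,
    * or None if `host` is malformed or is itself a public suffix.
    */
  def apex(host: String, rules: Set[String] = sampleRules): Option[String] = {
    val labels = host.toLowerCase.stripSuffix(".").split('.').toVector
    if (labels.exists(_.isEmpty)) None
    else {
      // Every trailing slice of the host, longest first,
      // e.g. a.b.com -> [a,b,com], [b,com], [com].
      val tails = labels.indices.map(i => labels.drop(i))

      // An exception rule ("!www.ck") wins outright; the public suffix
      // is the rule with its leftmost label removed.
      val viaException = tails.collectFirst {
        case t if rules.contains("!" + t.mkString(".")) => t.length - 1
      }

      // Otherwise the longest exact or wildcard match is the public suffix.
      val viaRule = tails.collectFirst {
        case t if rules.contains(t.mkString(".")) ||
          (t.length > 1 && rules.contains("*." + t.tail.mkString("."))) =>
          t.length
      }

      // With no matching rule, treat the last label as the suffix.
      val suffixLen = viaException.orElse(viaRule).getOrElse(1)

      // The apex domain is the public suffix plus one more label.
      if (labels.length > suffixLen)
        Some(labels.takeRight(suffixLen + 1).mkString("."))
      else None
    }
  }
}
```

With the sample rules, `ApexDomain.apex("news.bbc.co.uk")` gives `Some("bbc.co.uk")`, and `ApexDomain.apex("co.uk")` gives `None` because the host is itself a public suffix. Passing the real set as the second argument, e.g. `ApexDomain.apex(host, publicSuffixes)`, would use the full list instead of the sample.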

ruebot commented 2 years ago

Follow-up issue to #520