commoncrawl/cc-index-table
Index Common Crawl archives in tabular format
Apache License 2.0
Support for DNS URIs and other non-HTTP URI/URL schemes #15
Closed (sebastian-nagel closed 2 years ago)
sebastian-nagel commented 2 years ago
(resolves #9)
parse dns:, metadata:, whois:, filedesc: URIs
if applicable: extract host name and other URI/URL parts
ensure that parsing of URLs which are not valid URIs does not fail
rename class CommonCrawlURL to WarcUri (
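
The parsing behaviour listed above could be sketched roughly as follows. This is an illustrative Python sketch, not the project's WarcUri class: the names `ParsedUri`, `parse_warc_uri` and `NO_HOST_SCHEMES` are hypothetical, and the choice to report no host for `filedesc:` URIs is an assumption made for this example only.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit

# Assumption: for these schemes the "authority" part is not a host name
# (a filedesc: URI names a WARC file), so no host is reported.
NO_HOST_SCHEMES = {"filedesc"}


class ParsedUri(NamedTuple):
    """Hypothetical result type: every part is optional."""
    scheme: Optional[str]
    host: Optional[str]
    path: Optional[str]


def parse_warc_uri(uri: str) -> ParsedUri:
    """Best-effort split of a WARC target URI; never raises on bad input."""
    try:
        parts = urlsplit(uri.strip())
    except ValueError:
        # Malformed input (e.g. an unclosed IPv6 bracket): keep the record
        # and leave all parts empty instead of failing.
        return ParsedUri(None, None, None)

    scheme = parts.scheme or None
    path = parts.path or None
    try:
        host = parts.hostname  # already lower-cased by urllib
    except ValueError:
        host = None

    if scheme in NO_HOST_SCHEMES:
        host = None
    elif scheme == "dns" and host is None and path:
        # dns:www.example.com has no authority part; the queried
        # name is carried in the path.
        host = path.lstrip("/").lower()

    return ParsedUri(scheme, host, path)
```

For example, `parse_warc_uri("dns:www.example.com").host` yields the queried name, while a URL that is not a valid URI, such as one containing an unescaped space, is still split into its parts rather than rejected.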