Closed schochastics closed 11 months ago
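public_suffix <- function(url) {
  if (is.null(url)) {
    return(character())
  }
  # Longest suffix in the trie matching each (reversed) URL
  suffix_match <- triebeard::longest_match(adaR_env$trie_ps, url_reverse(url))
  # Suffixes listed as wildcard rules need one more label prepended
  with_wildcard <- suffix_match %in% psl$wildcard
  if (any(with_wildcard)) {
    pat <- paste0("\\.", suffix_match[with_wildcard], "$")
    # Strip ".<suffix>" from the end of each URL; leave the URL unchanged if absent
    dom <- mapply(function(x, y) {
      if (grepl(x, y)) sub(x, "", y) else y
    }, pat, url[with_wildcard], USE.NAMES = FALSE)
    found <- dom != url[with_wildcard]
    # Map the hits within the wildcard subset back to positions in the full vector
    idx <- which(with_wildcard)[found]
    # Prepend the label directly left of the matched suffix
    suffix_match[idx] <- paste0(sub(".*\\.([^\\.]+)$", "\\1", dom[found]), ".", suffix_match[idx])
  }
  suffix_match
}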
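For comparison, the wildcard handling can be sketched in base R without the trie. This is only a minimal sketch, not adaR's implementation: the rule vectors `plain_rules` and `wildcard_rules` and the function `public_suffix_sketch()` are made up for illustration, inputs are assumed to be bare host names (no scheme, port, or path), and exception rules such as `!city.kobe.jp` are ignored.

```r
# Hypothetical miniature rule set; the real Public Suffix List is far larger.
plain_rules <- c("jp", "mm", "com")
wildcard_rules <- c("kobe.jp", "mm")  # stand-ins for "*.kobe.jp" and "*.mm"

# Return the public suffix of bare host names, or NA if no rule matches.
public_suffix_sketch <- function(host) {
  vapply(host, function(h) {
    labels <- strsplit(h, ".", fixed = TRUE)[[1]]
    n <- length(labels)
    # All trailing label runs, longest first: "a.b.c" -> "a.b.c", "b.c", "c"
    tails <- vapply(seq_len(n), function(i) {
      paste(labels[i:n], collapse = ".")
    }, character(1))
    # A wildcard rule "*.x" matches any tail of the form "<label>.x"
    parents <- sub("^[^.]+\\.", "", tails)
    is_wild <- grepl(".", tails, fixed = TRUE) & parents %in% wildcard_rules
    # Longest matching tail wins, whether it comes from a plain or a wildcard rule
    hit <- which(is_wild | tails %in% plain_rules)
    if (length(hit) == 0) NA_character_ else tails[hit[1]]
  }, character(1), USE.NAMES = FALSE)
}

public_suffix_sketch(c("a.b.kobe.jp", "c.mm", "kobe.jp", "example.com"))
# "b.kobe.jp" "c.mm" "jp" "com"
```

Walking the labels from the longest tail down avoids the regex step entirely, at the cost of being much slower than a trie lookup on large inputs.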
This is ugly, but it fixes the problem. Thoughts, @chainsawriot? (So many corner cases...)
http://c.mm fails
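One thing worth checking here is the label-extraction regex from the function above. It only isolates the last label when a dot precedes it; otherwise `sub()` returns its input unchanged. Whether this is what actually trips up `http://c.mm` is an assumption (it depends on whether the function receives the full URL or just the host); the helper name `last_label()` is made up for illustration.

```r
# Regex copied from the function above: keep only the last dot-separated label.
last_label <- function(x) sub(".*\\.([^\\.]+)$", "\\1", x)

last_label("www.example")  # "example": a dot precedes the last label
last_label("c")            # "c": no dot, so sub() returns the input unchanged
last_label("http://c")     # "http://c": no dot either, so the scheme stays attached
```

If the input still carries its scheme, a single-label remainder such as `http://c` would be glued onto the suffix as is.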
urltools fails with the kobe example:
R> urltools::suffix_extract("http://kobe.jp")
            host subdomain      domain suffix
1 http://kobe.jp      <NA> http://kobe     jp
Created on 2023-09-26 with reprex v2.0.2
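Note that the host and domain columns in the output above still contain `http://`, so the scheme was never stripped before the suffix lookup. A rough base-R sketch of reducing a URL to its bare host first is shown below; `bare_host()` is a hypothetical helper, not part of urltools or adaR, and it does not handle IPv6 literals or other unusual URL forms.

```r
# Hypothetical helper: reduce a URL to its bare host before any suffix lookup.
bare_host <- function(url) {
  host <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)  # drop the scheme
  host <- sub("[/?#].*$", "", host)                     # drop path, query, fragment
  host <- sub("^.*@", "", host)                         # drop userinfo
  sub(":[0-9]+$", "", host)                             # drop the port
}

bare_host("http://kobe.jp")
# "kobe.jp"
```

In real code a proper URL parser is the safer choice; the regexes here only cover the common shapes.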