schochastics closed 1 year ago
Is there a get_domain hidden somewhere in ada-url? I haven't found anything.
This is where I am now, but it does not catch all special cases:
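R_ada_get_domain <- function(url) {
    host <- ada_get_hostname(url)
    ps <- public_suffix(url)
    # Strip the trailing ".<suffix>" from each hostname
    pat <- paste0("\\.", ps, "$")
    dom <- mapply(function(x, y) sub(x, "", y), pat, host, USE.NAMES = FALSE)
    # Keep only the last remaining label and re-attach the suffix
    domain <- paste0(sub(".*\\.([^\\.]+)$", "\\1", dom), ".", ps)
    domain[host == ps] <- ""
    # Without a public suffix, fall back to the hostname
    na_ps <- is.na(ps)
    domain[na_ps] <- host[na_ps]
    domain
}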
#' @rdname ada_get_domain
#' @export
ada_get_domain <- function(url, decode = TRUE) {
    .get(url, decode, R_ada_get_domain)
}
No, I don't think ada has it, given that it is not PSL-aware. The domain should be the TLD (via psl) plus the label before it. How about using pat plus all non-dot characters before it?
domain <- "https://www.domain.biz"
stringr::str_extract(domain, paste0("[^\\.]+\\.", public_suffix(domain)))
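The same "suffix plus the label before it" idea can also be sketched without building a regular expression from the suffix, which avoids the dots in a suffix such as "co.uk" matching any character. This is only a rough illustration: registrable_domain is a hypothetical helper, not part of adaR or ada-url, and it takes the hostname and public suffix as plain character vectors so that it runs in base R on its own.

```r
# Hypothetical helper (not part of adaR): hostnames and their public
# suffixes are passed in as character vectors of equal length.
registrable_domain <- function(host, suffix) {
    out <- rep(NA_character_, length(host))
    # Only hosts that end in ".<suffix>" have a label in front of the suffix
    ok <- !is.na(host) & !is.na(suffix) & endsWith(host, paste0(".", suffix))
    # Drop the trailing ".<suffix>", then keep the last remaining label
    prefix <- substr(host[ok], 1, nchar(host[ok]) - nchar(suffix[ok]) - 1)
    label <- sub(".*\\.", "", prefix)
    out[ok] <- paste0(label, ".", suffix[ok])
    out
}

registrable_domain(
    c("www.bbc.co.uk", "www.bmbf.de", "kobe.jp"),
    c("co.uk", "de", "kobe.jp")
)
# "bbc.co.uk" "bmbf.de" NA
```

A host that is itself the suffix comes back as NA here; whether that should be "", the suffix, or NA is a design choice this sketch does not settle.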
I think this does not work, e.g. with the example in #44.
A very bad way to fix this (given that "kobe.jp" can be extracted):
quickfixquicksand <- function(url, suffix = adaR::public_suffix(url)) {
    hostname <- adaR::ada_get_hostname(url)
    if (suffix == hostname) {
        return(hostname)
    }
    stringr::str_extract(hostname, paste0("[^\\.]+\\.", suffix))
}
quickfixquicksand("https://kobe.jp", "kobe.jp")
quickfixquicksand("https://www.bbc.co.uk")
quickfixquicksand("https://www.bmbf.de")
Wildcard public suffixes need special treatment yet again:
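R_ada_get_domain <- function(url) {
    host <- ada_get_hostname(url)
    host <- sub("^www\\.", "", host)
    ps <- public_suffix(url)
    # Strip the trailing ".<suffix>" from each hostname
    pat <- paste0("\\.", ps, "$")
    dom <- mapply(function(x, y) sub(x, "", y), pat, host, USE.NAMES = FALSE)
    # Keep only the last remaining label and re-attach the suffix
    domain <- paste0(sub(".*\\.([^\\.]+)$", "\\1", dom), ".", ps)
    # A host that is itself a suffix has no domain, unless the suffix is a wildcard one
    domain[host == ps & !ps %in% psl$wildcard] <- ""
    domain[host == ps & ps %in% psl$wildcard] <- ps[host == ps & ps %in% psl$wildcard]
    # Without a public suffix, fall back to the hostname
    na_ps <- is.na(ps)
    domain[na_ps] <- host[na_ps]
    domain
}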
This works for the tests I made, but I will now go through the whole list you posted.
Oh crap, this broke things again.
OK, we cannot support all the test cases, because not all test cases have a valid public suffix.
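One way to handle this in a test run might be to split the cases by whether a public suffix was found, and only assert on the ones that have one. The sketch below is base R only and uses made-up data: the cases data frame and its host and suffix columns are hypothetical, with the suffix column standing in for the output of public_suffix() and NA marking a host without a valid public suffix.

```r
# Made-up illustration data, not the actual test list from this thread.
# `suffix` stands in for public_suffix() output; NA means no valid suffix.
cases <- data.frame(
    host = c("www.bbc.co.uk", "some-internal-host", "www.bmbf.de"),
    suffix = c("co.uk", NA, "de"),
    stringsAsFactors = FALSE
)

# Cases that a suffix-based get_domain can reasonably be tested against
testable <- cases[!is.na(cases$suffix), ]
# Cases to report separately instead of counting them as failures
skipped <- cases[is.na(cases$suffix), ]
```

Keeping the skipped rows around, instead of dropping them silently, makes it easier to see how much of the list is out of scope.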
kindly requested by webtrack team:
Just glueing some existing functions