gesistsa / adaR

:computer: wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++
https://gesistsa.github.io/adaR/

add ada_get_domain #43

Closed schochastics closed 1 year ago

schochastics commented 1 year ago

Kindly requested by the webtrack team:

ada_get_domain("https://subsub.sub.domain.co.uk")
#> domain.co.uk

This is just gluing some existing functions together.

chainsawriot commented 1 year ago

Basically this: https://raw.githubusercontent.com/publicsuffix/list/master/tests/tests.txt

schochastics commented 1 year ago

Is there a get_domain hidden somewhere in ada-url? I haven't found anything. This is where I am now, but it does not catch all special cases:

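R_ada_get_domain <- function(url) {
    host <- ada_get_hostname(url)
    ps <- public_suffix(url)
    # Strip the public suffix from the end of each hostname.
    pat <- paste0("\\.", ps, "$")
    dom <- mapply(function(x, y) sub(x, "", y), pat, host, USE.NAMES = FALSE)
    # Keep only the last remaining label and re-attach the suffix.
    domain <- paste0(sub(".*\\.([^\\.]+)$", "\\1", dom), ".", ps)
    domain[host == ps] <- ""
    # Without a known public suffix, fall back to the hostname.
    domain[is.na(ps)] <- host[is.na(ps)]
    domain
}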

#' @rdname ada_get_domain
#' @export
ada_get_domain <- function(url, decode = TRUE) {
    .get(url, decode, R_ada_get_domain)
}

chainsawriot commented 1 year ago

No, I don't think ada-url has it, given that it is not PSL-aware. The domain should be the public suffix (via the PSL) plus the label before it. How about using pat plus all non-dot characters before it?

domain <- "https://www.domain.biz"
stringr::str_extract(domain, paste0("[^\\.]+\\.", public_suffix(domain)))
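
The same idea (public suffix plus the one label before it) can also be sketched without a regular expression by counting labels, which avoids having to escape the dots in the suffix. This is only a sketch: `domain_from_suffix` is a hypothetical helper, not part of adaR, and it assumes the hostname and suffix were already obtained, e.g. from `ada_get_hostname()` and `public_suffix()`.

```r
# Hypothetical helper (not part of adaR): `host` and `suffix` are single
# strings supplied by the caller.
domain_from_suffix <- function(host, suffix) {
    # The host must end in ".<suffix>"; a host equal to the suffix has
    # no label in front of it and therefore no registrable domain.
    if (!endsWith(host, paste0(".", suffix))) {
        return(NA_character_)
    }
    host_labels <- strsplit(host, ".", fixed = TRUE)[[1]]
    n_suffix <- length(strsplit(suffix, ".", fixed = TRUE)[[1]])
    # Keep the suffix labels plus exactly one label before them.
    paste(utils::tail(host_labels, n_suffix + 1L), collapse = ".")
}

domain_from_suffix("subsub.sub.domain.co.uk", "co.uk")  # "domain.co.uk"
domain_from_suffix("www.domain.biz", "biz")             # "domain.biz"
domain_from_suffix("co.uk", "co.uk")                    # NA
```

Returning `NA` when the host is itself a suffix is a choice made for this sketch; the function above in this thread returns an empty string instead.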

schochastics commented 1 year ago

I think this does not work e.g. with the example in #44

chainsawriot commented 1 year ago

Very bad way to fix this (given "kobe.jp" can be extracted).

quickfixquicksand <- function(url, suffix = adaR::public_suffix(url)) {
    hostname <- adaR::ada_get_hostname(url)
    if (suffix == hostname) {
        return(hostname)
    }
    stringr::str_extract(hostname, paste0("[^\\.]+\\.", suffix))
}

quickfixquicksand("https://kobe.jp", "kobe.jp")
quickfixquicksand("https://www.bbc.co.uk")
quickfixquicksand("https://www.bmbf.de")

schochastics commented 1 year ago

Yet again, wildcard public suffixes need special treatment.

R_ada_get_domain <- function(url) {
    host <- ada_get_hostname(url)
    host <- sub("^www\\.", "", host)
    ps <- public_suffix(url)
    pat <- paste0("\\.", ps, "$")

    dom <- mapply(function(x, y) sub(x, "", y), pat, host, USE.NAMES = FALSE)
    domain <- paste0(sub(".*\\.([^\\.]+)$", "\\1", dom), ".", ps)
    domain[host == ps & !ps %in% psl$wildcard] <- ""
    domain[host == ps & ps %in% psl$wildcard] <- ps
    domain[is.na(ps)] <- host[is.na(ps)]
    domain
}

This works for the tests I made, but I will now go through the whole list you posted.
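
For context, the interplay of wildcard and exception rules can be sketched on a toy rule set. Everything named here is an assumption for illustration: `psl_domain`, `rule_matches`, and the six rules are made up for the example, and this is not how adaR stores or applies the list. The matching follows the Public Suffix List convention that an exception rule (`!`) beats everything, otherwise the longest matching rule wins, and `*` is the fallback.

```r
# Toy rule set for illustration only; the real list is far longer.
rules <- c("com", "uk", "co.uk", "jp", "*.kobe.jp", "!city.kobe.jp")

# Does a rule (split into labels) match the rightmost labels of a host?
rule_matches <- function(rule_labels, host_labels) {
    n <- length(rule_labels)
    if (n > length(host_labels)) {
        return(FALSE)
    }
    tail_labels <- utils::tail(host_labels, n)
    all(rule_labels == "*" | rule_labels == tail_labels)
}

psl_domain <- function(host, rules) {
    host_labels <- strsplit(tolower(host), ".", fixed = TRUE)[[1]]
    is_exception <- startsWith(rules, "!")
    rule_labels <- strsplit(sub("^!", "", rules), ".", fixed = TRUE)
    hit <- vapply(rule_labels, rule_matches, logical(1),
                  host_labels = host_labels)
    if (any(hit & is_exception)) {
        # Exception rule: the suffix is the rule minus its leftmost label.
        n_suffix <- length(rule_labels[[which(hit & is_exception)[1]]]) - 1L
    } else if (any(hit)) {
        # Otherwise the rule with the most labels prevails.
        n_suffix <- max(lengths(rule_labels[hit]))
    } else {
        # No rule matched: fall back to the default rule "*".
        n_suffix <- 1L
    }
    # A host that is itself a public suffix has no registrable domain.
    if (length(host_labels) <= n_suffix) {
        return(NA_character_)
    }
    paste(utils::tail(host_labels, n_suffix + 1L), collapse = ".")
}

psl_domain("www.bbc.co.uk", rules)     # "bbc.co.uk"
psl_domain("kobe.jp", rules)           # "kobe.jp"
psl_domain("c.kobe.jp", rules)         # NA (matches the wildcard rule)
psl_domain("b.c.kobe.jp", rules)       # "b.c.kobe.jp"
psl_domain("www.city.kobe.jp", rules)  # "city.kobe.jp" (exception rule)
```
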

schochastics commented 1 year ago

Oh crap, this broke things again.

schochastics commented 1 year ago

OK, we cannot support all the test cases, because not all of them have a valid public suffix.
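
To see which cases fail and why, one option is to parse the linked test file and compare each expected value with what a candidate function returns. The sketch below assumes the file consists of `checkPublicSuffix(input, expected);` lines with `null` for "no domain", and uses a few inline lines in that style instead of downloading it. `parse_psl_tests` and `run_psl_tests` are hypothetical helpers, and `fn` stands for whatever function is being checked.

```r
# A few lines in the assumed format of the linked tests.txt.
test_lines <- c(
    "// Mixed case.",
    "checkPublicSuffix('COM', null);",
    "checkPublicSuffix('example.COM', 'example.com');",
    "checkPublicSuffix('www.test.k12.ak.us', 'test.k12.ak.us');"
)

# Turn checkPublicSuffix(...) lines into a data frame; other lines are skipped.
parse_psl_tests <- function(lines) {
    pat <- "^checkPublicSuffix\\((null|'[^']*'),\\s*(null|'[^']*')\\);\\s*$"
    lines <- lines[grepl(pat, lines, perl = TRUE)]
    unquote <- function(x) {
        ifelse(x == "null", NA_character_, gsub("'", "", x, fixed = TRUE))
    }
    data.frame(
        input = unquote(sub(pat, "\\1", lines, perl = TRUE)),
        expected = unquote(sub(pat, "\\2", lines, perl = TRUE)),
        stringsAsFactors = FALSE
    )
}

# Apply `fn` to every input and flag whether the result matches,
# treating NA (null in the file) as "no domain expected".
run_psl_tests <- function(cases, fn) {
    got <- vapply(cases$input, fn, character(1), USE.NAMES = FALSE)
    both_na <- is.na(cases$expected) & is.na(got)
    same <- !is.na(cases$expected) & !is.na(got) & cases$expected == got
    cases$got <- got
    cases$ok <- both_na | same
    cases
}

cases <- parse_psl_tests(test_lines)
# `tolower` is only a stand-in for the function under test.
result <- run_psl_tests(cases, tolower)
result[!result$ok, ]
```

Filtering on `!result$ok` lists the failing inputs, which makes it easier to separate genuine bugs from cases that cannot pass because no valid public suffix exists for them.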