gesistsa / adaR

Wrapper for ada-url, a WHATWG-compliant and fast URL parser written in modern C++
https://gesistsa.github.io/adaR/

public_suffix fails with wildcard-only URL #44

Closed schochastics closed 11 months ago

schochastics commented 11 months ago
adaR::ada_url_parse("http://kobe.jp")
#>              href protocol username password    host hostname port pathname
#> 1 http://kobe.jp/    http:                   kobe.jp  kobe.jp             /
#>   search hash
#> 1
adaR::public_suffix("http://kobe.jp")
#> [1] "jp.kobe.jp"

Created on 2023-09-26 with reprex v2.0.2

schochastics commented 11 months ago
public_suffix <- function(url) {
    if (is.null(url)) {
        return(character())
    }
    suffix_match <- triebeard::longest_match(adaR_env$trie_ps, url_reverse(url))
    with_wildcard <- suffix_match %in% psl$wildcard
    if (any(with_wildcard)) {
        pat <- paste0("\\.", suffix_match[with_wildcard], "$")

        dom <- mapply(function(x, y) {
            if (grepl(x, y)) {
                return(sub(x, "", y))
            } else {
                return(y)
            }
        }, pat, url[with_wildcard], USE.NAMES = FALSE)
        found <- dom != url[with_wildcard]
        # `found` is relative to the wildcard subset, so map it back to positions in suffix_match
        idx <- which(with_wildcard)[found]
        suffix_match[idx] <- paste0(sub(".*\\.([^\\.]+)$", "\\1", dom[found]), ".", suffix_match[idx])
    }
    suffix_match
}

This is ugly, but it fixes the issue. Thoughts, @chainsawriot? (So many corner cases...)

schochastics commented 11 months ago

http://c.mm fails

schochastics commented 11 months ago

urltools fails with the kobe.jp example as well:

R> urltools::suffix_extract("http://kobe.jp")
             host subdomain       domain suffix
1 http://kobe.jp      <NA> http://kobe     jp
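
For reference, here is a minimal base-R sketch of how Public Suffix List matching is generally expected to treat the cases in this thread. It is not adaR's implementation: `sketch_public_suffix()` is a hypothetical helper, it takes a bare hostname rather than a full URL, and the three rule vectors are a tiny hand-written stand-in for the real list (assuming entries along the lines of `jp`, `*.kobe.jp`, `!city.kobe.jp` and `*.mm`). The point it illustrates is that a wildcard rule such as `*.kobe.jp` only applies when there is a label to the left of `kobe.jp`, so a wildcard-only host falls back to the shorter rule.

```r
# Hypothetical helper for illustration only, not part of adaR or urltools.
# `rules`, `wildcards` and `exceptions` are an assumed, hand-written subset of
# the Public Suffix List; `wildcards` holds the part after "*." and
# `exceptions` holds the part after "!".
sketch_public_suffix <- function(host,
                                 rules = c("jp"),
                                 wildcards = c("kobe.jp", "mm"),
                                 exceptions = c("city.kobe.jp")) {
  labels <- strsplit(host, ".", fixed = TRUE)[[1]]
  n <- length(labels)
  # Candidate suffixes from longest to shortest,
  # e.g. "foo.kobe.jp", "kobe.jp", "jp"
  candidates <- vapply(
    seq_len(n),
    function(i) paste(labels[i:n], collapse = "."),
    character(1)
  )
  for (i in seq_len(n)) {
    cand <- candidates[i]
    if (cand %in% exceptions) {
      # Exception rule: the suffix is the rule without its leftmost label
      return(if (i < n) candidates[i + 1] else cand)
    }
    if (i < n && candidates[i + 1] %in% wildcards) {
      # Wildcard rule "*.<parent>": needs one extra label left of <parent>
      return(cand)
    }
    if (cand %in% rules) {
      return(cand)
    }
  }
  # No rule matched: fall back to the implicit "*" rule (the last label)
  labels[n]
}

sketch_public_suffix("kobe.jp")
#> [1] "jp"
sketch_public_suffix("foo.kobe.jp")
#> [1] "foo.kobe.jp"
sketch_public_suffix("city.kobe.jp")
#> [1] "kobe.jp"
sketch_public_suffix("c.mm")
#> [1] "c.mm"
```

Checking candidates from longest to shortest keeps the sketch short, but it assumes exceptions only carve out wildcards at the same level, which is how they are used in this small subset.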