gesistsa / adaR

:computer: wrapper for ada-url a WHATWG-compliant and fast URL parser written in modern C++
https://gesistsa.github.io/adaR/
Other
26 stars 2 forks source link

added to_unicode parameter (fix #66) #67

Closed schochastics closed 9 months ago

schochastics commented 9 months ago

added a parameter to turn of the unicode in charsub()

adaR::ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html",to_unicode = TRUE) 
#> [1] "http://xn--53-6kcainf4buoffq.ҖҖpҕҘ1җai/dҗoof/jҖuҔnior.html"
adaR::ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html", to_unicode = FALSE)
#> [1] "http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html"

Created on 2024-01-30 with reprex v2.1.0

@chainsawriot what do you think about this solution? Does that make sense to you?

chainsawriot commented 9 months ago

Thank you for the implementation. But there is one problem. As I said previously, this actually only affects the href in #66 . Therefore, we shouldn't change the behavior of extracting other fields.

Just as an example:

url <- "http://xn--53-6kcainf4buoffq.xn--p1ai/\u6e2f\u805e/junior-programmer.html"
adaR::ada_url_parse(url, to_unicode = TRUE)
#>                                                                              href
#> 1 http://xn--53-6kcainf4buoffq.ʾp1aʽi/\xe6\xb8ʾ\xaf\xafE8\x81ʽʽ\x9e/juniʼoʾr.html
#>   protocol username password             host         hostname port
#> 1    http:                   поверкадома53.рф поверкадома53.рф     
#>                       pathname search hash
#> 1 /港聞/junior-programmer.html
adaR::ada_url_parse(url, to_unicode = FALSE)
#>                                                                href protocol
#> 1 http://xn--53-6kcainf4buoffq.xn--p1ai/港聞/junior-programmer.html    http:
#>   username password                           host
#> 1                   xn--53-6kcainf4buoffq.xn--p1ai
#>                         hostname port                     pathname search hash
#> 1 xn--53-6kcainf4buoffq.xn--p1ai      /港聞/junior-programmer.html

Created on 2024-01-31 with reprex v2.0.2

For example, in the second case, hostname doesn't change from puny to the actual hostname; but pathname is actually in unicode (breaking the promise of to_unicode = FALSE).

My solution would be not having the parameter to_unicode. Instead, just silently converting href (or not converting, just return the original input url). Also, as the package requires R 4.2, one would expect the package would return UTF-8 by default. Therefore, to_unicode = FALSE is a bit confusing.

schochastics commented 9 months ago

Hmm you are right. I will adapt accordingly, thanks!