Closed schochastics closed 9 months ago
Thank you for the implementation. But there is one problem. As I said previously, this actually only affects the href
in #66 . Therefore, we shouldn't change the behavior of extracting other fields.
Just as an example:
url <- "http://xn--53-6kcainf4buoffq.xn--p1ai/\u6e2f\u805e/junior-programmer.html"
adaR::ada_url_parse(url, to_unicode = TRUE)
#> href
#> 1 http://xn--53-6kcainf4buoffq.ʾp1aʽi/\xe6\xb8ʾ\xaf\xafE8\x81ʽʽ\x9e/juniʼoʾr.html
#> protocol username password host hostname port
#> 1 http: поверкадома53.рф поверкадома53.рф
#> pathname search hash
#> 1 /港聞/junior-programmer.html
adaR::ada_url_parse(url, to_unicode = FALSE)
#> href protocol
#> 1 http://xn--53-6kcainf4buoffq.xn--p1ai/港聞/junior-programmer.html http:
#> username password host
#> 1 xn--53-6kcainf4buoffq.xn--p1ai
#> hostname port pathname search hash
#> 1 xn--53-6kcainf4buoffq.xn--p1ai /港聞/junior-programmer.html
Created on 2024-01-31 with reprex v2.0.2
For example, in the second case, hostname
doesn't change from puny to the actual hostname
; but pathname
is actually in unicode (breaking the promise of to_unicode = FALSE
).
My solution would be not having the parameter to_unicode
. Instead, just silently converting href
(or not converting, just return the original input url
). Also, as the package requires R 4.2, one would expect the package would return UTF-8 by default. Therefore, to_unicode = FALSE
is a bit confusing.
Hmm you are right. I will adapt accordingly, thanks!
added a parameter to turn of the unicode in
charsub()
Created on 2024-01-30 with reprex v2.1.0
@chainsawriot what do you think about this solution? Does that make sense to you?