Closed Fluke95 closed 9 months ago
Thanks for reporting this. This seems to be a bug that might be present in all functions
library("adaR")
examples <- c(
"http://xn--53-6kcainf4buoffq.xn--p1ai/pood/junior-electrical-engineer-jobs-remote.html",
"http://xn--80abb0biooohbv.xn--p1ai/",
"http://xn--alicantesueo-khb.com/insomnio",
"https://normal-url.com/this-path-will-be-fine",
"http://xn--53-6kcainf4buoffq.xn--p1ai/this-path-will-not-be-fine"
)
ada_url_parse(examples, decode = FALSE)
#> href
#> 1 http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html
#> 2 http://xn--80abb0biooohbv.xn--p1ai/
#> 3 http://xn--alicantesueo-khb.com/insomnio
#> 4 https://normal-url.com/this-path-will-be-fine
#> 5 http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be
#> protocol username password host hostname port
#> 1 http: поверкадома53.рф поверкадома53.рф
#> 2 http: бамбукхутор.рф бамбукхутор.рф
#> 3 http: alicantesueño.com alicantesueño.com
#> 4 https: normal-url.com normal-url.com
#> 5 http: поверкадома53.рф поверкадома53.рф
#> pathname search hash
#> 1 /pood/junior-electrical-engineer-jobs-remote.html
#> 2 /
#> 3 /insomnio
#> 4 /this-path-will-be-fine
#> 5 /this-path-will-not-be-fine
ada_url_parse(examples, decode = TRUE)
#> href
#> 1 http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html
#> 2 http://xn--80abb0biooohbv.xn--p1ai/
#> 3 http://xn--alicantesueo-khb.com/insomnio
#> 4 https://normal-url.com/this-path-will-be-fine
#> 5 http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be
#> protocol username password host hostname port
#> 1 http: поверкадома53.рф поверкадома53.рф
#> 2 http: бамбукхутор.рф бамбукхутор.рф
#> 3 http: alicantesueño.com alicantesueño.com
#> 4 https: normal-url.com normal-url.com
#> 5 http: поверкадома53.рф поверкадома53.рф
#> pathname search hash
#> 1 /pood/junior-electrical-engineer-jobs-remote.html
#> 2 /
#> 3 /insomnio
#> 4 /this-path-will-be-fine
#> 5 /this-path-will-not-be-fine
Created on 2024-01-10 with reprex v2.0.2
First guess is the charsub function https://github.com/gesistsa/adaR/blob/3b43e60cca9348b76eb5b712f754399f14af35ec/src/adaR.cpp#L5-L13 and specifically the call to ada_idna_to_unicode.
@schochastics I think it only affects some URLs with punycode, and therefore ada_get_href() (or the internal C++ function Rcpp_ada_get_href()).
Reduced to the smallest reproducible case, the problem is this:
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior.html") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/juniorprogrammer.html") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior_programmer.html") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html") ## BEEEEEEP!
Just to be sure, the same URL works when using the ada C API directly (modified from the C demo).
#include "ada_c.h"
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>
static void ada_print(ada_string string) {
printf("%.*s\n", (int)string.length, string.data);
}
int main(int c, char* arg[]) {
const char* input =
"http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html";
ada_url url = ada_parse(input, strlen(input));
if (!ada_is_valid(url)) {
puts("failure");
return EXIT_FAILURE;
}
ada_print(ada_get_href(url));
ada_free(url);
return EXIT_SUCCESS;
}
## with the single-header distribution: ada.cpp and ada.h
c++ -c ada.cpp -std=c++17
cc -c demo.c
c++ demo.o ada.o -o cdemo
./cdemo
Like you said, @schochastics, a thing that I found is that there are
Maybe one solution is to not always force href to Unicode, but to keep it in ASCII. Or simply return the original input.
@chainsawriot do you want to give fixing it a try? I am fine with any solution that does not affect other parts negatively. Obviously there is no stress, and this can wait till March.
Hi! I've encountered a bug in the adaR::ada_set_* family of functions related to pathname processing. When a URL is in punycode (domain starting with xn--), using adaR's set functions changes the pathname encoding, and I don't know how to prevent (or revert) this behavior. For example:
will return:
Notice the 1st and 5th URLs, even though adaR::ada_get_pathname(examples, decode = FALSE) returns correct output:
The same behavior is present even when the pathname isn't changed, for example:
Also, it's worth noting that the hostnames look different (they are encoded), but the function call above didn't change the hostname at all.
My sessionInfo()