gesistsa / adaR

:computer: wrapper for ada-url a WHATWG-compliant and fast URL parser written in modern C++
https://gesistsa.github.io/adaR/
Other
26 stars 2 forks source link

Forced encoding change in `ada_set_*()` functions #66

Closed Fluke95 closed 9 months ago

Fluke95 commented 9 months ago

Hi! I've encountered a bug in adaR::ada_set_* functions family related to pathname processing. In cases where an URL is in punycode (domain starting with xn--), using adaR's set family functions changes pathname encoding and I don't know how to prevent (or revert) this behavior.

For example:

examples <- c(
  "http://xn--53-6kcainf4buoffq.xn--p1ai/pood/junior-electrical-engineer-jobs-remote.html",
  "http://xn--80abb0biooohbv.xn--p1ai/",
  "http://xn--alicantesueo-khb.com/insomnio",
  "https://normal-url.com/this-path-will-be-fine",
  "http://xn--53-6kcainf4buoffq.xn--p1ai/this-path-will-not-be-fine"
)
pathnames <- adaR::ada_get_pathname(examples, decode = FALSE)
result_pathnames <- adaR::ada_set_pathname(examples, pathnames, decode = FALSE)

will return:

result_pathnames 
[1] "http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html"
[2] "http://xn--80abb0biooohbv.xn--p1ai/"                                            
[3] "http://xn--alicantesueo-khb.com/insomnio"                                       
[4] "https://normal-url.com/this-path-will-be-fine"                                  
[5] "http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be"

Notice 1st and 5th URLs.

even though adaR::ada_get_pathname(examples, decode = FALSE) returns correct output:

pathnames 
[1] "/pood/junior-electrical-engineer-jobs-remote.html" 
[2] "/"                                                
[3] "/insomnio"                                         
[4] "/this-path-will-be-fine"                          
[5] "/this-path-will-not-be-fine"  

The same behavior is present even when pathname isn't changed, for example:

hostnames <- adaR::ada_get_hostname(examples, decode = FALSE)
result_hostnames <- adaR::ada_set_hostname(examples, hostnames, decode = FALSE)
result_hostnames 
[1] "http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html"
[2] "http://xn--80abb0biooohbv.xn--p1ai/"                                            
[3] "http://xn--alicantesueo-khb.com/insomnio"                                       
[4] "https://normal-url.com/this-path-will-be-fine"                                  
[5] "http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be"  

Also it's worth noting that hostnames looks different (is encoded), but the function call above didn't change the hostname at all.

hostnames
[1] "поверкадома53.рф"  "бамбукхутор.рф"    "alicantesueño.com" "normal-url.com"    "поверкадома53.рф" 

My sessionInfo()

R version 4.3.0 (2023-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8    LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/Warsaw
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
[1] adaR_0.3.1

loaded via a namespace (and not attached):
[1] compiler_4.3.0    tools_4.3.0       rstudioapi_0.15.0 yaml_2.3.8        Rcpp_1.0.11       triebeard_0.4.1   renv_0.17.3
schochastics commented 9 months ago

Thanks for reporting this. This seems to be a bug that might be present in all functions

library("adaR")
examples <- c(
    "http://xn--53-6kcainf4buoffq.xn--p1ai/pood/junior-electrical-engineer-jobs-remote.html",
    "http://xn--80abb0biooohbv.xn--p1ai/",
    "http://xn--alicantesueo-khb.com/insomnio",
    "https://normal-url.com/this-path-will-be-fine",
    "http://xn--53-6kcainf4buoffq.xn--p1ai/this-path-will-not-be-fine"
)

ada_url_parse(examples,decode = FALSE)
#>                                                                              href
#> 1 http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html
#> 2                                             http://xn--80abb0biooohbv.xn--p1ai/
#> 3                                        http://xn--alicantesueo-khb.com/insomnio
#> 4                                   https://normal-url.com/this-path-will-be-fine
#> 5                       http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be
#>   protocol username password              host          hostname port
#> 1    http:                    поверкадома53.рф  поверкадома53.рф     
#> 2    http:                      бамбукхутор.рф    бамбукхутор.рф     
#> 3    http:                   alicantesueño.com alicantesueño.com     
#> 4   https:                      normal-url.com    normal-url.com     
#> 5    http:                    поверкадома53.рф  поверкадома53.рф     
#>                                            pathname search hash
#> 1 /pood/junior-electrical-engineer-jobs-remote.html            
#> 2                                                 /            
#> 3                                         /insomnio            
#> 4                           /this-path-will-be-fine            
#> 5                       /this-path-will-not-be-fine

ada_url_parse(examples, decode = TRUE)
#>                                                                              href
#> 1 http://xn--53-6kcainf4buoffq.p1aǢi/pood/junior-electǢricaǢl-engǡineer-jobs.html
#> 2                                             http://xn--80abb0biooohbv.xn--p1ai/
#> 3                                        http://xn--alicantesueo-khb.com/insomnio
#> 4                                   https://normal-url.com/this-path-will-be-fine
#> 5                       http://xn--53-6kcainf4buoffq.p1ai/this-˘path˘-will-not-be
#>   protocol username password              host          hostname port
#> 1    http:                    поверкадома53.рф  поверкадома53.рф     
#> 2    http:                      бамбукхутор.рф    бамбукхутор.рф     
#> 3    http:                   alicantesueño.com alicantesueño.com     
#> 4   https:                      normal-url.com    normal-url.com     
#> 5    http:                    поверкадома53.рф  поверкадома53.рф     
#>                                            pathname search hash
#> 1 /pood/junior-electrical-engineer-jobs-remote.html            
#> 2                                                 /            
#> 3                                         /insomnio            
#> 4                           /this-path-will-be-fine            
#> 5                       /this-path-will-not-be-fine

Created on 2024-01-10 with reprex v2.0.2

schochastics commented 9 months ago

First guess is the charsub function https://github.com/gesistsa/adaR/blob/3b43e60cca9348b76eb5b712f754399f14af35ec/src/adaR.cpp#L5-L13 specifically the call to ada_idna_to_unicode.

chainsawriot commented 9 months ago

@schochastics I think it only affects some urls with puny and therefore ada_get_href() (or the internal C++ function Rcpp_ada_get_href()).

To reduce this problem into the smallest, is this:

ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior.html") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/juniorprogrammer.html") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior_programmer.html") ## ok
ada_get_href("http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html") ## BEEEEEEP!
chainsawriot commented 9 months ago

Just to be sure, this works (modified from the C demo).

#include "ada_c.h"
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <string.h>

static void ada_print(ada_string string) {
  printf("%.*s\n", (int)string.length, string.data);
}

int main(int c, char* arg[]) {
  const char* input =
      "http://xn--53-6kcainf4buoffq.xn--p1ai/doof/junior-programmer.html";
  ada_url url = ada_parse(input, strlen(input));
  if (!ada_is_valid(url)) {
    puts("failure");
    return EXIT_FAILURE;
  }
  ada_print(ada_get_href(url));
  ada_free(url);
  return EXIT_SUCCESS;
}
## with the single-header distribution: ada.cpp and ada.h
c++ -c ada.cpp -std=c++17
cc -c demo.c
c++ demo.o ada.o -o cdemo
./cdemo

Like you said, @schochastics, a thing that I found is that there are

https://github.com/gesistsa/adaR/blob/3b43e60cca9348b76eb5b712f754399f14af35ec/src/ada/ada_c.h#L109-L110

Maybe one solution is not always force href to unicode but to ascii. Or just simply return the original input.

schochastics commented 9 months ago

@chainsawriot do you want to give it a try to fix it? I am fine with any solution that does not affect other parts negatively

schochastics commented 9 months ago

obviously there is no stress and this can wait till March