gesistsa / adaR

:computer: wrapper for ada-url a WHATWG-compliant and fast URL parser written in modern C++
https://gesistsa.github.io/adaR/

Feature Request: `ada_get_basename` #56

Closed JBGruber closed 11 months ago

JBGruber commented 11 months ago

Probably a fringe use case, but the other day I tried to read the HTML from the root of a website and thought ada_get_domain would get me there.

adaR::ada_get_domain("https://github.com/schochastics/adaR/issues") |> 
  rvest::read_html()
#> Error: 'github.com' does not exist in current working directory ('/tmp/RtmpWgmD8k/reprex-95ac10e83d89-wax-mouse').

Unfortunately, without the protocol the domain is interpreted as a local path. It would be fantastic if there were a function to get the base name. This is roughly the behaviour I would expect:

ada_get_basename <- function(x) {
  sub(adaR::ada_get_pathname(x), "", x, fixed = TRUE)
}
ada_get_basename("https://github.com/schochastics/adaR/issues") |> 
  rvest::read_html()
#> {html_document}
#> <html lang="en" data-a11y-animated-images="system" data-a11y-link-underlines="true">
#> [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
#> [2] <body class="logged-out env-production page-responsive header-overlay hom ...

Created on 2023-10-05 with reprex v2.0.2

Thanks for considering!

schochastics commented 11 months ago

URLs are funny and have a crazy amount of corner cases. I think this is more stable:

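ada_get_basename <- function(x) {
  # ada_get_protocol() includes the trailing colon, e.g. "https:"
  protocol <- adaR::ada_get_protocol(x)
  host <- adaR::ada_get_hostname(x)
  # Rebuild the base URL from its parsed parts instead of editing the string
  paste0(protocol, "//", host)
}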
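To make the "corner cases" remark concrete, here is a minimal sketch of where the sub()-based approach can go wrong and how a component-based version might guard against unparsable input. Both `strip_path()` and `build_base_url()` are hypothetical helpers written for this illustration, not part of adaR. The pathname is passed in explicitly so the first part runs without the package, and the NA handling rests on the assumption that the adaR accessors return NA for URLs they cannot parse.

```r
# Sketch only: strip_path() mirrors the sub()-based idea from the first comment.
# The pathname is supplied by hand here instead of coming from a URL parser.
strip_path <- function(url, pathname) {
  sub(pathname, "", url, fixed = TRUE)
}

# Hypothetical helper: join protocol and host, propagating NA for bad input
# (paste0() alone would produce the string "NA//NA").
build_base_url <- function(protocol, host) {
  out <- paste0(protocol, "//", host)
  out[is.na(protocol) | is.na(host)] <- NA_character_
  out
}

# A non-trivial path is removed as intended.
strip_path("https://github.com/schochastics/adaR/issues", "/schochastics/adaR/issues")
#> [1] "https://github.com"

# A root path "/" matches the first slash of "https://" instead.
strip_path("https://github.com/", "/")
#> [1] "https:/github.com/"

# Building from components avoids the string matching and keeps NA as NA.
build_base_url(c("https:", NA), c("github.com", NA))
#> [1] "https://github.com" NA

# With adaR installed, the components would come from the parser.
if (requireNamespace("adaR", quietly = TRUE)) {
  x <- "https://github.com/schochastics/adaR/issues?q=url"
  print(build_base_url(adaR::ada_get_protocol(x), adaR::ada_get_hostname(x)))
}
```

Because the component-based version never searches the original string, a query string or fragment in the input also cannot leak into the result.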

I've put adding this function on the agenda for 0.3.

schochastics commented 11 months ago

@JBGruber this will go to CRAN today or tomorrow.