Ironholds / urltools

Elegant URL handling in R

`url_parse` does not parse correctly with google maps url #98

Open shunyamaya opened 4 years ago

shunyamaya commented 4 years ago

Hi, thanks for developing the package. I noticed that url_parse (and all of the other functions that depend on it) behaves strangely with Google Maps URLs.

library(urltools)
google_maps <- "https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519"
url_parse(google_maps)

  scheme                       domain port
1  https 40.7519848,-74.0015045,14.7z <NA>
                                                                                path
1 data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519
  parameter fragment
1      <NA>     <NA>

Can this be fixed? Thanks!

Ironholds commented 4 years ago

I don't think so? Google's URLs are...very much not one's friend :(. One way of fixing it might be to url_encode the path for the parsing operation? How consistent are the URL portions /before/ the path?

hrbrmstr commented 4 years ago

{curlparse} handles this if you need something in the interim.

dplyr::glimpse(
  curlparse::parse_curl("https://www.google.com/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420634fd7eba!8m2!3d40.750568!4d-73.993519")
)
Observations: 1
Variables: 9
$ scheme   <chr> "https"
$ user     <chr> NA
$ password <chr> NA
$ host     <chr> "www.google.com"
$ port     <chr> "443"
$ path     <chr> "/maps/place/Pennsylvania+Station/@40.7519848,-74.0015045,14.7z/data=!4m5!3m4!1s0x89c259ae15b2adcb:0x7955420…
$ options  <chr> NA
$ query    <chr> NA
$ fragment <chr> NA

BernhardClemm commented 1 year ago

My current, hacky way of dealing with this is to modify the URL before passing it to urltools:

url <- "https://www.google.com/maps/@42.4939588,-54.8994772,3z?entry=ttu"
domain <- urltools::domain(gsub("@", "%40", url))

So it seems that the @ is causing the problem? Is there no way to fix this within the package?
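A narrower variant of the same workaround, as a rough base R sketch: in URL syntax an `@` inside the authority separates credentials from the host, so replacing every `@` would also mangle URLs of the form `user:password@host`. The helper below escapes `@` only after the authority. `escape_path_at` is a hypothetical name, not part of urltools, and the regex assumes the URL starts with `scheme://`; anything else is returned unchanged.

```r
# Hypothetical helper (not part of urltools): percent-encode "@" only in the
# part of the URL that follows the authority (scheme://[userinfo@]host[:port]).
escape_path_at <- function(urls) {
  # Group 1: scheme and authority, up to the first "/", "?" or "#".
  # Group 2: everything after that (path, query, fragment).
  pattern <- "^([A-Za-z][A-Za-z0-9+.-]*://[^/?#]*)(.*)$"
  matched <- grepl(pattern, urls)
  head <- sub(pattern, "\\1", urls[matched])
  rest <- sub(pattern, "\\2", urls[matched])
  # URLs that do not match the pattern are left as they are.
  urls[matched] <- paste0(head, gsub("@", "%40", rest, fixed = TRUE))
  urls
}

url <- "https://www.google.com/maps/@42.4939588,-54.8994772,3z?entry=ttu"
print(escape_path_at(url))

# Only call urltools if it is installed; the result is not asserted here.
if (requireNamespace("urltools", quietly = TRUE)) {
  print(urltools::domain(escape_path_at(url)))
}
```

Since the escaped URL is only fed to the parser, the original string can be kept alongside it for anything that needs the unmodified path.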

hrbrmstr commented 1 year ago

Or you could just use the {curlparse} package?

BernhardClemm commented 1 year ago

@hrbrmstr Because I also need the suffix_extract() function from urltools, and I don't want to import more packages than necessary.

I see that your other package psl has some useful functions in that regard, but it's not on CRAN :(