This is a slightly different issue than the previous https://github.com/fukamachi/dexador/issues/67, and I think this one should be handled by dexador.
If you view the page source of the link in their example, https://www.last.fm/music/Mötley+Crüe, you'll see those characters are already URL-encoded in the href attributes of the links, so a crawler extracting and visiting links with dexador has no problem there. According to RFC 1738 section 2.2, however, there is a handful of "unsafe" characters that browsers encode in the URL when making a request: " <>\"#%{}|\^~[]`"

You can see, for example, in the source of this page that ^, <, and space are not encoded in the href attributes of links, but a browser will encode them when visiting. One of the URIs I get when extracting link hrefs from the page with lquery is "https://docs.rs/signature/<=2.0, <2.1". If you paste that link into your browser it works fine, because the browser encodes < and space, but if you request it with dex:get you get a 400 error.
I worked around it by adding this to my code:
(defun encode-uri (uri)
  "Escape unsafe characters in URI according to RFC 1738 section 2.2."
  ;; Accept either a QURI object or a string.
  (when (quri:uri-p uri)
    (setf uri (quri:render-uri uri)))
  (apply #'concatenate 'string
         (map 'list (lambda (char)
                      ;; Percent-encode the RFC 1738 "unsafe" set;
                      ;; pass every other character through unchanged.
                      (if (find char " <>\"#%{}|\\^~[]`")
                          (format nil "%~2,'0X" (char-code char))
                          (string char)))
              uri)))
I think dexador should encode those characters by default before making requests, as the Python requests library does.
I did a quick survey: reqwest (Rust), Go (stdlib), cohttp (OCaml), and requests (Python) all encode the URI like this and make a successful request; Ruby (stdlib) doesn't.
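To make the expected behavior concrete, here is the same RFC 1738 section 2.2 escaping ported to Python as a rough sketch. This mirrors the Lisp function above; it is not how requests implements it internally, and the name encode_uri is just illustrative:

```python
# The "unsafe" characters from RFC 1738 section 2.2 that browsers
# percent-encode before sending a request.
UNSAFE = " <>\"#%{}|\\^~[]`"

def encode_uri(uri: str) -> str:
    """Percent-encode RFC 1738 'unsafe' characters in a URI string,
    leaving all other characters untouched."""
    return "".join(f"%{ord(c):02X}" if c in UNSAFE else c for c in uri)

print(encode_uri("https://docs.rs/signature/<=2.0, <2.1"))
# https://docs.rs/signature/%3C=2.0,%20%3C2.1
```

Note that, like the Lisp version, this encodes a literal % as well, so it would double-encode a URI that is already percent-encoded; a library doing this by default would presumably need to handle that case.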