fukamachi / dexador

A fast HTTP client for Common Lisp

get: special chars in url string are not converted, results in 400 bad request #159

Open bo-tato opened 1 year ago

bo-tato commented 1 year ago

This is slightly different from the earlier issue https://github.com/fukamachi/dexador/issues/67, and I think this one should be handled by dexador. If you view the page source of the link in that example, https://www.last.fm/music/Mötley+Crüe, you'll see those characters are already URL-encoded in the href attributes of links, so a crawler extracting and visiting links with dexador won't have a problem there.

However, according to RFC 1738 section 2.2 there is a handful of "unsafe" characters that browsers encode in the URL when making requests: the space plus `<>"#%{}|\^~[]` and the backtick. You can see, for example, in the source of this page that ^, <, and space are not encoded in the href attributes of links, but a browser will encode them when visiting. One of the URIs I get when extracting link hrefs with lquery from that page is "https://docs.rs/signature/<=2.0, <2.1". If you paste that link into your browser it works fine, because the browser encodes < and the space, but if you request it with dex:get you get a 400 error.

I added to my code:

(defun encode-uri (uri)
  "Escape unsafe characters in URI according to RFC 1738 section 2.2."
  (when (quri:uri-p uri)
    (setf uri (quri:render-uri uri)))
  (with-output-to-string (out)
    (loop for char across uri
          ;; Percent-encode the RFC 1738 "unsafe" set; copy everything else.
          if (find char " <>\"#%{}|\\^~[]`")
            do (format out "%~2,'0X" (char-code char))
          else
            do (write-char char out))))

I think dexador should encode these characters by default before making requests, as Python's requests library does, for example.
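For comparison, the same character-by-character logic can be sketched in Python (a rough equivalent of the Lisp function above, not requests' actual implementation, which handles more cases):

```python
# RFC 1738 section 2.2 "unsafe" characters: space, <>"#%{}|\^~[]`
UNSAFE = ' <>"#%{}|\\^~[]`'

def encode_uri(uri: str) -> str:
    """Percent-encode RFC 1738 unsafe characters, leaving the rest as-is."""
    return "".join(f"%{ord(c):02X}" if c in UNSAFE else c for c in uri)

print(encode_uri("https://docs.rs/signature/<=2.0, <2.1"))
# -> https://docs.rs/signature/%3C=2.0,%20%3C2.1
```

Note this simple version (like the Lisp one) only handles single-byte characters; non-ASCII characters would need UTF-8 percent-encoding.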

bo-tato commented 1 year ago

I did a quick survey: reqwest (Rust), the Go stdlib, cohttp (OCaml), and requests (Python) all encode the URI like this and make a successful request; the Ruby stdlib doesn't.