alphapapa / plz.el

An HTTP library for Emacs
GNU General Public License v3.0

Feedback? #1

Open · alphapapa opened 5 years ago

alphapapa commented 5 years ago

@skeeto Hi Chris,

I finally decided to put together a "simple" HTTP library inspired by elfeed-curl.el. This doesn't implement much yet, but I wonder if you have any feedback on the basic design.

Probably the most notable issue at the moment is that the async version calls the callback with a parsed response struct, with the body as a string, rather than allowing the callback to work directly in the response buffer. I'll see about making that more flexible.

Also, unlike your library, it doesn't allow fetching of multiple URLs with a single curl invocation. Do you think that's important enough that I should implement that, or would the complexity not be generally worth it?

Thanks for any feedback.

skeeto commented 5 years ago

Nice work. It will be nice to have somewhere better to point people when they ask about reusing elfeed-curl.

Since you're using make-process — something I should have done — you can skip coding-system-for-read and process-connection-type. You're already using :connection-type, so also use :coding. I'm sure you realize this already, but I want to highlight it anyway (especially for anyone else following along): Only the "binary" coding system (or equivalent) makes sense since the input potentially contains multiple different encodings that need to be decoded individually.

Oh, nevermind, I just noticed you're using call-process in the sync version, so I suppose you still need to dynamically bind these variables anyway!
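
For anyone following along, the difference looks roughly like this (a minimal sketch; the buffer names and curl arguments are illustrative, not plz's actual internals):

;; Async: make-process accepts the coding system directly.
(make-process :name "plz-curl"
              :buffer (generate-new-buffer " *plz-curl*")
              :command '("curl" "--silent" "https://example.com")
              :coding 'binary           ; decode per response, not per process
              :connection-type 'pipe)

;; Sync: call-process has no such keyword arguments, so the coding
;; system must be dynamically bound around the call.
(let ((coding-system-for-read 'binary))
  (call-process "curl" nil t nil "--silent" "https://example.com"))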

The type of value returned by plz-get (very clever name, by the way!) varies depending on the arguments. You can get away with this sort of thing in a dynamically typed language, but it would make no sense at all in a statically typed language. This is a guideline I know I've broken myself, and, when I have, I always regret it later: dynamically typed programs should still be consistent about their types. I mean this in the deep sense. A function doesn't just consistently return a list, it returns a list of values of a specific type, and so on — the sort of consistency required by static typing. It's good practice.

It looks like you never call plz-error? Make it a hard rule for async that either plz-success or plz-error is called exactly once per request. (One exception: if the curl program is not found it's fine to report such a fatal error directly, e.g. via make-process failing.) This is essential for later managing a queue atop the request function, and it's one of the things url-retrieve gets completely wrong. With url-retrieve, some errors are reported synchronously some of the time depending on the platform (DNS failures, TCP handshake failures, etc.), sometimes callbacks are called multiple times, and sometimes they're never called at all! As you're already aware, it's basically impossible to build anything reliable on top of url-retrieve.
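
A sketch of how to enforce that invariant in a process sentinel (hypothetical names; sentinels can fire for several events, so the explicit flag keeps each request's dispatch to exactly once):

;; Hypothetical sentinel: dispatch exactly one callback per request.
(defun my-curl-sentinel (process _event)
  (unless (process-live-p process)
    (unless (process-get process :done)
      (process-put process :done t)
      (let ((status (process-exit-status process)))
        (if (zerop status)
            (funcall (process-get process :on-success) process)
          (funcall (process-get process :on-error) process status))))))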

One of the things I enjoy about not making elfeed-curl a general URL fetching library is I get to take some shortcuts. For example, I don't quite fully parse the response headers. Elfeed is the only consumer of those headers, and it doesn't need them to be carefully parsed, so it doesn't matter. But for a general library, you perhaps want to follow through. Technically I believe you're supposed to decode header text values as ISO-8859-1, and perhaps also handle RFC 2047 MIME decoding. (How much does curl handle here already? I don't know, but you can probably rely on the headers being at least syntactically valid.)
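
For reference, decoding a raw header value under those rules could look like this (a sketch; rfc2047.el ships with Emacs, and the helper name is made up):

(require 'rfc2047)

(defun my-decode-header-value (raw)
  "Decode RAW header bytes: ISO-8859-1 by default, plus RFC 2047 encoded words."
  (rfc2047-decode-string
   (decode-coding-string raw 'iso-8859-1)))

(my-decode-header-value "=?UTF-8?B?SGVsbG8sIHdvcmxk?=")
;; => "Hello, world"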

It's important for elfeed-curl to support multiple URLs in a single invocation. The goal isn't to reduce the number of curl processes, but to exploit HTTP/1.1 keep-alive when those URLs all belong to the same domain. If you're following multiple feeds hosted on the same site (e.g. a bunch of YouTube channels), only one TCP connection needs to be established to request all those feeds. The story is even better with HTTP/2. (I think? I don't know how well curl employs it.) It puts less stress on the server and is just generally a nice thing to do. The server is also less likely to penalize such a burst of requests if they use HTTP efficiently. The machinery to make this work is a bit complicated as you've seen (random delimiter "tokens"), so you may not think it's worth the trouble for your library, but it is important for some applications, like Elfeed.
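
The shape of the trick, roughly: pass curl several URLs in one invocation and have --write-out print a unique token after each transfer, so the single output buffer can be split back into per-request responses. A sketch of the idea, not elfeed-curl's exact arguments:

;; One curl process, reusing connections per host, N responses.
;; The random token cannot plausibly collide with response bodies.
(let ((token (format "--%s--" (md5 (format "%s%s" (random) (float-time))))))
  (make-process
   :name "curl-multi"
   :buffer (generate-new-buffer " *curl-multi*")
   :coding 'binary
   :command (list "curl" "--silent" "--include"
                  "--write-out" (concat token "\n")
                  "https://example.com/feed1.xml"
                  "https://example.com/feed2.xml")))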

Similar arguments can be made for Last-Modified and ETag headers, which reduce the load on servers and are just good netiquette. Elfeed handles these headers at a higher level, and it's probably also out of scope for plz for the same reason. But it's something to consider.
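
Concretely, that means sending conditional requests built from a previous response's headers, so the server can answer 304 Not Modified with no body. A sketch (the ETag value here is made up):

(call-process "curl" nil t nil
              "--silent" "--include"
              "--header" "If-None-Match: \"abc123\""
              "--header" "If-Modified-Since: Wed, 23 Oct 2019 10:39:09 GMT"
              "https://example.com/feed.xml")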

Returning a string of content is certainly simpler, and it's not wrong per se. I prefer the buffer-passing method of elfeed-curl purely for the sake of efficiency. Unlike strings, buffers can be "sliced" using narrowing, so there's less garbage, fewer allocations, and less copying when presenting a narrowed process output buffer to callbacks. In Elfeed's case, the result needs to be stored in a buffer anyway for the XML parser, so strings are an even less attractive option.
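
Sketched, the buffer-passing approach looks like this: narrow the process buffer to the body and run the callback inside it, so no intermediate string is ever allocated (illustrative code, not elfeed-curl's internals):

(defun my-call-with-body (buffer body-start callback)
  "Narrow BUFFER to the response body and call CALLBACK inside it."
  (with-current-buffer buffer
    (save-restriction
      (narrow-to-region body-start (point-max))
      ;; CALLBACK sees only the body, e.g. (json-read) or
      ;; (libxml-parse-html-region (point-min) (point-max)).
      (funcall callback))))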

alphapapa commented 5 years ago

Very thorough review. Thanks very much, Chris. Your guidance is invaluable.

alphapapa commented 5 years ago

Since you're using make-process — something I should have done — you can skip coding-system-for-read and process-connection-type. You're already using :connection-type, so also use :coding.

Thanks, done. There are a lot of process-related details like that in Emacs that I haven't worked with before, so I appreciate your pointing them out.

I'm sure you realize this already, but I want to highlight it anyway (especially for anyone else following along): Only the "binary" coding system (or equivalent) makes sense since the input potentially contains multiple different encodings that need to be decoded individually.

I guessed that was the case; thanks for confirming it.

Oh, nevermind, I just noticed you're using call-process in the sync version, so I suppose you still need to dynamically bind these variables anyway!

See below...

The type of value returned by plz-get (very clever name, by the way!) varies depending on the arguments. You can get away with this sort of thing in a dynamically typed language, but it would make no sense at all in a statically typed language. This is a guideline I know I've broken myself, and, when I have, I always regret it later: dynamically typed programs should still be consistent about their types. I mean this in the deep sense. A function doesn't just consistently return a list, it returns a list of values of a specific type, and so on — the sort of consistency required by static typing. It's good practice.

Yeah, I was a little uncomfortable with that too, so I've followed your advice and split the synchronous calls into separate functions. There's a bit of code duplication, but not too much, and maybe that can be improved later.

So plz-get always returns the curl process now, and plz-get-sync returns an object as selected with its :as argument.

Make it a hard rule for async that either plz-success or plz-error is called exactly once per request.

Thanks. That should be the case. If it ever isn't, I'll fix it! :)

It looks like you never call plz-error?...

Errors are now handled: by default, if no error callback is given, one of two errors is signaled: plz-curl-error, when the curl process itself fails (e.g. DNS resolution), or plz-http-error, when the server returns an HTTP error status. Each carries a plz-error struct as its data.

If an error handler is given with the ELSE argument, it's called instead of signaling an error, with the same plz-error struct as its only argument. Either way, a plz-error struct describes the failure. For example:

(plz-get-sync "https://httpbinnnnnn.org/get/status/404")
;; Signals an error with:
;; (plz-curl-error . #s(plz-error :curl-error (6 . "Couldn't resolve host. The given remote host was not resolved.") :response nil))

(plz-get-sync "https://httpbin.org/get/status/404")
;; Signals an error with:
;; (plz-http-error .
;;  #s(plz-error :curl-error nil
;;               :response #s(plz-response :version 1.1 :status 404 :headers
;;                                         (("Access-Control-Allow-Credentials" . "true")
;;                                          ("Access-Control-Allow-Origin" . "*")
;;                                          ("Content-Encoding" . "gzip")
;;                                          ("Content-Type" . "text/html")
;;                                          ("Date" . "Wed, 23 Oct 2019 10:39:09 GMT")
;;                                          ("Referrer-Policy" . "no-referrer-when-downgrade")
;;                                          ("Server" . "nginx")
;;                                          ("X-Content-Type-Options" . "nosniff")
;;                                          ("X-Frame-Options" . "DENY")
;;                                          ("X-XSS-Protection" . "1; mode=block")
;;                                          ("Content-Length" . "198")
;;                                          ("Connection" . "keep-alive"))
;;                                         :body "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 3.2 Final//EN\">\n<title>404 Not Found</title>\n<h1>Not Found</h1>\n<p>The requested URL was not found on the server.  If you entered the URL manually please check your spelling and try again.</p>\n")))

;; Async error handler:
(plz-get "https://httpbin.org/get/status/404"
  :as 'string
  :then (lambda (s)
          (message "%s" s))
  :else (lambda (err)
          (message "OOPS: %S" (plz-response-status (plz-error-response err)))))

One of the things I enjoy about not making elfeed-curl a general URL fetching library is I get to take some shortcuts. For example, I don't quite fully parse the response headers. Elfeed is the only consumer of those headers, and it doesn't need them to be carefully parsed, so it doesn't matter. But for a general library, you perhaps want to follow through. Technically I believe you're supposed to decode header text values as ISO-8859-1, and perhaps also handle RFC 2047 MIME decoding. (How much does curl handle here already? I don't know, but you can probably rely on the headers being at least syntactically valid.)

Thanks, I'll look into those issues.

The machinery to make this work is a bit complicated as you've seen (random delimiter "tokens"), so you may not think it's worth the trouble for your library, but it is important for some applications, like Elfeed.

I think I'll save that for the future, but it sounds like a feature that ought to be supported. Ideally this library would be efficient and support cases like that.

Similar arguments can be made for Last-Modified and ETag headers, which reduce the load on servers and are just good netiquette. Elfeed handles these headers at a higher level, and it's probably also out of scope for plz for the same reason. But it's something to consider.

Hm, yeah, I feel like that's inching toward being a browser[0], haha. But if it's needed, it should be supported, so I'll see what happens. I'll think about ways to better expose headers than only through the plz-response struct.

Returning a string of content is certainly simpler, and it's not wrong per se. I prefer the buffer-passing method of elfeed-curl purely for the sake of efficiency. Unlike strings, buffers can be "sliced" using narrowing, so there's less garbage, fewer allocations, and less copying when presenting a narrowed process output buffer to callbacks. In Elfeed's case, the result needs to be stored in a buffer anyway for the XML parser, so strings are an even less attractive option.

I've changed the signatures of the get functions to let the user select the type of object to be returned or passed to the callback. I hope this API is clear and easy to understand, but please let me know what you think. Here are some examples from the tests:

;; Call callback with response body as a decoded string
;; (decoding is automatic unless ":decode nil" is used).
(plz-get "https://httpbin.org/get"
  :as 'string
  :then (lambda (string)
          (setf test-string string)))

;; Call callback with response body as a parsed JSON object.
;; json-read is called in the response buffer with it narrowed to the
;; response body, after decoding it.
(let* ((test-json)
       (process (plz-get "https://httpbin.org/get"
                  :as #'json-read
                  :then (lambda (json)
                          (setf test-json json)))))
  ;; (The test waits for PROCESS to finish before checking
  ;; test-json; as written here, the value is not yet set.)
  test-json)

;; Return the response body as a non-decoded, binary string (in this
;; case, a JPEG image).
(image-type-from-data
  (plz-get-sync "https://httpbin.org/image/jpeg" :decode nil))
;; => jpeg

Next on the to-do list are probably POST requests and file uploads. At that point I can probably start trying to use it in my own packages, like matrix-client (which currently uses request for uploads and url for everything else, due to issues in each).

I know you're busy, so I appreciate your feedback!

0: This is mildly amusing; funny how easy it is in Emacs:

(defun plz-browse (url)
  (interactive "sURL: ")
  (let ((dom (plz-get-sync url
               :as (lambda ()
                     (libxml-parse-html-region (point-min) (point-max) url)))))
    (with-current-buffer (generate-new-buffer "*plz-browse*")
      (shr-insert-document dom)
      (pop-to-buffer (current-buffer)))))
skeeto commented 5 years ago

You don't necessarily need to split async and sync into different functions if you instead make these two modes return the same type. For example, perhaps it always returns a struct representing the request, and the process object is a slot on that struct. For sync it's always the dead process object, and for async it's a live or dead process object. Once the request is resolved, perhaps it's even populated with the results (status, headers, maybe the content, etc.).
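
Something like this hypothetical struct (a sketch of the suggestion, not anything plz implements):

(require 'cl-lib)

(cl-defstruct my-plz-request
  "One HTTP request, synchronous or asynchronous."
  process   ; always dead for sync; live or dead for async
  status    ; nil until the request resolves
  headers   ; alist, nil until resolved
  body)     ; string, nil until resolved

;; Both entry points would then return the same type:
;; (my-plz-request-process (my-plz-get url))      ; a live process
;; (my-plz-request-status (my-plz-get-sync url))  ; e.g. 200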

I should have commented on this before, but I think it's a little dubious for plz to make a decision about a successful HTTP response being an "error" or not. That may depend on the context and the nature of the request. For example, a "410 Gone" might not truly be an error, but rather a signal to the application that some resource has been removed, and the application is expecting this result eventually. A failure to get a complete, valid HTTP response (DNS failure, protocol violation, certificate issues, timeouts, connection errors, etc.) is definitely an error, though.

Side note: Did you notice that the curl error codes are all less than 100? That's no accident. It means you can use curl's own "0xx" status codes in-band with HTTP status codes. Every possible error has an associated, unique numeric code, and everything from 001 to 099 is a hard error.
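
Which means a single numeric field can carry both kinds of status; a sketch of the dispatch this enables:

(defun my-classify-status (code)
  "Classify CODE, where curl exit codes share one field with HTTP statuses."
  (cond ((zerop code) 'ok)          ; curl exit 0: transfer completed
        ((< code 100) 'curl-error)  ; 001-099: curl's own hard errors
        (t 'http-status)))          ; 100+: an actual HTTP status code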

I see for :as you allow "buffer" or a function that's called with a narrowed buffer. Narrowing is a temporary and exclusive state, which is why it makes sense to invert control and narrow for the duration of a callback. Here's an alternative to explore: indirect buffers. You could create an indirect buffer backed by the curl output buffer. Indirect buffers can be narrowed independently (I think?). Unlike when control is inverted, the caller still needs to remember to clean up when done (kill the buffer), though.
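
A sketch of that alternative (indirect buffers do keep their own narrowing, independent of the base buffer):

;; BASE is the curl output buffer; the caller gets an independently
;; narrowed view of the body and must kill it when finished.
(defun my-body-view (base body-start)
  (let ((view (make-indirect-buffer base (generate-new-buffer-name " *body*"))))
    (with-current-buffer view
      (narrow-to-region body-start (point-max)))
    view))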

Have you played around with my aio library? If there were an aio wrapper around plz-get, aio-plz-get, then your final example could be essentially the same as you've written it, but fully async. The "(plz-get-sync ...)" just becomes "(aio-await (aio-plz-get ...))".
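
The wrapper would be small. A hypothetical sketch of aio-plz-get using aio's promise primitives (not an existing function in either library):

(require 'aio)
(require 'plz)

(defun aio-plz-get (url)
  "Like `plz-get', but return an aio promise for the response body."
  (let ((promise (aio-promise)))
    (plz-get url
      :as 'string
      :then (lambda (string)
              (aio-resolve promise (lambda () string)))
      :else (lambda (err)
              (aio-resolve promise (lambda () (error "plz: %S" err)))))
    promise))

;; Inside an `aio-defun', the sync-looking call becomes:
;; (aio-await (aio-plz-get "https://example.com"))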

alphapapa commented 1 year ago

@skeeto I've finally made an aio wrapper for plz. It seems to work well. Please see:

Since plz is on GNU ELPA and aio is on MELPA, I don't know if I could reasonably put an aio-plz package on GNU ELPA. Would you be willing to consider moving aio to GNU ELPA, or to non-GNU ELPA?