clj-commons / aleph

Asynchronous streaming communication for Clojure - web server, web client, and raw TCP/UDP
http://aleph.io
MIT License
2.54k stars 241 forks source link

netty: do not push bindings on operation-complete #652

Closed arnaudgeiser closed 1 year ago

arnaudgeiser commented 1 year ago

Description

This removes the usage of bound-fn* which has been introduced by the following PR. [1] It was added for safety while there might be no open issue regarding it. For some workloads, this operation can take up to 40% of CPU.

Here is the reason why we used to push the bindings [2], but I suppose we fixed it by using ensure-dynamic-classloader instead.

image

Here is the gist of the discussion we had with @KingMob :

Oh yeah, because the code runs on netty and signal threads that aren’t started by the Clojure test, so there’s no automatic thread frame propagation, defeating the use of binding under the hood. Well, the usual solution to interop with Java threads is bound-fn , and that works here. Truthfully, we could probably use bound-fn more when handing things off to Netty to run, it’s always safer. Anyway, this works:

(deftest test-classloader
  (testing "classloader: ensure the class loader is always a DynamicClassLoader"
    (let [result (CompletableFuture.)]
      (with-dynamic-redefs [netty/operation-complete (partial operation-complete result)]
        (let [server (http/start-server
                      (constantly {:body "ok"})
                      {:port 9999})]
          (on-signal :int
                     (bound-fn [_]
                       (.close ^java.io.Closeable server)))
          (.exec (Runtime/getRuntime) (format "kill -SIGINT %s" (pid)))
          (is (= (deref result 30000 ::timeout) true)))))))
(defn wrap-future
  [^Future f]
  (when f
    (if (.isSuccess f)
      (d/success-deferred (.getNow f) nil)
      (let [d (d/deferred nil)
            bound-operation-complete (bound-fn* operation-complete)]
        (.addListener f
          (reify GenericFutureListener
            (operationComplete [_ _]
              (ensure-dynamic-classloader)
              (bound-operation-complete f d))))
        d))))

[9 h 28] The use of bound-fn in the test ensures it inherits the dynamic redef frame, and bound-fn* in wrap-future ensures the listener will work even if .close->wrap-future is called from a signal thread [> 9 h 29] We shold probably check everywhere we accept user callbacks for netty and wrap them in bound-fn for safety

[1] : https://github.com/clj-commons/aleph/pull/604 [2] : https://github.com/clj-commons/aleph/pull/425

KingMob commented 1 year ago

Just breezing through on a spare moment. I haven’t had time to take a closer look, but I did notice one mistake: the Clojure classloader isn’t responsible for thread frame propagation. They’re separate issues, and using the classloader alone isn’t sufficient. Even with the classloader in place, binding frame propagation requires the use of Clojure fns to start threads; with Java code, frames won’t be propagated.

arnaudgeiser commented 1 year ago

Yes, I totally agree with your observation and your conclusion. The question is more: Do we really need thread bindings propagation? That's unlikely people using an asynchronous mechanism expect to have the bindings of the calling threads propagated down, right? And we are talking only about dynamic bindings, here, right?

I mean, a lot of people are not relying on manifold and use CachedThreadPool or FixedThreadPool directly without trouble. I have the impression that "we" (not us actually) have done a lot of tinkering around this issue in the past... I would like to reconsider the status-quo.

KingMob commented 1 year ago

OK, I took a closer look to refresh my memory, and I'd forgotten why we started frame propagation in the first place.

While it's true that the classloader is not responsible for frame propagation, I had the direction reversed. It's the classloader we care about, not frame propagation per se. We were only propagating frames to try and make the LOADER var used by the compiler available to the signal handler thread. (I think.)

We need the Clojure class loader set for all netty threads (not just the signal handler thread) but we switched from setting the LOADER var, to using .setContextClassLoader on the thread. I think that should be sufficient. The comments say we use bound-fn* to "Ensure the same bindings are installed on the Netty thread (vars, classloader)", except... the operation-complete fn we're binding doesn't use any dynamic vars itself. Maybe there was a concern that the call to .getNow() would trigger some code, but I just looked at all definitions in Netty, and that's not an issue.

All of which is to say, yeah, we can probably stop using bound-fn on operation-complete

arnaudgeiser commented 1 year ago

All of which is to say, yeah, we can probably stop using bound-fn on operation-complete

Perfect, polishing the PR then.

arnaudgeiser commented 1 year ago

For the record @KingMob, here is where everything started : https://github.com/clj-commons/aleph/pull/425, where we put two mechanisms in place.