clj-commons / aleph

Asynchronous streaming communication for Clojure - web server, web client, and raw TCP/UDP
http://aleph.io
MIT License
2.54k stars 241 forks source link

HTTP/2 server #687

Closed KingMob closed 7 months ago

KingMob commented 10 months ago

The code is ready for review, though there are admittedly, some details to work out and tests to fix.

Some things to know:

  1. There's extremely little continuity in the Netty code, so a lot of things are more different than they need to be. Some of that is genuinely caused by how HTTP/2 has a completely different binary format and multiplexed streams, unlike HTTP/1, but a lot is just Netty doing it differently over time.

  2. Organizationally, truly common code has been moved to a.http.common. Code specific to HTTP2, but shared between client and server is in a.http.http2. HTTP1-specific code is left wherever it was, in a.http.core/server/client. Public functions in a.http.server and a.http.client should handle both HTTP versions. This is messier than I'd like, but since we never hid the http sub-namespaces, there's a chance someone's code is using them. Otherwise, I'd have made a much cleaner design.

  3. Missing features: this does not support proxies or multipart. Multipart is kind of pointless in HTTP2 (just open a new stream for each part), but understandably some people want to keep using it. I think it's fine to tell them to wait, since Netty lacks all support for it in HTTP2, and we'd have to figure something out. I was working on shifting to the HTTP1 pathway when multipart is used, but never finished.

  4. pipeline-transform - fundamentally this assumed a single pipeline, but multiplexing breaks that. We now have 2 types of pipeline: a general connection pipeline that handles common concerns like SSL and HTTP frame decoding, and many (identical) stream pipelines that each handle one stream, aka a request/response pair. There's no safe way to know what the user intended with pipeline-transform, so I think perhaps two new params (conn-pipeline-transform and stream-pipeline-transform?) should be added, and an error should be thrown if the user supplies only pipeline-transform, and is using http2. It's not ideal, but it will prevent subtle problems.

  5. Some good news: While I think the multiplexing and Netty changes added some complexity, I believe the fundamental core is actually simpler in the HTTP2 code, because requests/responses are isolated in their own streams. There's no crazy state-tracking to know when one request/resp is finished, nor are there multiple queues feeding in/out on the client side.

  6. The broken tests are a mix of fragile, weird, and different.


This change is Reviewable

KingMob commented 8 months ago

Weird, @DerGuteMoritz , I can't see your comments that I got an email for. Did you delete them?

The comment is a bit redundant now given that we're in the :dev profile anyway

IIRC, that comment is a reminder to not try and use BouncyCastle in production, since it's apparently quite slow. I forget why now, but I was having some issue getting JDK/OpenSSL to generate certs for some reason...

Not generally true: If you control your clients and supply them with your self-signed server certificate, it's perfectly fine in production, too

That's a good point, I'll soften the language.

arnaudgeiser commented 8 months ago

I don't know if it's on me or not, but I'm not able to either run lein test nor start a REPL with Leiningen 2.8. It always ends up with a StackOverflowError. (I don't have this issue with current master)

❯ lein test             
Pathological dependency tree detected.
Consider setting :pedantic? false in project.clj to bypass.
Execution error (StackOverflowError) at (REPL:1).
null

Full report at:
/tmp/clojure-12621424673382473380.edn

I bumped to Leiningen 2.10.0 I don't have this issue anymore.

KingMob commented 8 months ago

@arnaudgeiser Never seen that problem, but I'm already on lein 2.10.0, for no idea how long. What was in that report?

arnaudgeiser commented 8 months ago

Anyway, it's not a "problem" since bumping to 2.1.0 is enough to fix it. But there is something that changed underneath with this PR and I don't have a problem with any other projects I'm working on.

❯ cat /tmp/clojure-12621424673382473380.edn
{:clojure.main/message
 "Execution error (StackOverflowError) at (REPL:1).\nnull\n",
 :clojure.main/triage
 {:clojure.error/class java.lang.StackOverflowError,
  :clojure.error/phase :execution},
 :clojure.main/trace
 {:via
  [{:type java.lang.StackOverflowError,
    :at [clojure.lang.RT boundedLength "RT.java" 1793]}],
  :trace
  [[clojure.lang.RT boundedLength "RT.java" 1793]
   [clojure.lang.AFn applyToHelper "AFn.java" 148]
   [clojure.lang.RestFn applyTo "RestFn.java" 132]
   [clojure.core$apply invokeStatic "core.clj" 669]
   [clojure.core$mapcat invokeStatic "core.clj" 2787]
   [clojure.core$tree_seq$walk__6403$fn__6404 invoke "core.clj" 4934]
   [clojure.lang.LazySeq sval "LazySeq.java" 42]
   [clojure.lang.LazySeq seq "LazySeq.java" 51]
   [clojure.lang.RT seq "RT.java" 535]
   [clojure.core$seq__5419 invokeStatic "core.clj" 139]
   [clojure.core$concat$fn__5510 invoke "core.clj" 727]
   [clojure.lang.LazySeq sval "LazySeq.java" 42]
   [clojure.lang.LazySeq seq "LazySeq.java" 51]
   [clojure.lang.RT seq "RT.java" 535]
   [clojure.core$seq__5419 invokeStatic "core.clj" 139]
   [clojure.core$concat$cat__5512$fn__5513 invoke "core.clj" 736]
   [clojure.lang.LazySeq sval "LazySeq.java" 42]
   [clojure.lang.LazySeq seq "LazySeq.java" 51]
   [clojure.lang.RT seq "RT.java" 535]
KingMob commented 8 months ago

@arnaudgeiser I found this in Leiningen's NEWS.md. Looks like it was fixed in 2.9.9

Fix a bug in pedantic checks which resulted in infinite loops.

KingMob commented 7 months ago

@DerGuteMoritz I think this is the only thing left: https://reviewable.io/reviews/clj-commons/aleph/687#-

KingMob commented 7 months ago

I'm going to merge this for now, and hold off on cutting a release until @DerGuteMoritz's http-versions PR merges

DerGuteMoritz commented 7 months ago

(Responding via GH now since Reviewable doesn't allow me to respond to a closed PR anymore it seems)

@KingMob

I'm not following you. It looks to me that raw handlers in server use a buffered stream too.

See line 463 in server.clj: s (netty/buffered-source (netty/channel ctx) #(.readableBytes ^ByteBuf %) buffer-capacity)

(They don't buffer when they see a FullHttpRequest, but that's because it's going to be opened and closed almost immediately.)

Ah damn, you're right, I mixed it up with aleph.tcp which always uses unbuffered streams, regardless of the :raw-stream? setting even (might be worth changing here, too ...). Then when I wanted to confirm my memory, I only took a quick look and indeed ended up in the branch for FullHttpRequest without realizing it. Sorry for the confusion :pray:

I don't think using an unbuffered stream will actually prevent leaks. If the user code doesn't consume all ByteBufs and free them, there will be leaks regardless of where they're located.

Indeed and even in the case of an unbuffered stream, there may still be one pending buffer there which could be leaked. Alright then!

Heh. It's less a problem of inheritance, and more about encapsulation.

Right, and implementation inheritance makes it very easy to break encapsulation accidentally!

Netty moves slowly, and there's no guarantee they'd open up the permissions just for Aleph, so I'm reserving my interactions with Netty for more important issues.

Seems wise.

Just the test suite. IIRC, the issue is the built in JDK SSL layer, BoringSSL/OpenSSL have different failure modes.

Ah good, yes, +1 on that!