envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Add support for terminating QUIC #2557

Closed (alyssawilk closed this 3 years ago)

alyssawilk commented 6 years ago

We’ve had a few off-line discussions on how to maybe go about this with @Mattklein123 but we’re now close enough it’s worth filing a tracking issue for it :-)

Here’s our plan A - any Envoy devs interested in QUIC support are encouraged to offer suggestions/improvements

Milestone 1 is to hack together “something which builds” using the existing Google QUIC code*, because as it turns out it takes years to debug all the weird corner cases in congestion control and crypto, and it’s likely easier to copy existing code than write it all from scratch. That code base is “supposed to” build fairly cleanly as long as one implements all the things in quic/platform/impl but almost certainly doesn’t. This will be done in @juvexp’s Envoy branch to allow rapid iteration - any interested parties are welcome to follow along and/or contribute there. Once the QUIC code builds and QUIC unit tests pass, it’s on to Milestone 2.

Milestone 2 is to get QUIC “working” with Envoy, where we have working integration tests. This may not include clean Envoy API use. For example, the first pass might treat QUIC as one logical Connection rather than one-Connection-per-stream. Milestone 2 will likely involve landing some code in Envoy (things like UDP listeners if @cmluciano hasn’t beaten us to it).

Milestone 3 is having QUIC fully landed in Envoy, with proper API wrappers for all the various codec / connection / crypto functionality QUIC supports. This will involve slowly cleaning up anything still only in @juvexp’s Envoy branch, along with having a story for gracefully handling upstream/downstream updates and contributions.

*Currently visible at https://github.com/chromium/chromium/tree/master/net/quic. Plot twist: while we’re hacking around in a custom branch, the Google devs are likely to use “upstream” QUIC code, since then they can make changes to code structure and push them directly to Envoy rather than pushing from google3-upstream to Chrome to Envoy, which would really slow development. Long-term code syncing is a big glaring TBD, as we need a library non-Googlers can contribute to, but also want to cheaply and easily get the latest security fixes and IETF spec updates from the dedicated Google dev team working on QUIC.

ggreenway commented 6 years ago

How close of a match is the quic threading model to envoy's?

Any known issues or expected pain points integrating google-quic into envoy?

mattklein123 commented 6 years ago

@alyssawilk Thanks for the great summary of the plan! Small ask: we are already tracking QUIC here: https://github.com/envoyproxy/envoy/issues/1193. Can we consolidate issues?

alyssawilk commented 6 years ago

@ggreenway Threading is fine - the code was written for GFE (Google's HTTP proxy) and Chrome, which have compatible threading models with Envoy.

I think I covered the main pain points above. Pain point one being that the code isn't architected well for Envoy APIs (which we'll fix). And pain point two being "external deps" - while we tried to abstract away non-QUIC-core concepts like "how do you implement your IP Address" and "how are alarms managed" into the platforms directory, we really only abstracted things that GFE/Chrome didn't have in common. Given they're both predominantly Google code, I suspect we'll run into many things we haven't yet abstracted enough (like logging). Once we get it building I think the APIs will come fairly quickly, and the really tricky parts of the code (crypto and congestion) shouldn't be a problem.
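
To make the "how are alarms managed" point concrete, here is a minimal sketch of what such a platform seam could look like. Every name in it (`Alarm`, `AlarmFactory`, `ManualAlarmFactory`) is hypothetical and is not the actual QUIC platform API. The only idea it shows is that the protocol core talks to an abstract alarm while each embedder supplies an implementation backed by its own event loop; the toy implementation here is driven by a manually advanced clock.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <map>
#include <memory>
#include <utility>

// Hypothetical platform interface: the protocol core only ever sees this.
class Alarm {
public:
  virtual ~Alarm() = default;
  virtual void set(int64_t deadline_ms) = 0;
  virtual void cancel() = 0;
};

// Hypothetical factory the embedder hands to the protocol core.
class AlarmFactory {
public:
  virtual ~AlarmFactory() = default;
  virtual std::unique_ptr<Alarm> create(std::function<void()> on_fire) = 0;
};

// Toy embedder-side implementation, driven by a manually advanced clock.
// A real embedder would back this with its event loop's timers instead.
class ManualAlarmFactory : public AlarmFactory {
private:
  struct ManualAlarm : public Alarm {
    ManualAlarm(ManualAlarmFactory& owner, std::function<void()> on_fire)
        : owner_(owner), on_fire_(std::move(on_fire)) {}
    ~ManualAlarm() override { cancel(); }

    void set(int64_t deadline_ms) override {
      assert(deadline_ms >= 0);
      cancel(); // Re-arming replaces any earlier deadline.
      it_ = owner_.pending_.emplace(deadline_ms, this);
      armed_ = true;
    }

    void cancel() override {
      if (armed_) {
        owner_.pending_.erase(it_);
        armed_ = false;
      }
    }

    ManualAlarmFactory& owner_;
    std::function<void()> on_fire_;
    std::multimap<int64_t, ManualAlarm*>::iterator it_;
    bool armed_ = false;
  };

  // Armed alarms ordered by deadline.
  std::multimap<int64_t, ManualAlarm*> pending_;

public:
  std::unique_ptr<Alarm> create(std::function<void()> on_fire) override {
    return std::make_unique<ManualAlarm>(*this, std::move(on_fire));
  }

  // Fires every armed alarm whose deadline is <= now_ms, earliest first.
  void advanceTo(int64_t now_ms) {
    while (!pending_.empty() && pending_.begin()->first <= now_ms) {
      ManualAlarm* alarm = pending_.begin()->second;
      pending_.erase(pending_.begin());
      alarm->armed_ = false; // Disarm before the callback so it may re-arm.
      alarm->on_fire_();
    }
  }
};
```

With a seam like this, the core never needs to know whether the timer comes from GFE, Chrome, or Envoy; only the factory implementation changes.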

Other than that, well, almost all the QUIC code is open sourced, but not all. I think our UDP proxying is GFE-specific enough that we may want to reimplement it from scratch, though we may have some useful utils and definitely will have input on how things can go wrong :-P. We'll also need to open source a bunch of the perf optimizations the current toy server didn't need, like reading from rx_ring with Berkeley Packet Filters and writing to raw sockets, so perf won't suck for bandwidth-heavy users.

alyssawilk commented 6 years ago

@mattklein123 your call. I liken the two to "TCP proxying" and "HTTP termination" so I think they're different enough to warrant different tracking bugs. I also think they can be done substantially in parallel - I'm hoping @cmluciano will actually pick up UDP proxying while mpw@google/@juvexp will probably do the bulk of this one.

mattklein123 commented 6 years ago

@alyssawilk if you think the issues are different no problem we can keep both.

juvexp commented 6 years ago

Sorry for the lack of updates. I worked on this only a little bit since the bug was filed:

  1. Created a personal fork: https://github.com/juvexp/envoy
  2. Wrote a script to export QUIC code from Google to my personal fork.
  3. Added the implementation for a few QUIC platform APIs.

I plan to get back to the QUIC platform implementation in the next week; hopefully I can get it done by the end of this quarter. I'll send another update after that.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

ggreenway commented 6 years ago

@juvexp Any updates on your progress?

rpaulo commented 6 years ago

Will this work involve supporting Google QUIC or IETF QUIC? Since QUIC is being standardized by the IETF, I think it's more beneficial to spend time on IETF QUIC.

bmetzdorf commented 6 years ago

I agree, we should be doing IETF QUIC.

mattklein123 commented 6 years ago

The plan is to support both Google QUIC and IETF QUIC. Google QUIC is the fastest path to something that is production ready while continuing to work towards the IETF standard.

vinothchandar commented 6 years ago

Hi.. is someone working on bringing QUIC support to envoy? I saw a "help wanted" tag on https://github.com/envoyproxy/envoy/issues/1193 as well.. Just want to understand the state of things.

alyssawilk commented 6 years ago

AFAIK the Google-QUIC team that had picked this up got diverted from working on it, but they are hoping to get back to it and land QUIC-Envoy integration in Q1. That said, I wouldn't say that's anywhere near guaranteed - if you have interest or cycles, help would definitely be appreciated :-)

vinothchandar commented 6 years ago

Thanks for the prompt response.. Is there any POC from Google's side that can give us more context? For e.g, if this needs changes to Chromium. (we contributed a small reverse proxy to chromium quic and had to move around some classes to make way)

Definitely love to contribute, once we flesh out any dependencies :)

alyssawilk commented 6 years ago

Well, I'm the one who generally pings the QUIC team when there are questions, but it's @ianswett and @RyanAtGoogle who are actually deciding who does what.

For contributions, I think we're going to want to reuse the google-QUIC code where we can for congestion control and crypto (because there are so many subtle things to get wrong and it'd be nice to not repeat mistakes!), but AFAIK no one ever got around to open sourcing the internal proxy implementation, so if you're interested in doing QUIC work that'd be a great place to start. I'd be happy to chat about how that work might go when/if you have cycles, as I have quite a bit of context.

RyanTheOptimist commented 6 years ago

As @alyssawilk alluded to, we're currently working to take Google's QUIC code which is shared between Chromium and our internal proxy and export it to a stand-alone repository where it should be easier to consume for other projects. This repository will be DEPS'd into chromium (meaning we'll pull the contents of this repo directly instead of having duplicate or slightly modified files in Chromium). Consumers of this stand-alone repo (of which Chromium will be one) will need to provide their own "platform impl" for the various platform-specific QUIC dependencies. https://cs.chromium.org/chromium/src/net/third_party/quic/platform/impl/ This includes things like socket and clock abstractions, various compiler configurations, logging, etc. It will not require consumers to pull in all sorts of chromium dependencies.
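
As a rough illustration of a consumer-provided "platform impl", here is a sketch of one way a clock abstraction could be wired at compile time. The namespaces and names (`embedder::SteadyClockImpl`, `transport::Clock`, `idleTimedOut`) are made up for this example and are not taken from the repository; in a real layout the three pieces would live in separate headers rather than one file.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// --- Would live in the consumer's platform/impl header (hypothetical). ---
namespace embedder {
class SteadyClockImpl {
public:
  // Monotonic time in microseconds, taken from the consumer's chosen source.
  int64_t nowMicros() const {
    using namespace std::chrono;
    return duration_cast<microseconds>(steady_clock::now().time_since_epoch())
        .count();
  }
};
} // namespace embedder

// --- Would live in the shared platform/api header (hypothetical). ---
namespace transport {
// The shared code names only transport::Clock; the alias is the single
// place that binds it to whatever the consumer supplied.
using Clock = embedder::SteadyClockImpl;

// --- Shared protocol code: depends on the alias, never on the impl. ---
inline bool idleTimedOut(const Clock& clock, int64_t last_activity_us,
                         int64_t timeout_us) {
  assert(timeout_us > 0);
  return clock.nowMicros() - last_activity_us >= timeout_us;
}
} // namespace transport
```

A different consumer would ship its own impl header with the same member function, and the shared code would compile against it unchanged.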

Hope that helps...

conqerAtapple commented 6 years ago

Has anyone looked at https://github.com/ngtcp2/ngtcp2 ? Their event loop model fits envoy so might be a natural fit.

vinothchandar commented 6 years ago

@alyssawilk Thanks. agree on reusing chromium impl as much as possible for the same reasons. On flip side, something like ngtcp2 is very promising but not sure how production hardened it is, at this point. Given envoy is used by so many, it may be a lil premature to pull it into envoy and certify support.

@RyanAtGoogle The standalone repo makes total sense.. I remember you moved a lot of code already for our chromium proxy work. Is there more to be done there? In other words, is it just moving it to a new repo, or is there more refactoring work that needs to be carefully orchestrated by Google engineers to ensure the internal proxy continues to work? If the latter, then it may not be easy for us (outside of Google) to get started on this work. Thoughts?

conqerAtapple commented 6 years ago

How do you compare "production ready" between refactored(untested) code vs something like ngtcp2?

RyanTheOptimist commented 6 years ago

@vinothchandar The primary work remaining is to remove a whole pile of minor diffs between the internal and external versions of the code which have crept in over the years. This is a raft of CLs like:

https://chromium-review.googlesource.com/c/chromium/src/+/1273814

Once that's complete it's mostly a simple matter of just moving the files to a new repo. We do not believe there is any careful refactoring to be done as part of this effort like there was when trying to add reverse proxy support to the toy server.

@conqer I'm not sure if your question was about the Google QUIC repo or something else. If it was about the Google QUIC repo, that will contain the core QUIC implementation which is used for Chrome (and various Google apps which use Chrome's networking library) as well as our internal proxy. So this is very much tested production code.

mattklein123 commented 6 years ago

@conqer @vinothchandar I'm out for two weeks but sorting out a firm plan for QUIC is high on my priority list when I get back. If you would like to be involved, please send me an email at mklein@lyft.com so that we can coordinate and we can learn how you might want to contribute.

From my perspective, both https://github.com/ngtcp2/ngtcp2 and https://github.com/h2o/quicly are on the table as viable options (nothing has been firmly decided yet). Everything will be hidden behind an interface no matter what, so Google's code can be swapped in if it can't be extracted into a library in the time that we want to get this done.

As a pre-requisite to this work I'm probably also going to start helping out on basic UDP proxy support since no progress has been made on that. (https://github.com/envoyproxy/envoy/issues/492)

conqerAtapple commented 6 years ago

@mattklein123 sent you email.

vinothchandar commented 6 years ago

@RyanAtGoogle that sounds promising.. Looks like we can try our hand at this then. If we at least have a PoC version working with Chromium, then it may not be as bad to redo it with the new repo/independent lib. Correct me if I am missing something.

@mattklein123 yes will send you an email to coordinate. It'd be awesome if you can also get @alyssawilk looped in. I have reasonable handle on the chromium code base for this, but total newbie to envoy. So appreciate all your help.

mattklein123 commented 6 years ago

> @mattklein123 yes will send you an email to coordinate. It'd be awesome if you can also get @alyssawilk looped in. I have reasonable handle on the chromium code base for this, but total newbie to envoy. So appreciate all your help.

Yup we won't do anything in isolation of the Google folks as they are an integral part of this. I'm going to meet with Google right when I get back and then we will pull together a larger meeting with interested folks and get work assigned.

Thank you everyone for being willing to help!

RyanTheOptimist commented 6 years ago

@vinothchandar Oh, that'd be exciting! If you're able to cook up a PoC based on the chromium code, I would imagine it would be quite easy to convert it to use the new repo. If you start on this effort, please let me know if I can help out...

vinothchandar commented 6 years ago

Will give it a shot and def would need your help, along the way. :) Will loop you in once we have a basic plan.

mattklein123 commented 6 years ago

@vinothchandar just to set expectations, this is too big of a feature to go off and implement and expect to have it merged without prior agreement on the plan. So please circle back before you do too much work. As I said I hope to pull together a larger meeting in a few weeks for all interested parties where we can develop a firm plan. Thank you!

mattklein123 commented 5 years ago

Alright folks, here is a small update after a meeting that I had with the Google QUIC folks today:

For now I'm going to do project management for this feature unless someone else wants to step up. Some general logistics as this is our first substantial feature that is going to span multiple organizations.

I will go ahead and start populating the board with issues that we already have opened. I think in the short term the main work streams are:

  1. Basic UDP work
  2. FD refactor
  3. QUICHE library (external to this project currently)

Please let me know if I missed anything @envoyproxy/quic-dev!

conqerAtapple commented 5 years ago

Is there a work item/plan for the abstraction layer for QUIC? This is so that we can hook in different implementations of QUIC.

mattklein123 commented 5 years ago

> Is there a work item/plan for the abstraction layer for QUIC? This is so that we can hook in different implementations of QUIC.

I think this falls under the "Envoy QUIC listener, HCM, filter stack" item in which we will get an MVP of QUIC/TLS/H2 termination working. As part of this work we will create an interface much like the existing HTTP codec to hide the actual QUIC code from the rest of Envoy, which should make it theoretically possible to swap in a different QUIC implementation if desired.
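
For a sense of what "hiding the actual QUIC code behind an interface" could mean in practice, here is a minimal sketch. It is not Envoy's codec API: `QuicCodec`, `CodecRegistry`, and the two stub implementations are hypothetical names invented for illustration. The rest of the proxy would hold only a `QuicCodec` pointer, and the concrete implementation would be picked by name.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical seam between the proxy and any QUIC implementation.
class QuicCodec {
public:
  virtual ~QuicCodec() = default;
  virtual std::string name() const = 0;
  // Feeds one UDP payload to the implementation; returns bytes consumed.
  virtual std::size_t onDatagram(const std::string& payload) = 0;
};

// Maps an implementation name (for example, from config) to a factory.
class CodecRegistry {
public:
  using Factory = std::function<std::unique_ptr<QuicCodec>()>;

  void add(const std::string& name, Factory factory) {
    factories_[name] = std::move(factory);
  }

  // Returns nullptr when no implementation is registered under that name.
  std::unique_ptr<QuicCodec> create(const std::string& name) const {
    auto it = factories_.find(name);
    if (it == factories_.end()) {
      return nullptr;
    }
    return it->second();
  }

private:
  std::map<std::string, Factory> factories_;
};

// Stand-ins for two different QUIC libraries; they only count bytes.
class StubCodecA : public QuicCodec {
public:
  std::string name() const override { return "stub_a"; }
  std::size_t onDatagram(const std::string& payload) override {
    return payload.size();
  }
};

class StubCodecB : public QuicCodec {
public:
  std::string name() const override { return "stub_b"; }
  std::size_t onDatagram(const std::string& payload) override {
    return payload.size();
  }
};
```

Swapping implementations then comes down to registering a different factory; callers that only use `QuicCodec` do not change.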

conqerAtapple commented 5 years ago

Great! Thanks for setting this up Matt.

alyssawilk commented 5 years ago

Matt: updated owners inline on your comment above. I'm hoping in the medium term Mike or Dan will own project management but I'm happy to help you on the project management front as they skill up in Envoy

conqerAtapple commented 5 years ago

Is there a design doc or API outline for the abstraction layer? I assume the plan right now is to have some shape of the API visible by end of Milestone 2. Also assuming that we can only plan on hooking another QUIC implementation after that milestone.

mpwarres commented 5 years ago

I think the plan for the first cut is that at the lowest layer, there is the UDP listener being worked on in #4898, on top of which sits an HCM-like layer that speaks QUIC and exports an HTTP codec interface upwards.

mattklein123 commented 5 years ago

@conqerAtapple I think a QUIC codec interface will fall out from the MVP work when we pull it all together. Since I'm guessing the QUIC code you want to plug in is not going to be public, the best thing to do will be to collaborate on those reviews to figure out the right integration points that will work.

mpwarres commented 5 years ago

Here's a doc providing a somewhat more concrete design sketch, and (no doubt incomplete) enumeration of sub-issues to be resolved.

vinothchandar commented 5 years ago

Looks promising! Thanks for sharing!

Don't know a lot about envoy code, so can't add anything very useful (FWIW structuring seems similar to the chromium reverse proxy). Just left one comment on connection migration.

mpwarres commented 5 years ago

Here is a spreadsheet that we are using to coordinate work porting the QUICHE platform impl (i.e. platform abstraction layer).

frcai commented 4 years ago

How is QUIC support in Envoy going? Do we have a rough timeline for a beta release?

danzh2010 commented 4 years ago

v1 depends on https://github.com/envoyproxy/envoy/pull/8496, and v2 ongoing. V2 with packet tossing and BPF support is expected to be done by end of Dec.

frcai commented 4 years ago

> v1 depends on #8496, and v2 ongoing. V2 with packet tossing and BPF support is expected to be done by end of Dec.

#8496 is already merged; can we try v1 now? Is there any document we can refer to for enabling QUIC in Envoy?

danzh2010 commented 4 years ago

@wu-bin is working on doc/instruction/sample config of turning on QUICHE in Envoy. One thing I need to point out is that we don't support cert verification yet. The proof source/verifier is a fake one. @bencebeky will add actual impl soon. I would add an integration test for multi-worker to verify our implementation also works with SO_REUSEPORT https://github.com/envoyproxy/envoy/pull/8884 within 1 or 2 weeks.
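
For readers unfamiliar with SO_REUSEPORT, here is a small sketch of the socket-level mechanism, assuming Linux and a loopback address. It is not Envoy's listener code, and `openReusePortUdpSocket` is a hypothetical helper name. Each worker would open its own UDP socket on the same port, with the option set before `bind()`.

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>

// Opens a UDP socket on 127.0.0.1:port with SO_REUSEPORT set before bind().
// Passing port == 0 lets the kernel pick a port. The port actually bound is
// written to *bound_port. Returns the socket fd, or -1 on failure.
int openReusePortUdpSocket(uint16_t port, uint16_t* bound_port) {
  assert(bound_port != nullptr);
  int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) {
    return -1;
  }

  // Must be set on every socket that wants to share the port.
  int one = 1;
  if (setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)) != 0) {
    close(fd);
    return -1;
  }

  sockaddr_in addr{};
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port = htons(port);
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }

  // Read back the port in case the kernel chose it.
  socklen_t len = sizeof(addr);
  if (getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len) != 0) {
    close(fd);
    return -1;
  }
  *bound_port = ntohs(addr.sin_port);
  return fd;
}
```

On Linux the kernel then spreads incoming datagrams across the sockets sharing the port, which is why a multi-worker test is worth having: packets for one QUIC connection need to keep reaching the worker that owns that connection.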

danzh2010 commented 4 years ago

As QUIC integration is functionally working, we are trying to list out all the things that need to be done before we can run this feature in production. I created a full list here, among which I think below are the blockers which we definitely need to add before QUIC runs in production:

  1. UdpListener switches from recvmsg to recvmmsg.
  2. Tearing down QUIC listener gracefully: GOAWAY and ConnectionClose before destroying UdpListener.
  3. Implement platform API QUIC_SERVER_HISTOGRAM_ENUM, and move some varz in QuicDispatcherStatsCollection to QUICHE platform API, ie: b/140720701
  4. Restore QUIC test coverage in Envoy
  5. Real certs verification (@bencebeky )
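
As background for the first item, `recvmmsg` is a Linux system call that reads several datagrams in one call, where `recvmsg` reads one. The sketch below is a standalone illustration and not Envoy's `UdpListener` code: `readBatch` is a hypothetical helper, and the 1500-byte buffer size is an assumption made for the example.

```cpp
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/uio.h>

#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Reads up to max_packets datagrams from fd with a single recvmmsg() call.
// Blocks until at least one datagram is available (MSG_WAITFORONE), then
// takes whatever else is already queued. Linux-specific.
// Returns the payloads in arrival order; empty on error.
std::vector<std::string> readBatch(int fd, unsigned int max_packets) {
  assert(max_packets > 0);
  constexpr std::size_t kMaxDatagram = 1500; // Assumed per-packet buffer size.

  std::vector<std::vector<char>> buffers(max_packets,
                                         std::vector<char>(kMaxDatagram));
  std::vector<iovec> iovecs(max_packets);
  std::vector<mmsghdr> msgs(max_packets); // Value-initialized to zero.

  // One iovec and one message header per receive slot.
  for (unsigned int i = 0; i < max_packets; ++i) {
    iovecs[i].iov_base = buffers[i].data();
    iovecs[i].iov_len = buffers[i].size();
    msgs[i].msg_hdr.msg_iov = &iovecs[i];
    msgs[i].msg_hdr.msg_iovlen = 1;
  }

  int received = recvmmsg(fd, msgs.data(), max_packets, MSG_WAITFORONE, nullptr);

  std::vector<std::string> payloads;
  for (int i = 0; i < received; ++i) {
    // msg_len holds the number of bytes written into slot i.
    payloads.emplace_back(buffers[i].data(), msgs[i].msg_len);
  }
  return payloads;
}
```

The point of batching is fewer system calls per packet under load; a production version would also capture the peer address for each datagram and reuse its buffers across calls.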

And below is what's good to have but not blocking us from trying in production:

mattklein123 commented 4 years ago

Awesome @danzh2010, thanks for the summary. Very exciting stuff. The only thing I would add is that once we feel that the code is alpha/MVP ready, we need to do a doc pass and actually add documentation on how to configure/operate/etc. I can work with you offline and help with that as needed. Thank you!

danzh2010 commented 4 years ago

Here are the remaining action items which are blockers to run in production:

And below are good to have features for production debugging:

  1. Implement platform API QUIC_SERVER_HISTOGRAM_ENUM, and open source QUICHE internal stats to platform API.
  2. Implement QUIC feature flags using Envoy::Runtime to support flipping flags at runtime. (@nezdolik)

ggreenway commented 4 years ago

Tracking remaining issues with label quic-mvp: https://github.com/envoyproxy/envoy/issues?q=is%3Aopen+is%3Aissue+label%3Aquic-mvp

alyssawilk commented 3 years ago

Given QUIC alpha, I'm going to close this off as done - folks can follow along the mvp list https://github.com/envoyproxy/envoy/labels/quic-mvp