envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Kafka protocol filter #2852

Closed mattklein123 closed 4 years ago

mattklein123 commented 6 years ago

It's looking like Lyft may be able to fund Kafka protocol support in Envoy sometime this year.

Community, can you please chime in on what you would like to see? I know this will be a very popular feature. Stats (as in the Mongo filter) are a no-brainer. What else? Eventually routing and load balancing for languages in which the Kafka client drivers are not as robust?

gwenshap commented 6 years ago

First, this is totally awesome. Second, as someone with Kafka experience but rather new to service meshes, I have two types of suggestions: some additional features I'd want to use if I had a Kafka proxy, and a very unbaked suggestion for a different way to look at Kafka/mesh integration that I'd like to discuss.

Let's start with additional features (not all are my idea, folks from Confluent helped!):

Here's the other point of view: Kafka isn't just a service, "kinda like a database"; Kafka is a way of sending messages from one service to another. Kinda like a transport layer, except async and persistent. I wonder if Envoy can integrate with Kafka even deeper and allow services to use Kafka to communicate with other services (instead of REST, gRPC, etc.). And then you can "hide" the transition from REST to Kafka communication in the same way Lyft used Envoy to move to gRPC. Not sure if this makes total sense, since async programming is different, but it's worth mulling over.

travisjeffery commented 6 years ago

This would be dope. A couple of use cases to start are request logs and stats. You could also build a nice audit log by taking your request logs and enriching them with the user's info. This could also help people write their own Kafka filters, adding features like upconverting old Kafka clients to newer protocol versions.

mattklein123 commented 6 years ago

Thanks @gwenshap those are all great ideas. Would love to discuss more. If Confluent is potentially interested in helping with this (even if just design) can you or someone else reach out to me? My email address is easy to find or you can DM me on Twitter to connect.

mattklein123 commented 6 years ago

Other interesting ideas that come to mind:

theduderog commented 6 years ago

@mattklein123 Do you mind explaining the primary use case you had in mind? Would this be a "Front Envoy" that might be used for ingress into Kubernetes? Or would a side car proxy pretend to be all Kafka brokers to local clients?

By "Front Envoy", I mean something like slide 15 in your deck here.

wushujames commented 6 years ago
alexandrfox commented 6 years ago

Awesome that you guys are looking into Kafka protocol support, that'd be an amazing feature to have! +1 to @gwenshap's and @wushujames's ideas, also:

sdotz commented 6 years ago

Here are some ideas I would find useful (some already mentioned)

mbogoevici commented 6 years ago

Between @mattklein123 @gwenshap and @wushujames this is an awesome list of features.

As a general question, particularly for Matt: would you see any value in capturing some of the more generic features and turning them into a higher-level abstraction for messaging support in the service mesh?

sdotz commented 6 years ago

Perhaps also look at some of what kafka-pixy does. I find the wrapping of Kafka's native protocol in REST/gRPC to be pretty compelling. This better supports usage from FaaS and apps that don't necessarily have the ability to maintain a long-lived connection.

rmichela commented 6 years ago

I'd like to see Envoy's Zipkin traces reported to Zipkin using Zipkin's Kafka collector.

mattklein123 commented 6 years ago

Thanks everyone for the awesome suggestions that have been added to this issue. From Lyft's perspective, we are primarily interested in:

So I think this is where we will focus, probably starting in Q3. I will need to go through and do some basic SWAGing in terms of how much existing code in https://github.com/edenhill/librdkafka can be reused for the protocol parsing portion. We will also coordinate with folks at Confluent on this work as well. Please reach out if you are also interested in helping.

ebroder commented 6 years ago

Are there any plans at this point for how to practically proxy the Kafka protocol to a pool of brokers? In general, clients connect to a seed node and send it a "metadata" request for the topic/partition they're interested in. The response to that includes a hostname and port, which clients then connect to directly. It means that in practice Kafka clients are (by design) very good at dis-intermediating proxies.
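To make the dis-intermediation concrete, here is a toy model of the bootstrap/metadata flow described above. This is plain Python with made-up data structures, not the real Kafka wire protocol; the hostnames and topic names are illustrative.

```python
# Hypothetical cluster metadata a seed broker would return.
CLUSTER_METADATA = {
    "brokers": {
        1: ("kafka-1.internal", 9092),
        2: ("kafka-2.internal", 9092),
    },
    "topic_partitions": {
        ("orders", 0): 1,  # partition 0 of "orders" is led by broker 1
        ("orders", 1): 2,
    },
}

def metadata_request(topic, partition):
    """What a seed node answers: the *direct* address of the partition leader."""
    leader_id = CLUSTER_METADATA["topic_partitions"][(topic, partition)]
    return CLUSTER_METADATA["brokers"][leader_id]

# The client may have connected to a proxy acting as the seed node, but the
# response tells it to connect straight to the broker, bypassing the proxy.
host, port = metadata_request("orders", 1)
print(host, port)  # kafka-2.internal 9092
```

Since every subsequent produce/fetch goes to the address returned here, any proxying scheme has to either control these advertised addresses or rewrite them in flight.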

gwenshap commented 6 years ago

@ebroder One way to do it would be to register the proxy address (probably localhost:port if we are using a sidecar) as the brokers' advertised.listeners. Then the brokers will return this address to the clients. In the latest release, advertised hosts will be a dynamic property, so this may become even easier to manage.
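A minimal sketch of what that could look like in a broker's server.properties, assuming a sidecar listening on localhost:19092 (the listener names, addresses, and ports here are illustrative assumptions):

```properties
# The broker accepts connections on its real interface...
listeners=PLAINTEXT://0.0.0.0:9092
# ...but advertises the local sidecar address, so metadata responses
# point clients at the proxy instead of at the broker directly.
advertised.listeners=PLAINTEXT://localhost:19092
```

The trade-off, as discussed below, is that such a cluster only works for clients that actually have the sidecar running at the advertised address.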

wushujames commented 6 years ago

@gwenshap: Interesting. That would imply a Kafka cluster that would only work with the sidecars then, right?

gwenshap commented 6 years ago

I didn't mean to imply that. That's why I said "one way". I know @travisjeffery and @theduderog have ideas about central proxies. Sidecars do seem to be Envoy's main mode of deployment.

ebroder commented 6 years ago

That does require allocating a sidecar port for every Kafka broker you're running, right? It seems like the overhead/management costs there could add up quickly.

gwenshap commented 6 years ago

I'm not sure? How expensive are ports? Kafka clusters with over 50 brokers are quite rare.

mattklein123 commented 6 years ago

@ebroder @wushujames @gwenshap TBH I really have not gotten into the details yet. If the Kafka protocol does not support a built-in method of proxying (we should discuss), I think there are a few options:

But again I haven't done any investigation. I was going to carve out some time to learn more about all of this in Q2 and potentially find some people who would like to help me learn more about it. :)

wushujames commented 6 years ago

@mattklein123 @gwenshap @ebroder: Yeah, I had the same idea as Matt's option 3. Since the initial request to the brokers has to flow through the sidecar anyway, it can intercept and rewrite the response back to the client, and transform the requests/responses as they flow between client and broker. Sounds expensive to me, but I know very little about Envoy's performance.
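A rough sketch of that interception idea in plain Python. The data shapes are illustrative, not the real Kafka Metadata response format, and the broker-to-port mapping is a hypothetical piece of sidecar configuration:

```python
# Sidecar's mapping from real broker addresses to its own local listener
# ports (note: one local port per broker, as discussed above).
BROKER_TO_LOCAL_PORT = {
    ("kafka-1.internal", 9092): 19092,
    ("kafka-2.internal", 9092): 19093,
}

def rewrite_metadata_response(response):
    """Replace each advertised broker address with the sidecar's own
    localhost listener, so clients connect back through the proxy."""
    rewritten = []
    for broker_id, host, port in response["brokers"]:
        local_port = BROKER_TO_LOCAL_PORT[(host, port)]
        rewritten.append((broker_id, "localhost", local_port))
    return {"brokers": rewritten}

original = {"brokers": [(1, "kafka-1.internal", 9092),
                        (2, "kafka-2.internal", 9092)]}
print(rewrite_metadata_response(original))
```

Everything except the Metadata (and FindCoordinator-style) responses could pass through untouched; only the advertised addresses need rewriting.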

ilevine commented 6 years ago

@mattklein123 @gwenshap @ebroder @wushujames: take a look at https://medium.com/solo-io/introducing-gloo-nats-bring-events-to-your-api-f7ee450f7f79 and https://github.com/solo-io/envoy-nats-streaming - we created a NATS filter for Envoy and would love to get your thoughts.

AssafKatz3 commented 6 years ago

As @wushujames mentioned:

automatic topic name conversion to/from a cluster. Like, an app would publish to topic foo, and it would actually go to topic application.foo. This would allow multi tenant clusters, but the application would think they have the whole namespace to themselves.

This would be very useful for canary releases or blue/green deployments, since it would allow modifying the actual topic without any change in the application.
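The topic-rewriting idea above could be sketched as follows. The tenant prefix and canary suffix conventions here are hypothetical, just to show the shape of the mapping:

```python
def rewrite_topic(topic, tenant="application", canary=False):
    """Map the topic name the app sees onto the physical cluster topic.
    The app publishes to `foo`; the proxy routes it to `application.foo`
    (or a canary variant) without the app knowing."""
    physical = f"{tenant}.{topic}"
    if canary:
        physical += ".canary"
    return physical

print(rewrite_topic("foo"))               # application.foo
print(rewrite_topic("foo", canary=True))  # application.foo.canary
```

The proxy would apply this in both directions: on produce/fetch requests going to the broker, and on the topic names embedded in responses coming back.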

georgeteo commented 6 years ago

@mattklein123: There have been a lot of requests in this thread. Will there be a design doc with a list of which requested features will be supported?

mattklein123 commented 6 years ago

@georgeteo yes when I start working on this (unsure when) I will provide a design doc once I do more research.

alanconway commented 6 years ago

This may have some structural similarities to the AMQP support I'm working on in #3415. It will probably also need to use upstream filters (#173). I'm raising this so we can watch for opportunities to reuse/co-operate on common infrastructure features in Envoy that support both cases.

stale[bot] commented 6 years ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions.

chenzhenyang commented 6 years ago

Has this already been solved?

miketzian commented 6 years ago

Lots of great points above, I'd add that ideally the service discovery should be able to pick broker/topic combinations based on client requests for data - so that a consumer/producer could ask envoy for a "log publisher" (or whatever) and be directed to the closest brokers + their configured topic.

This is potentially very powerful, keen to see this!

RomansWorks commented 6 years ago

Would it be possible to provide producer backpressure support over Kafka in the proxy? I'm thinking of something like: Config (per producer/topic pair): a list of consumer groups for lag calculation, plus a lag limit. Runtime: if the sum of lag across the configured consumer groups exceeds the lag limit, prevent producing.
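The runtime check described above could be sketched as below. In practice the per-group lag values would come from offset queries against the brokers; here they are supplied directly, and all names are illustrative:

```python
def allow_produce(lags_by_group, watched_groups, lag_limit):
    """Permit produce requests only while the total lag across the
    configured consumer groups stays under the configured limit."""
    total_lag = sum(lags_by_group.get(g, 0) for g in watched_groups)
    return total_lag < lag_limit

# Example: backpressure a producer when its downstream consumers fall behind.
lags = {"billing": 1200, "audit": 300, "other": 9999}
print(allow_produce(lags, ["billing", "audit"], lag_limit=2000))  # True
print(allow_produce(lags, ["billing", "audit"], lag_limit=1000))  # False
```

A real implementation would also have to decide how to reject a produce request at the protocol level (e.g. with a retriable error code) so that well-behaved clients back off rather than fail.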

kumarsubodh commented 6 years ago

Something on the authorization side. We were excited to find the sidecar enabling complex authorization patterns at the message level: Service A can only consume this message from Service B if and only if.... With Kafka moving messages between services, we lose the advantage of the sidecar, as Kafka becomes responsible for deciding which message can be consumed by which service. Kafka authorization is pretty coarse (topic level, feel free to correct me). If a Kafka proxy can help with fine-grained authorization, Kafka can fit as the messaging layer between services without losing the benefits of the sidecar. Is it too late to consider this? We are debating whether to keep Kafka as the messaging layer (asynchronous communication) or switch to synchronous communication due to this limitation.

sirajmansour commented 6 years ago

@mattklein123 dumping incoming traffic to a kafka stream is a killer feature, we're looking for something like this as a collector for our analytics platform.

adamkotwasinski commented 6 years ago

@mattklein123 I think we could consider pure-Envoy installations (one proxy in front of the whole cluster) and Istio-Envoy installations (or, to be more generic, one proxy in front of each Kafka broker) separately.


With a pure Envoy installation, we have the proxy sitting in front of the whole Kafka cluster, with the client being aware only of the proxy's address (which makes it similar to the current Redis filter).

bootstrap.servers = envoy-host:9092

In such a case it is necessary to decide how much of the underlying Kafka cluster we want to expose to end clients (and what the implications would be).

A Kafka client connects to multiple brokers, but it would have to do that through the proxy, which means that the proxy itself would need to re-implement a lot of features that are already provided by the Kafka brokers:

Obviously, generic inspection of Kafka payloads (see below) would provide extra features.


In the case of deploying Envoy as a sidecar per Kafka broker it gets a bit different: we do not need to amend the data sent by the original client (it is already reaching the correct brokers, so it's valid from Kafka's point of view), but we can still get a lot if we are capable of decoding/encoding payloads:

The same also applies to the producer side, as we would know which client instance talks to which topics etc., due to the above. Obviously, if we are capable of deploying Envoy on both sides of the channel, then we can have TLS termination etc.

mattklein123 commented 6 years ago

@adamkotwasinski agree with your write up. I loosely plan on handling both of those cases.

FWIW, I'm taking a 2 month leave this winter where I plan on doing some part time coding and I'm going to look at getting started at implementing this.

mattklein123 commented 6 years ago

(Assuming no one else wants to start development sooner)

adamkotwasinski commented 5 years ago

(continuing the minimal sidecar-per-broker idea) Assume we have a Kafka read/write filter that can process both incoming data (requests sent to the broker by clients) and outgoing data (responses).

Would it then be possible to use Envoy as an egress proxy sitting as a sidecar next to the client, so we could capture the client's outbound data? (This would give users visibility into client-level metrics, e.g. which machines in particular are creating more load.)

So in the end it would look a bit like this:

[     client machine     ]                     [     broker machine     ]
[ client ] <---> [ envoy ] <-----------------> [ envoy ] <---> [ broker ]
                     |                             |
                     v                             v
                  metrics                       metrics

One thing that troubles me is that if that egress proxy were used for all outbound comms, then we'd need to conditionally activate the filter - is it possible to state something like "activate the Kafka filter if the outbound port == 9092"?
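For what it's worth, one way this could look, sketched against a later-style Envoy listener config. The kafka_broker filter name, the original-destination interception setup, and the cluster name are assumptions here, not something that existed at the time of this comment:

```yaml
# Hypothetical egress listener: with iptables-style interception and
# use_original_dst, a filter chain can be matched on the original
# destination port, so the Kafka filter only runs for port-9092 traffic.
static_resources:
  listeners:
  - name: egress
    address:
      socket_address: { address: 0.0.0.0, port_value: 15001 }
    use_original_dst: true
    filter_chains:
    - filter_chain_match:
        destination_port: 9092
      filters:
      - name: envoy.filters.network.kafka_broker  # filter name as in later Envoy releases
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.kafka_broker.v3.KafkaBroker
          stat_prefix: kafka_egress
      - name: envoy.filters.network.tcp_proxy
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy
          stat_prefix: kafka_tcp
          cluster: original_dst_cluster  # hypothetical ORIGINAL_DST-type cluster
```

Traffic to any other port would fall through to a default filter chain (not shown) with plain TCP proxying.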

solsson commented 5 years ago

Is the Envoy Kafka effort discussed with the Knative community? Pivotal has interesting ideas there on Kafka client as sidecar. Knative depends on Istio.

AssafKatz3 commented 5 years ago

@solsson Do you have more links about it?

solsson commented 5 years ago

Knative's abstraction on top of messaging systems is briefly mentioned in https://github.com/knative/docs/tree/master/eventing#buses, but afaik the implementation is being reworked.

Pivotal are backers of Knative, and maybe their mentions of Kafka were in https://springoneplatform.io/2018/sessions/introducing-knative or https://www.youtube.com/watch?v=_OiGt4QwdlM or https://content.pivotal.io/podcasts/serverless-knative-project-riff-with-mark-fisher-ep-112. Wish I had more info, but I was just guessing that Istio + Kafka could be a topic within the Knative eventing discussions, which is why my comment was only a question :)

AssafKatz3 commented 5 years ago

@solsson I found the document on how to set it up, but it seems to still be a work in progress.

mattklein123 commented 5 years ago

Status here:

@adamkotwasinski is beginning work on a basic sniffing filter that can run in front of a broker. This will include an encoder/decoder, stats, etc. I will help him review/merge that. Then this winter I plan on working on a full L7 client filter in which Envoy itself becomes the only broker the client knows about, and handles everything else behind the scenes.

nippip commented 5 years ago

@mattklein123 @adamkotwasinski What help can you use on this project?

mattklein123 commented 5 years ago

@nippip et al, I think we have a solid plan forward at this point which is starting in https://github.com/envoyproxy/envoy/pull/4950. I'm not sure of the status of that work (cc @adamkotwasinski) but I think the next step is to move that plan forward and start getting stuff merged.

nippip commented 5 years ago

@mattklein123 Prior to #4950's inclusion into Envoy what is the best method for using Istio in a cluster which uses Kafka as the main message broker between the other services in the cluster?

christian-posta commented 5 years ago

That's probably a more appropriate question for the Istio mailing list


nippip commented 5 years ago

Thanks @christian-posta, I will reach out there.

kasun04 commented 5 years ago

@solsson I guess the current implementation of Knative Eventing [1] is primarily about CloudEvents over HTTP (services do not use Kafka as the communication protocol). However, I'm not quite sure whether they would want to use a non-CloudEvents-based mechanism for service-to-service eventing.

[1] https://github.com/knative/docs/tree/master/eventing#architecture

kasun04 commented 5 years ago

@mattklein123 As far as I understand, with the above implementation, the developer experience would be as follows

Is this the pattern that we currently support?

Also, the patterns we use to integrate Kafka with Envoy should be protocol-agnostic, so they can equally be applied to other event-driven messaging protocols such as NATS, AMQP, etc. Should we generalize the patterns used in this implementation as references for other protocols?

joewood commented 5 years ago

So great to read this thread and the thoughts around this...

One thing I would like to see (and something @gwenshap hinted at in her service mesh YouTube video) is using the sidecar as a way to offload a lot of the responsibilities of the existing Kafka client SDK.

Kafka's broker is very simple (by good design). For example, it's the client SDK's responsibility to rebalance consumer groups and decide which client owns which partitions; it's also the client SDK's responsibility to work out which broker to communicate with by requesting metadata. All this responsibility makes the client SDK large and difficult to support across different languages and runtimes. The approach so far has been a fully featured JVM story, with second-class non-JVM languages calling into a C SDK. The downside is that non-JVM clients lag in functionality - they don't have Kafka Streams support, can't be used for Connectors, etc.

Envoy could change that by shifting the architecture and moving some of the client SDK responsibilities into a sidecar. The communication between the Kafka client and the sidecar would be analogous to the existing Kafka REST API protocol. The sidecar would communicate with the broker on behalf of connected consumers and handle rebalances and leader changes. The simpler consumer would poll for updates from the localhost sidecar and consume and produce events in much the same way as the Kafka REST Proxy does today. Active polling indicates a healthy client. Like the existing Kafka protocol, there wouldn't be a need for unsolicited events.

I see some clear advantages with this model:

There are definitely some gaps between Knative Cloud Events (which are more geared towards FaaS) and the Kafka protocol (and the closely related Kafka REST Proxy protocol), but they're not insurmountable.
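One of the client-SDK responsibilities mentioned above, deciding which consumer owns which partitions during a group rebalance, is the kind of logic a sidecar could take over. A minimal round-robin assignment sketch (consumer and topic names are illustrative; real clients support several assignment strategies):

```python
def round_robin_assign(consumers, partitions):
    """Distribute topic partitions across consumer instances in a
    deterministic round-robin order."""
    assignment = {c: [] for c in consumers}
    for i, partition in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment

consumers = ["app-1", "app-2"]
partitions = [("orders", 0), ("orders", 1), ("orders", 2)]
print(round_robin_assign(consumers, partitions))
```

If the sidecar owns this logic, a thin client only needs to poll its localhost endpoint; joining, leaving, and reassignment all happen out of process.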

srinathrangaramanujam commented 5 years ago

Subscribing..

tobiaskohlbau commented 5 years ago

Hello,

I think this issue is the best place to report a regression introduced by #4950. The Python script kafka_generator.py does not work with Python 3 (everything else within Envoy default settings works for me).

https://github.com/envoyproxy/envoy/blob/fa0b52a874b67c857bd314620592d39280767330/source/extensions/filters/network/kafka/protocol_code_generator/kafka_generator.py#L238

return len(list(self.used_fields()))

As far as I found, Python 3's filter() does not return a list, so I've made it work by adding an additional list() conversion. I'm not fluent in Python and don't know if this is a good solution or whether better ways exist. If this is fine, I'm more than happy to prepare my first contribution.

CC @adamkotwasinski
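The underlying Python 2 vs 3 difference: in Python 3, filter() returns a lazy iterator, which has no len(), so code that worked under Python 2 breaks. A standalone illustration of the reported fix (the field list here is made up):

```python
fields = [1, None, 2, None, 3]
used = filter(lambda f: f is not None, fields)

# Under Python 3 this raises TypeError: object of type 'filter' has no len().
try:
    len(used)
except TypeError:
    pass

# The fix described above: materialize the iterator before taking its length.
assert len(list(filter(lambda f: f is not None, fields))) == 3
```

Wrapping in list() is the idiomatic fix when the length is needed; sum(1 for _ in ...) would also work without building the intermediate list.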