envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.65k stars 4.74k forks source link

SRV service discovery support #125

Open mattklein123 opened 7 years ago

mattklein123 commented 7 years ago

Support SRV queries (used by Consul DNS, etc.) so that DNS can return IP/port combinations.

SRV is not supported by getaddrinfo_a so need to integrate with DNS library to make the query directly.

mattklein123 commented 7 years ago

Note, a PR was started in https://github.com/lyft/envoy/pull/517 but it looks like it's not going to get finished. If anyone wants to pick it up please chime in here. cc @0x1997

ludov04 commented 6 years ago

Any update on this? The new service discovery feature on AWS ECS make use of SRV records. If envoy would support this, this would greatly simplify the use of envoy on ECS, no need no manage a consul cluster anymore.

Thanks :)!

ChrisLahaye commented 5 years ago

Any update on this? The new service discovery feature on AWS ECS make use of SRV records. If envoy would support this, this would greatly simplify the use of envoy on ECS, no need no manage a consul cluster anymore.

Thanks :)!

Are you still using Consul as alternative?

davidcpell commented 5 years ago

@ludov04 @ChrisLahaye I'd also be interested to know!

venilnoronha commented 5 years ago

/assign @venilnoronha

grayaii commented 5 years ago

I think this was supposed to be in 1.11.0 release. 1.11.1 just got released and it looks like this has been pushed to 1.12.0.

twreid commented 4 years ago

Is there any progress on this? We are using ECS service Discovery and it uses SRV records.

twreid commented 4 years ago

I actually did some digging and looks like a library c-ares is used for dns resolution and looking at it they do support SRV records. Unless I'm missing something.

ChrisLahaye commented 4 years ago

@twreid could you please let me know when you find a solution. We would also like to use envoy with SRV-based service discovery.

ieugen commented 3 years ago

I'm also looking into using SRV for service discovery with envoy.

descalzi commented 3 years ago

Same here, I need ip/port (SRV records) from Consul. Thanks!

FallenWarrior2k commented 3 years ago

Any particular reason you want to use SRV records for that?

Don't get me wrong, I definitely want to see support for SRV records myself, but Consul and Envoy have a native integration already, so I'm wondering what your reason is for not using that.

‐‐‐‐‐‐‐ Original Message ‐‐‐‐‐‐‐ On Monday, 30 November 2020 18:43, Martin notifications@github.com wrote:

Same here, I need ip/port (SRV records) from Consul. Thanks!

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub, or unsubscribe.

descalzi commented 3 years ago

Just because I think I'll be able to get the records on SRV (and non-Consul services) before I can get the necesary changes in my Consul agent for the whole sidecar thing needed ... :)

twreid commented 3 years ago

For me the main reason is because my company would rather pay AWS for infrastructure than spin up a consul cluster and AWS service discovery uses SRV records.

kylebevans commented 3 years ago

I'm considering working on this, but I'm trying to get a bearing on what the desired direction is at this point. From poring over #517, #6379, #1477, #1835, #4290, #5186, #10361 and the code itself, it seems like the desire is to make the custom resolver interface asynchronous and then possibly hook the current dns resolver into it so that EDS clusters can use dns resolution. Am I reading that correctly? If so, do you still want to keep the custom resolver interface coupled to the dispatcher and use continuation passing/callbacks to achieve async?

The previous discussions seem to point to adding SRV support as a custom resolver once async resolvers are implemented. But I have these thoughts and questions about it:

No rush, I know the project is on break.

mattklein123 commented 3 years ago

I need to spend some time and page all of this back in. I will get back to you next week some time. I would like to see this one happen. Thank you for your interest!

mattklein123 commented 3 years ago

@kylebevans I'm trying to approach this from first principles since this issue has been opened for so long and so many changes have taken place over the years. Here are my general thoughts: 1) Given that SRV records start with '_' and other records don't, it seems reasonable to do SRV resolution in that case. Stepping back though, I'm wondering if we don't need to worry about this at all? Meaning, at the DNS level, if we ask c-ares to do a resolution, can we potentially have a different callback if the returned record(s) happen to be SRV records? Would that be simpler? 2) At this point static_dns already uses LocalityLbEndpoint style configuration. Could we actually just implement SRV resolution directly in the existing static_dns cluster? Meaning, if we happen to get SRV records we handle them, if we don't, we just do what we do today? 3) Do SRV records always return host names vs. IP addresses? Meaning, there will always be a double resolution involved? If so, it might not be worth it to use static_dns. At this point we allow custom cluster types. Would it be better to implement a new extension cluster called srv_dns that is loosely based on static_dns but is custom built for SRV and then also handles the double resolution? 4) > I think config.endpoint.v3.LbEndpoint should also have a priority field added to map to a SRV priority and it could work the same way as a locality priority. What do you think?

I think this would be a pretty big change and I'm not sure it's worth it right now. Can we skip supporting priority in SRV until someone actually needs it? I'm not sure how commonly used it is. If we do end up needing it, I think the way to handle it would still be at the locality level, where we basically create localities for each distinct priority returned in the record set. This might be another data point in favor of a custom cluster type which could eventually do this.

5) > Should SRV priority and weight override configured priority and weight specified on a config.endpoint.v3.LbEndpoint? My opinion is that config should take precedence since someone setting those fields with a SRV address is probably doing it on purpose and wants to override SRV.

Modulo skipping priority in the MVP I agree.

6) > There has been discussion about moving port handling into the dns interface, but the clusters still have to construct Host objects. Would it make it easier on the clusters for resolvers to just return Host objects directly? Maybe make a distinction between Address resolvers and ClusterLoadAssignment resolvers?

Per above, I think what I would recommend is having the DNS interface just expose a new interface for getting SRV responses?

7) > #14310 talks about making DNS resolvers an extension. How does that affect the custom resolver interface and SRV resolution? I'm concerned since I'll be working in my spare time and probably won't be very fast if there is overlap.

My quick take on this is that I don't think we should implement SRV as an extension resolver. I think it could probably just be built into the main resolver interface and implemented using c-ares?

Excited to see this worked on. I think I would recommend a small gdoc with a final proposed design that we can work through once you have looked through things some more? Thank you!

kylebevans commented 3 years ago

For 1 & 2, extending the existing dns resolver interface was the direction taken in venilnoronha's PR at #6379. During code review, a few things came up that seemed to pivot it to using the custom resolver interface instead:

I'm not 100% sure, but I think extending the existing dns resolver interface and also keeping the clusters agnostic is still possible. We could change the dns resolver interface to provide a data structure that would support either SRV or a regular dns request. Right now it provides a list of DnsResponses which the clusters use to create a HostVector. Inside a DnsResponse, only an address and TTL are supported which won't work for SRV.

So to sum up the big decisions:

Moving on to the other points:

  1. The RFC says it must return address records, but I did just try to create a SRV record with IP addresses in a Windows DNS server and it worked. I think we should plan on SRV returning either address records or IP addresses. I think in the normal case, it will return address records and a double resolution will be necessary.

  2. Sounds good to me.

  3. Nice

  4. I think I originally suggested this as a way to achieve agnostic clusters, so it depends on 1 & 2.

  5. I would be happy to create a google doc. Do you already have a Google organization that you keep these kind of docs in? If so, could you create the doc and invite me as an editor? The reason I ask is that Google won't allow moving a doc from one org to another, so if I create it myself then it will forever be separate from your other docs, which has happened to me in the past. If that's not an issue then I will just create the doc.

mattklein123 commented 3 years ago

At this point I think here is my rough thinking: 1) Add SRV support to the existing resolver. For simplicity I think it's probably fine to have an API resolveSRV() which is independent from the other resolution API. We can potentially merge them later if needed. 2) Given this discussion, my feeling is that we should add a new extension cluster called srv_dns. We can factor out code from static_dns and share it if we want, but I think there are enough differences that it should be independent. My thought on how this cluster would work is it should assume that the cluster load assignment is empty, and have its own configuration for an SRV name (or maybe a list of names). Given the response, we could eventually synthesize a CLA with multiple localities (using priority), but in the beginning for the MVP we could have a single locality and just adhere to the weight field. I don't think we need to use the EDS code at all.

I think ^ is pretty straightforward and should get us to where we need to be. I honestly haven't gone back through all the old issues and discussions but given the extension points we have today this makes the most sense to me. In terms of writing up a more fully formed version of this in a gdoc, we don't currently have a hosted drive (we probably should), so for now a doc from your account that is world commentable would be great. Thank you!

kylebevans commented 3 years ago

This makes sense to me. I'll start on the doc, thanks.

kylebevans commented 3 years ago

Hi, here is the design doc: https://docs.google.com/document/d/11yEE9qk_xr8bvdQLPdEuzBM9LpuXRUseNgQIf4y20E0/edit?usp=sharing

I'll await your feedback, but again there is no rush.

mattklein123 commented 3 years ago

Thanks @kylebevans. Very exciting to see this being worked on. I left some comments on the doc but it definitely looks like it's heading in the right direction!

ntgsx92 commented 3 years ago

@kylebevans Hey Kyle, we're currently looking at using Envoy for internal routing which requires adding SRV record support in Envoy. I'm wondering how far have you gone in terms of implementation? I'm also happy to work with you to get a working PR started.

ntgsx92 commented 3 years ago

@mattklein123 It seems like part of the current implementation is to add a new cluster for resolving SRV DNS query. We're currently trying to use Dynamic Forward Proxy filter and would love to use SRV query as well for resolving upstream addresses. It doesn't seem like this issue will fully address this feature request as DFP filter uses its own custom cluster instead.

Does it make sense to create a new custom cluster that resolves SRV query and can cache the results? What's your take in terms of implementation?

mattklein123 commented 3 years ago

In https://github.com/envoyproxy/envoy/issues/125#issuecomment-753748255 I suggest a 2 part change. Part 1 is adding SRV support to the resolver. This part could be configured/used from the DFP code.

kylebevans commented 3 years ago

Hi everybody,

I'm making slow, but steady progress, and thanks for being patient. Part of this is synthesizing a cluster load assignment from SRV that I'm hoping will make it easier for other clusters to use, but it looks like DFP doesn't use a CLA. Like Matt said though, the resolver will support SRV lookups. I'm also working with the c-ares team to support getting the TTL from a SRV lookup, and I have an open PR there that this is kind of dependent on.

jcreixell commented 3 years ago

@kylebevans is there any progress on this one? we at SoundCloud are considering envoy and use SRV for service discovery. We already contributed some code for SRV SD to linkerd 1.x and would be happy to help out / provide feedback if needed

kylebevans commented 2 years ago

I started a new job in May, and I confess I haven't had a lot of time to work on this. I was kind of stuck on the part where I need to convert the resolve API from returning a raw pointer to an ActiveDnsQuery to using a unique_ptr. Changing that means carefully thinking through the lifetime, but it also means the changes are going to affect all DNS queries and not just SRV queries which is scary. I'm pretty sure I know what needs to be done at this point, but it will take a lot of thinking and care.

I'll try to block out some time soon to work on it. I am still committed to finishing it.

jcreixell commented 2 years ago

thanks @kylebevans, please keep us posted :)

SuitespaceDev commented 1 year ago

Anything in particular needing help with this?? Would be very handy for our platform framework.

chobits commented 3 months ago

Hi everyone, I searched through the Envoy documentation and also found this open issue. From what I understand, Envoy's DNS resolution currently does not support SRV resolution, is that right?

I also came across an interesting DNS filter that is currently under development. According to the documentation for the DNS filter, it allows us to configure SRV records. Will these configured SRV records be used solely as fake DNS records for internal use?

mikhainin commented 1 month ago

Hi everybody,

Based, on the design doc and the great job done by @kylebevans, I've opened #35160. Basically, it introduces a new type of cluster (which is configured by an SRV record). I would appreciate if someone can provide some feedback and guide to the following steps