cncf / cluster

CNCF Community Cluster
https://cncf.io/cluster

Service Mesh Performance #115

Closed leecalcote closed 1 year ago

leecalcote commented 5 years ago

Please fill out the details below to file a request for access to the CNCF Community Infrastructure Lab. Please note that access is targeted to people working on specific open source projects; this is not designed just to get your feet wet. The most important answer is the URL of the project you'll be working with. If you're looking to learn Kubernetes and related technologies, please try out Katacoda.

First and Last Name

Lee Calcote

Email

leecalcote@gmail.com

Company/Organization

SolarWinds

Job Title

Head of Technology Strategy

Project Title (i.e., a summary of what you want to do, not the name of the open source project you're working with)

This is a Google Summer of Code project focused on repeatable performance benchmarks of Network Service Mesh, Linkerd, Envoy, and other service meshes - https://github.com/cncf/soc#linkerd-and-envoy.

Briefly describe the project (i.e., what is the detail of what you're planning to do with these servers?)

We are planning to run service mesh / service proxy performance tests, comparing apples-to-apples performance across service meshes and proxies.

Situation: An engineer learns of the architecture and value provided by service meshes. Quite commonly, they are impressed and intrigued. Upon reflection, the question they most commonly ask is: "what overhead does being on the mesh incur?"

Problem: Whenever performance questions are to be answered, the answers are specific to the workload and infrastructure used for measurement. Given this challenge, the Envoy project, for example, declines to publish performance data because such tests can be 1) involved and 2) misinterpreted.

Beyond the need for performance and overhead data under permutations of different workloads (applications) and types and sizes of infrastructure resources, cross-project, apples-to-apples comparisons are also needed in order to facilitate comparison of behavioral differences between service meshes and to inform selection among them. Individual projects shy away from publishing test results of other, competing service meshes. An independent, unbiased, credible analysis is needed.
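The apples-to-apples comparison described above boils down to running the same workload with and without a mesh on the same hardware and comparing latency percentiles. Here is a minimal sketch of the percentile math only; the samples are synthetic for illustration, and a real harness (such as Meshery with a load generator) would collect them from actual runs:

```python
def percentile(samples, p):
    """Nearest-rank percentile (p in [0, 100]) of a list of latencies."""
    ordered = sorted(samples)
    rank = max(0, min(len(ordered) - 1, round(p / 100 * (len(ordered) - 1))))
    return ordered[rank]

def overhead_report(baseline_ms, meshed_ms, percentiles=(50, 90, 99)):
    """Return {percentile: (baseline, meshed, delta)} for each percentile."""
    return {
        p: (percentile(baseline_ms, p),
            percentile(meshed_ms, p),
            percentile(meshed_ms, p) - percentile(baseline_ms, p))
        for p in percentiles
    }

# Synthetic samples: the "meshed" run adds roughly 1 ms per request.
baseline = [1.0 + 0.01 * i for i in range(1000)]
meshed = [2.0 + 0.01 * i for i in range(1000)]
for p, (b, m, d) in overhead_report(baseline, meshed).items():
    print(f"p{p}: baseline={b:.2f}ms meshed={m:.2f}ms overhead={d:.2f}ms")
```

Reporting deltas per percentile (rather than a single mean) matters here, because proxy overhead typically shows up disproportionately in the tail.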

Is the code that you’re going to run 100% open source? If so, what is the URL or URLs where it is located? What is your association with that project?

Yes, 100% open source under Apache v2 - https://github.com/layer5io/meshery. I'm a maintainer of this project.

What kind of machines and how many do you expect to use (see: https://www.packet.com/bare-metal/)?

20 Accelerator x1.small.x86

What OS and networking are you planning to use (see: https://support.packet.com/kb/articles/supported-operating-systems)?

Ubuntu 18.04

Any other relevant details we should know about?

This research project is being done in coordination with the engineering school of UT Austin. We would like to post results on the CNCF blog (assuming this is desired).

This research differs from Linkerd's recent work with Kinvolk in a number of ways, one of which is that more service meshes will be tested.

jacobsmith928 commented 5 years ago

Hi Lee, this is an interesting project. We'll let the CNCF weigh in, but we (Packet) may need to coordinate on resources to ensure testing. Our x1.small is being phased out, so I'm interested in whether you're looking for that specific config or just a broad variety of locations.

dankohn commented 5 years ago

https://github.com/layer5/meshery is not a public repo. Please update.

leecalcote commented 5 years ago

@dankohn apologies. I included an invalid URL when typing out the address by hand. The correct link is https://github.com/layer5io/meshery.

dankohn commented 5 years ago

+1

leecalcote commented 5 years ago

@jacobsmith928 thanks. In truth, I think any config will do the trick, so long as it's uniform (all servers are the same size). Multiple locations are of interest, although, not a must-have.

@gpremsankar @edwarnickle @fkautz @mattklein123 @nicholasjackson @eveld @mandarjog @suryadu @olix0r @thehh1974, do you see this any differently? Any of your input is welcome.

jacobsmith928 commented 5 years ago

Sounds good, we can make available 20 homogeneous machines of various sizes (c1.small, c2.medium, or m2.xlarge). Please update this issue if a distributed setup across datacenters is preferred.

How long will you expect to use the environment?

taylorwaggoner commented 5 years ago

@leecalcote I've created the project and sent you an invitation to join. Please send me the email addresses of anyone else that should be added to the project. thanks!

fkautz commented 5 years ago

For NSM (and most other service meshes), I think a generic config is probably OK. NSM doesn't have any strict requirements for initial testing. Some advanced configs may require specialized NICs that support SR-IOV or DPDK (such as specific Intel or Mellanox cards). I think this is good enough to get started, though. The m2.xlarge has a ConnectX-4, which should be sufficient when the need arises. I don't recall seeing any ConnectX-5 cards in your general environment.

jacobsmith928 commented 5 years ago

Howdy @fkautz, if you require Intel x710 NICs, let us know and we'll try to arrange it via n2.xlarge or other configs which are now being added with Intel NICs instead of Mellanox ConnectX-4s. We do not yet have ConnectX-5s in our public cloud systems.

sb1975 commented 4 years ago

> @leecalcote I've created the project and sent you an invitation to join. Please send me the email addresses of anyone else that should be added to the project. thanks!

Hi, I'm interested in getting involved in this project and in enhancing the benchmarking to include additional use cases. Let me know, or please add me: sudeep.batra@ericsson.com, sudeep.batra@att.com

vielmetti commented 2 years ago

@leecalcote - from the looks of it, this project was set up under the CNCF but no servers were ever provisioned. Can you help me confirm this? thanks Ed

github-actions[bot] commented 2 years ago

Checking if there are any updates on this issue

fkautz commented 2 years ago

Did we ever get results on this? It would be good to link here just for completeness.

idvoretskyi commented 2 years ago

cc @leecalcote on the above ^

leecalcote commented 2 years ago

@fkautz, thank you for following up on this.

@vielmetti, right, no, we never completed (started) this testing. We're ready to commit to doing so now, though. What's the best way to go about getting rolling? // @gyohuangxin

vielmetti commented 2 years ago

@leecalcote if you're up for a call to plan this, I can probably answer a lot of questions in an interactive session faster and more accurately than in a back-and-forth by text. We have a number of 3rd-generation configurations available now as well as "coming soon", and a call would help me sort out some of the best available options. https://calendly.com/evielmetti is my scheduling link. Once we get a spec for what we want/need, I'll coordinate with @idvoretskyi from the CNCF side.

vielmetti commented 2 years ago

Good call today with @leecalcote and crew, and I think there are meeting notes though I don't have the link handy.

The architecture that I think will work is 1x small node as a GitHub self-hosted runner with a long-running setup, plus some on-demand systems (size and scale TBD) where the project would spin up a cluster, install all of the necessary software, launch some tests, report back on those tests when they complete, and tear the whole thing down. By doing on-demand infra rather than long-running, we should be able, at Equinix, to support relatively large short-lived tests.

As a "get it started" effort I suspect we'll also want a small cluster (3x nodes, maybe also of small size) to help bootstrap the effort and get something up that works.
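The spin-up/test/tear-down flow described above can be sketched as a lifecycle with guaranteed teardown. The `DeviceAPI` protocol and its method names below are hypothetical stand-ins for a real provisioning client (e.g. one wrapping the Equinix Metal API); the try/finally shape, which prevents idle machines from being left behind, is the point:

```python
from typing import Protocol

class DeviceAPI(Protocol):
    """Hypothetical provisioning client interface."""
    def create_device(self, plan: str, metro: str) -> str: ...
    def delete_device(self, device_id: str) -> None: ...

def run_ephemeral_benchmark(api: DeviceAPI, plan: str, metro: str,
                            node_count: int, run_tests):
    """Provision node_count machines, run the tests, always tear down."""
    device_ids = []
    try:
        for _ in range(node_count):
            device_ids.append(api.create_device(plan, metro))
        # run_tests receives the cluster's device ids and returns results
        # (in practice: install the mesh, drive load, collect numbers).
        return run_tests(device_ids)
    finally:
        # Teardown runs even if provisioning or the tests fail,
        # so no machines are left running between benchmark runs.
        for device_id in device_ids:
            api.delete_device(device_id)
```

A self-hosted runner would invoke something like this per benchmark job, with the results reported back to the workflow before the machines disappear.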

leecalcote commented 2 years ago

@vielmetti's summary is spot on.

Here are links for reference: meeting minutes and recording

idvoretskyi commented 2 years ago

@leecalcote invites sent!

vielmetti commented 2 years ago

Reopening this issue to reflect a request for additional folks on the project and to alert @hh of that.

jeefy commented 2 years ago

@vielmetti Where's the request for additional folks? Happy to knock that out!

vielmetti commented 1 year ago

Two requests for this project:

The automation process is not always automatically cleaning up idle machines. The current system count in our DA metro is 33, which is well over expectation. If you could do a quick check and clobber anything that is not actually in use, I'd appreciate it; and if the project could open an issue to track the automation task, that would be helpful too.

Second, we are undertaking a process to turn down old systems in old data centers. I see a single system in NRT1 "ty-c1-small-x86-01" which we would like you to decommission as part of our plan to exit NRT1.

thanks

cc @jeefy @leecalcote

vielmetti commented 1 year ago

This project had two accounts at Equinix ("Service Mesh Benchmarking" and "Service Mesh Performance"). I've updated this issue to reflect the project name in use, and to close out the (empty) duplicate project.

gyohuangxin commented 1 year ago

@vielmetti Thank you for your reminder.

> The automation process is not always automatically cleaning up idle machines. The current system count in our DA metro is 33, which is well over the expectation. If you could go a quick check and clobber anything that is actually not in use I'd appreciate it - and if the project could open up an issue to track the automation task that would be helpful too.

Our project has an automation task to delete machines after the benchmark tests, but a minority of operations may fail, leaving some machines idle. I will delete them immediately and set up a regular reminder to check the idle machine count.

> Second, we are undertaking a process to turn down old systems in old data centers. I see a single system in NRT1 "ty-c1-small-x86-01" which we would like you to decommission as part of our plan to exit NRT1.

I've deleted this machine.
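A periodic idle-machine check like the one described above could be sketched as follows. The device-record shape, tag name, and age cutoff are assumptions for illustration, not the project's actual automation; a real implementation would list and delete devices through the provider's API client:

```python
import time

KEEP_TAG = "long-running"      # e.g. the self-hosted runner node
MAX_AGE_SECONDS = 6 * 3600     # benchmark runs should finish well within this

def find_idle(devices, now=None):
    """Return ids of devices past MAX_AGE_SECONDS that lack the keep tag.

    Each device is a dict: {"id": str, "created": epoch_seconds, "tags": [str]}.
    """
    now = time.time() if now is None else now
    return [
        d["id"] for d in devices
        if KEEP_TAG not in d["tags"] and now - d["created"] > MAX_AGE_SECONDS
    ]

def reap(devices, delete_device, now=None):
    """Delete each idle device via the supplied callable; return the count."""
    idle = find_idle(devices, now)
    for device_id in idle:
        delete_device(device_id)
    return len(idle)
```

Run on a schedule (e.g. a daily workflow), this catches machines whose normal post-benchmark teardown failed, which is exactly the gap that left the DA metro count at 33.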

vielmetti commented 1 year ago

Thanks @gyohuangxin ! All of the idle systems are now cleared out, and the last NRT1 system is gone too. Closing this for now as done.