Prometheus 2.0 scalability experiment #55

Closed brancz closed 6 years ago

brancz commented 6 years ago

First Name

Frederic

Last Name

Branczyk

Email

frederic.branczyk@coreos.com

Company/Organization

CoreOS // Prometheus team member

Job Title

Software Engineer

Project Title

Prometheus

Briefly describe the project

Prometheus is an open source monitoring solution and the 2nd project to join the CNCF. https://prometheus.io/

Which members of the CNCF community and/or end-users would benefit from your work?

Prometheus and Kubernetes users

Is the code that you’re going to run 100% open source? If so, what is the URL or URLs where it is located?

Yes, 100% open source. Multiple projects under the Prometheus organization are intended to be deployed across the nodes of a Kubernetes cluster.

What kind of machines and how many do you expect to use (see: https://www.packet.net/bare-metal/)?

We would like to test with up to 1000 Kubernetes nodes (likely the type 0 machines), plus a couple of high-memory nodes (type 2) to run Prometheus 2.0 on.
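
For a sense of scale, here is a rough back-of-the-envelope estimate of the ingestion load a cluster of that size could generate; the per-node series count and scrape interval below are illustrative assumptions, not figures from our plan:

```python
# Back-of-the-envelope ingestion estimate for the planned cluster size.
# SERIES_PER_NODE and SCRAPE_INTERVAL_S are illustrative assumptions.

NODES = 1000             # target Kubernetes node count from the plan
SERIES_PER_NODE = 1500   # assumed active series per node (kubelet, cAdvisor, node_exporter, ...)
SCRAPE_INTERVAL_S = 15   # assumed scrape interval in seconds

samples_per_second = NODES * SERIES_PER_NODE / SCRAPE_INTERVAL_S
print(f"Estimated ingestion: {samples_per_second:,.0f} samples/s")
# Estimated ingestion: 100,000 samples/s
```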

What OS and networking are you planning to use (see: https://help.packet.net/technical/infrastructure/supported-operating-systems)?

Container Linux by CoreOS

Please state your contributions to the open source community and any other relevant initiatives

The primary goals of this experiment are to:

  1. Determine the maximum practical capacity of a single Prometheus 2.0 server on modern hardware
  2. Find out what the minimum setup for a given number of Kubernetes nodes should be
  3. Find out whether Prometheus can handle the load of 1000 nodes

Further details of the experiment we intend to execute can be found in this spreadsheet: https://docs.google.com/spreadsheets/d/1PazPI-ftZrONhmrXZtBdUuYqk4b4JodPpdL2E8tKt7o/edit?usp=sharing
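
As a rough illustration of how ingestion could be verified during a run, the sketch below polls a Prometheus 2.0 server's HTTP query API for the rate of samples appended to the TSDB head; the server address is a placeholder, and the `requests` dependency is an assumption of the sketch:

```python
# Sketch: read the actual ingestion rate from a running Prometheus 2.0
# server via its HTTP query API. PROM_URL is a placeholder address.
import requests

PROM_URL = "http://prometheus.example:9090"  # placeholder

def ingestion_rate(window: str = "5m") -> float:
    """Samples/s appended to the TSDB head, averaged over `window`."""
    resp = requests.get(
        f"{PROM_URL}/api/v1/query",
        params={"query": f"rate(prometheus_tsdb_head_samples_appended_total[{window}])"},
        timeout=10,
    )
    resp.raise_for_status()
    result = resp.json()["data"]["result"]
    # Instant vector; empty if the metric is absent.
    return float(result[0]["value"][1]) if result else 0.0

if __name__ == "__main__":
    print(f"Current ingestion: {ingestion_rate():,.0f} samples/s")
```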

How will this testing advance cloud native computing (specifically containerization, orchestration, microservices, or some combination)?

This will bring clarity about the resource requirements of Prometheus 2.0 for a given number of collected samples per second, and more specifically will yield recommendations for running it on Kubernetes. Resource planning is a common pain point when operating Prometheus, and as of today there is no guidance on it for users. We are going to publish the results once the experiments are completed.
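
As a sketch of the kind of guidance we hope to produce, the snippet below turns an assumed load into rough disk and memory figures; every constant in it is a ballpark assumption for illustration, not a measured result:

```python
# Rough sizing sketch: translate an assumed load into disk and memory
# figures. Every constant below is a ballpark assumption, not a result.

ACTIVE_SERIES = 1_500_000      # e.g. 1000 nodes x 1500 series each
SAMPLES_PER_SECOND = 100_000   # from the load estimate above
RETENTION_DAYS = 15            # Prometheus 2.0 default retention

BYTES_PER_SAMPLE = 1.5         # assumed on-disk cost after compression
HEAD_BYTES_PER_SERIES = 8_000  # assumed in-memory cost per active series

disk_gb = SAMPLES_PER_SECOND * BYTES_PER_SAMPLE * 86_400 * RETENTION_DAYS / 1e9
head_gb = ACTIVE_SERIES * HEAD_BYTES_PER_SERIES / 1e9
print(f"~{disk_gb:.0f} GB of disk over {RETENTION_DAYS} days, "
      f"~{head_gb:.0f} GB of RAM for the head block")
```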

Any other relevant details we should know about while preparing the infrastructure?

In a meeting between the Prometheus team and Chris Aniszczyk from the CNCF we were told that a Kubernetes cluster could be provisioned for us, which would obviously be preferable, as we then don't have to take care of that setup ourselves (in that case, choose whichever OS you prefer if Container Linux is not an already automated option; note that the Meltdown vulnerability is already fixed on all of its update channels :wink: ). We would be happy to have you provision Kubernetes on the nodes.

/cc @caniszczyk @gouthamve @brian-brazil @juliusv @superq @tomwilkie @fabxc

dankohn commented 6 years ago

+1

juliusv commented 6 years ago

:+1: This will be very useful, thanks!

taylorwaggoner commented 6 years ago

@brancz, I've set this project up in Packet. Thank you!

vielmetti commented 6 years ago

Thanks! I've scheduled a call to talk to @brancz

krasi-georgiev commented 6 years ago

@brancz is catching up with other tasks, so I am trying to get this going. I want to deploy the k8s cluster using https://github.com/crosscloudci/cross-cloud and to add some CI tests, which we will then use to open a PR against https://github.com/crosscloudci/crosscloudci.

I am krasi on #prometheus-dev, or let me know where else I can ping someone to get access.

dankohn commented 6 years ago

@taylorwaggoner, could you please connect with him about providing these resources?

taylorwaggoner commented 6 years ago

@krasi-georgiev - if you could please provide your email address, I will send you an invite to Packet. Thanks!

taylorwaggoner commented 6 years ago

@krasi-georgiev - I sent you an invitation from Packet to the Prometheus 2.0 project. Thanks!

krasi-georgiev commented 6 years ago

Got it, thanks. I will try the CNCF k8s deployment tomorrow.

krasi-georgiev commented 6 years ago

@taylorwaggoner my first attempt to deploy the k8s cluster using https://github.com/crosscloudci/cross-cloud is failing: it seems that we also need an account with https://dnsimple.com/ for name resolution.

```
-e PACKET_AUTH_TOKEN=secret
-e TF_VAR_packet_project_id=secret
-e DNSIMPLE_TOKEN=secret
-e DNSIMPLE_ACCOUNT=secret
```

Is this something that can also be provided by the CNCF?

I also opened a PR to see if this can be avoided, or whether we can use another provider that offers free accounts for FOSS projects:

https://github.com/crosscloudci/cross-cloud/pull/123

krasi-georgiev commented 6 years ago

You can ignore this request; it seems they have moved away from DNSimple and the README is out of date.

pengjiang80 commented 6 years ago

Any report for the experiment?

brancz commented 6 years ago

Due to circumstances (the CoreOS acquisition), the actual execution of this experiment has been postponed for a bit, but the majority of the automation is done. Once we're actually done, we will publish all results and announce them publicly. Sorry for the delay.

pengjiang80 commented 6 years ago

@brancz Thanks for the quick reply. Looking forward to the final result.

pengjiang80 commented 5 years ago

@brancz Is there any update for this scalability experiment? Thanks a lot.

brancz commented 5 years ago

@pengjiang80 Things got out of hand and we never ended up actually running this exact experiment, but within Red Hat we did run similar experiments and documented our findings in the Prometheus capacity planning document: https://docs.openshift.com/container-platform/3.11/scaling_performance/scaling_cluster_monitoring.html#cluster-monitoring-capacity-planning

pengjiang80 commented 5 years ago

@brancz Thanks for the information.

vielmetti commented 3 years ago

I've torn down the project associated with this request, as the task was completed.