cncf / cluster

🖥🖥🖥🖥CNCF Community Cluster
https://cncf.io/cluster

Kata Containers metrics CI Jenkins slave request #83

Open grahamwhaley opened 5 years ago

grahamwhaley commented 5 years ago

Please fill out the details below to file a request for access to the CNCF Community Infrastructure Lab. Please note that access is targeted to people working on specific open source projects; this is not designed just to get your feet wet. The most important answer is the URL of the project you'll be working with. If you're looking to learn Kubernetes and related technologies, please try out Katacoda.

First and Last Name

Graham Whaley

Email

graham.whaley@intel.com

Company/Organization

Intel

Job Title

Senior Software Engineer

Project Title

Kata Containers

Briefly describe the project

Open source, multi-architecture community collaboration to develop virtual-machine-based container runtimes and deliver their integration into common container infrastructures and orchestration (OCI, Docker, Kubernetes, etc.)

Which members of the CNCF community and/or end-users would benefit from your work?

The obvious member is Kubernetes, who already work closely in conjunction with the Kata Containers community to ensure Kubernetes and virtual machine container runtimes are a natural and compatible fit.

Is the code that you’re going to run 100% open source? If so, what is the URL or URLs where it is located? What is your association with that project?

Yes, 100% open source and up on github: https://github.com/kata-containers

What kind of machines and how many do you expect to use (see: https://www.packet.net/bare-metal/)?

Prediction is 2x t1.small.x86 machines, running 24/7-ish. Our current jenkins CI backlog across all the repositories pretty much consumes one whole machine (and it is not compute bound).

We can start/trial with just one t1.small.x86 (for PR CI), and later add another to support master branch merge regression checking.

What OS and networking are you planning to use (see: https://help.packet.net/technical/infrastructure/supported-operating-systems)?

I would expect Ubuntu 18.04

Please state your contributions to the open source community and any other relevant initiatives

Previously having worked on a new architecture addition to the Linux kernel which eventually made it into the upstream, for the last 2+ years I have been focussed on the open source Clear Containers (https://github.com/clearcontainers), now Kata Containers.

Any other relevant details we should know about?

I expect us to tie the machines as Jenkins slaves into our existing Jenkins master at http://jenkins.katacontainers.io/, and dedicate them to metrics CI builds only.

Kata Containers is umbrella'd under the OpenStack Foundation, but is not part of the OpenStack project.

dankohn commented 5 years ago

Could you please describe what you'd like to actually do? We're open to supporting you, but would like to confirm that Intel or OpenStack infrastructure cannot meet your needs.

jacobsmith928 commented 5 years ago

@dankohn if this use case (e.g. ongoing CI infra) doesn't fit well into the CIL, we can work with @grahamwhaley separately on an arrangement.

dankohn commented 5 years ago

@grahamwhaley Thanks for the reference to https://github.com/kata-containers/ci/issues/6

I checked with @jacobsmith928 and we have a thumbs up for you to go forward.

My request (both for Community Infrastructure Lab policy and for best practice) is that you make 100% of your continuous integration code open source (other than confidential tokens, obviously).

+1

grahamwhaley commented 5 years ago

That's fantastic news, @dankohn @jacobsmith928. I know you've probably gotten the details from kata-containers/ci#6, but I'll drop a summary here for the record.

For Kata Containers we have a set of metrics tests that we'd like to run in a CI to both:

Due to the nature of the majority of the tests, we need to run these in a reproducible manner (otherwise we cannot regression check or compare over time), and that thus mandates either bare metal machines or dedicated cloud servers (that support nested VMs), with no noisy neighbour effects etc.
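As a rough illustration of the kind of regression check that reproducible results make possible, here is a minimal Python sketch. It is not taken from the Kata metrics code: the function name `check_regression`, the baseline, the tolerance, and the sample values are all hypothetical placeholders.

```python
from statistics import mean


def check_regression(samples, baseline, tolerance_pct):
    """Compare the mean of several runs against a stored baseline.

    Returns (ok, observed_mean, deviation_pct), where ok is True when the
    mean stays within +/- tolerance_pct of the baseline.
    """
    observed = mean(samples)
    deviation_pct = (observed - baseline) / baseline * 100.0
    return abs(deviation_pct) <= tolerance_pct, observed, deviation_pct


# Placeholder numbers for illustration only, not real measurements.
ok, observed, deviation = check_regression(
    [1.42, 1.47, 1.45], baseline=1.45, tolerance_pct=5.0
)
print(f"within tolerance: {ok} (mean={observed:.3f}, deviation={deviation:+.2f}%)")
```

A check like this is only meaningful when run-to-run noise on the machine is well below the tolerance, which is why noisy-neighbour effects matter here.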

We have struggled to find any suitable hardware so far, and hence the request here. OSF does not have access to such hardware. We plan to run this all under Jenkins. These machines will be new dedicated metrics slaves controlled by the existing kata Jenkins CI master (that is hosted under the OSF resource kindly donated by vexxhost).

All of the code and configs will be fully open sourced. Almost all of it is already open:

and we intend to publish all the Jenkins details/configs we can, like we already publish all the Jenkins configs (apart from the secrets ;-) ) for the parallel QA CI: https://github.com/kata-containers/ci/tree/master/jenkins

Thanks!

taylorwaggoner commented 5 years ago

@grahamwhaley - I have invited you to the Kata Containers project in Packet. Please let me know if you have any questions!

grahamwhaley commented 5 years ago

Thanks @taylorwaggoner I've accepted the invite, and can see the Kata Containers project within the CNCF org on packet.net. It's my first time deploying on packet.net, so I need to go do a bit of readup on how to set up the deployment and then how we tie that into Jenkins (if we have a nodepool for instance etc.). That'll take me a little bit of time. I'll post a status update here when things are up (or stuck ;-) ).

Many thanks everybody!

jacobsmith928 commented 5 years ago

@grahamwhaley definitely ping @vielmetti or me in our community slack and we can help as needed.

vielmetti commented 5 years ago

A specific issue we are tracking is JCLOUDS-1219, and its related GitHub issue https://github.com/jclouds/jclouds-labs/pull/337

grahamwhaley commented 5 years ago

I just noticed on slack that the t1.small.x86 machines now come in two flavours (4 and 8 core?). Given I need repeatability of metrics tests in order for the CI to spot regressions, and the fact that right now I cannot get jenkins to deploy a packet machine on-demand via the jclouds plugin, I think the prudent way forwards is to deploy and assign a t1.small.x86 machine 24/7 to the task of Kata metrics CI. Any objections to that on resource sharing/utilisation/cost grounds etc.? @dankohn @jacobsmith928 @vielmetti

jacobsmith928 commented 5 years ago

@grahamwhaley yes that is fine.

grahamwhaley commented 5 years ago

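Hi. I'm seeing more variance (noise) in the metrics results than I expected on the t1.small.x86 machine. It also seems that the machine is running Kata itself very slowly (8s to get into a container, rather than the <1.5s I was expecting). Can I request access to an x86 machine from the next tier up (I would grab the name, but I'm having difficulty getting to the packet web pages with that info right now - it might be the c1.small ?), the goals being:

  • confirm whether the variance and slowness are specific to that machine type/tier
  • if not, then debug using the new machine, whilst leaving the t1.small running the CI builds
  • if the problems do only manifest on the t1.small.x86, then it is likely I will ask whether we can move from the t1.small to the next tier.

Many thanks.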

dankohn commented 5 years ago

Certainly. Please go ahead.

On Thu, Oct 4, 2018 at 9:29 AM Graham Whaley notifications@github.com wrote:

Hi. I'm seeing more variance (noise) in the metrics results than I expected on the t1.small.x86 machine. It also seems that the machine is running Kata itself very slowly (8s to get into a container, rather than the <1.5s I was expecting). Can I request access to an x86 machine from the next tier up (I would grab the name, but I'm having difficulty getting to the packet web pages with that info right now - it might be the c1.small ?), the goals being:

  • confirm whether the variance and slowness are specific to that machine type/tier
  • if not, then debug using the new machine, whilst leaving the t1.small running the CI builds
  • if the problems do only manifest on the t1.small.x86, then it is likely I will ask whether we can move from the t1.small to the next tier.

Many thanks.

jacobsmith928 commented 5 years ago

I would recommend a c2.medium.x86 (if you need more cores) or a c1.small.x86 if you just need faster cores.

lukaszgryglicki commented 5 years ago

I've found the arm64 one VERY fast and cheap - but all your packages/tools need to support ARM then. It has 96 cores, for instance, and 128G of memory.

grahamwhaley commented 5 years ago

Thanks. We don't generally need amazing speed or number of cores (I test on my desk with an i5 2/4 core NUC for instance), which is why I thought we'd be fine on the t1.small. I'll start with the c1.small and see what I find. @lukaszgryglicki - our sw stack does support ARM (and IBM Power and Z as well as x86). I'll /cc in @kalyxin02 here for reference, who has been running up the Jenkins QA CI on ARM for Kata. Some of that, or closely related, work I believe is via Packet/WorksOnARM. Once the QA CI is stable on ARM then I'd expect we would move to looking at a metrics CI setup as well.

Thanks folks - and the speedy replies are appreciated.

grahamwhaley commented 5 years ago

Update time then.

We are still utilising a t1.small 24/7 for our jenkins metrics tracking (http://jenkins.katacontainers.io/computer/x86_packet01/builds), now running the jenkins jobs on the bare metal (which involves us trying to keep the machine clean after each run, which is fun). I'd move to an 'on demand' model if we had a way to deploy from Jenkins (jclouds plugin still not working for packet.net afaik).

I had another t1.small up for quite some time whilst I was debugging some of the jenkins hangups we had on that machine. I've taken that down now.

I've just run up a c1.small 24/7 as another jenkins slave (http://jenkins.katacontainers.io/computer/x86_packet_elk01/) to track master branch merges, with the intention of injecting the results into an ELK stack for metrics tracking of the project over time. Again, we'd do that on-demand if we had a method via Jenkins.

I should also note I put together some ansible scripts to deploy the slaves in the correct configurations to run our Jenkins CI slave tasks.
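For readers unfamiliar with feeding CI results into an ELK stack, the following Python sketch shows one plausible way to shape a single metrics result as a JSON document before shipping it. It is an assumption-laden illustration, not the Kata implementation: the function name `to_metrics_document` and every field name other than the conventional `@timestamp` are hypothetical, and the transport to Elasticsearch/Logstash is deliberately left out.

```python
import json
from datetime import datetime, timezone


def to_metrics_document(test_name, value, units, commit, timestamp=None):
    """Serialise one metrics result as a JSON string.

    All field names except "@timestamp" are hypothetical; a real pipeline
    would use whatever schema its Elasticsearch index expects.
    """
    ts = timestamp or datetime.now(timezone.utc)
    return json.dumps(
        {
            "@timestamp": ts.isoformat(),
            "test": test_name,
            "value": value,
            "units": units,
            "commit": commit,
        },
        sort_keys=True,
    )


# Placeholder values for illustration only.
print(to_metrics_document("boot-time", 1.5, "s", "abc123"))
```

Tagging each document with the commit under test is what allows results from master branch merges to be plotted over the life of the project.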

grahamwhaley commented 5 years ago

Update. I'm going to move the Kata PR metrics CI slave from a t1 to a c1 instance. The results from the t1 have ended up being just too 'noisy' to make reasonable regression checks, and the c1 looks to produce much more stable results (for our Kata tests at least). For reference, a couple of examples of our memory footprint and 'boot container' measures on the two systems. Note, in this instance it is not so much the absolute figures obtained, but the repeatability between runs that matters.

[Image: memory footprint results on the two systems]

[Image: 'boot container' time results on the two systems]
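One common way to put a number on "repeatability between runs" is the coefficient of variation (standard deviation as a percentage of the mean). The Python sketch below is only an illustration of that idea; it is not necessarily how the Kata metrics checks are computed, and the sample values are made-up placeholders rather than results from the t1 or c1 machines.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Sample standard deviation as a percentage of the mean.

    Needs at least two samples; lower values mean more repeatable runs.
    """
    return stdev(samples) / mean(samples) * 100.0


# Made-up placeholder runs, not real measurements from either machine.
noisy_runs = [10.0, 14.0, 7.0, 12.0]
steady_runs = [10.0, 10.2, 9.9, 10.1]

print(f"noisy:  {coefficient_of_variation(noisy_runs):.1f}%")
print(f"steady: {coefficient_of_variation(steady_runs):.1f}%")
```

Because the measure is relative to the mean, it lets two machines with very different absolute figures be compared purely on how consistent their results are.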

dankohn commented 5 years ago

+1. Thanks for letting us know.

grahamwhaley commented 4 years ago

Hi. Can I request we add another Kata member to the CNCF/Kata org on packet.com so we have more than one point of failure (me :-) ) for (re-)creating instances? I'd like to suggest we add @chavafg, who is the high level owner of the Kata CI systems. Let me know if you'd like me to open a fresh Issue for this. And, continued many thanks for access to the resources :-).

taylorwaggoner commented 4 years ago

@grahamwhaley I've added salvador.fuentes@intel.com to the Kata Containers project in Packet. Thanks!

chavafg commented 4 years ago

Hi @taylorwaggoner,

I logged into Packet, but it seems that I still cannot see the Kata project. I only see a Create Organization option. Any idea? Is there something else I should do?

Thank you!

taylorwaggoner commented 4 years ago

@chavafg I believe you should have received an invitation to that specific Packet project. You would need to click the link in the email to accept the invitation. Did you do that? Thanks!

chavafg commented 4 years ago

Hmm, searching through my inbox (and junk email) I can't see it. Could you please help by re-sending the invitation? Thanks :)

taylorwaggoner commented 4 years ago

Please confirm that salvador.fuentes@intel.com is the correct email address. I tried to resend it and got an error message that it was unable to send, so I'm guessing the invitation also did not go through the first time I tried.

chavafg commented 4 years ago

yes, that is the correct email address: salvador.fuentes@intel.com

BTW, I created the packet account using that email last Friday, and remember that I had to use the re-send confirmation email as it didn't arrive the first time.

Thanks for your help.

vielmetti commented 4 years ago

@chavafg I resent the invitation at about 4:22 pm Eastern on 2020-03-30 (i.e. just now), let me know when you are in.

chavafg commented 4 years ago

@taylorwaggoner I just received the invitation and could access the Kata project. Thanks for your help :)

chavafg commented 4 years ago

@vielmetti, thanks both, I am now in :)

devimc commented 4 years ago

Hi everybody, I'm starting to play with VFIO/SRIOV in Kata Containers. I already have some VFIO tests using virtio devices, and I was planning to add more VFIO tests, but now with real hw (GPUs, NICs, etc.). I was wondering if the metrics node could be upgraded to a node that supports SRIOV/VFIO with an extra NIC or GPU, so that I could use it to test VFIO/SRIOV. Thanks in advance; any comment/help would be appreciated.

chavafg commented 3 years ago

Hello,

Can I request access for another member of the Kata team to the CNCF/Kata org on packet.com? @grahamwhaley is now retired and I am the only point of contact for this, so I would like to have someone else able to access the servers in case I am unavailable. I'd like to suggest we add @amshinde - archana.m.shinde@intel.com.

Thank you very much for all your support. /cc @vielmetti @taylorwaggoner

taylorwaggoner commented 3 years ago

@chavafg I've invited archana.m.shinde@intel.com to the Kata project in Packet. Thanks!

amshinde commented 3 years ago

thanks @taylorwaggoner

chavafg commented 3 years ago

thanks @taylorwaggoner :)

chavafg commented 3 years ago

Hello,

I would like to check whether you are OK with us deploying an additional c1.small.x86 server for running our metrics CI for the 2.x branch of Kata. We currently use one c1.small.x86 for the 1.x branch, but we are now in a phase of supporting both the 1.x and 2.x versions of Kata for a period of 6 months. For us it would be better to have another machine so we do not pollute the environments between the two Kata versions. After those 6 months we plan to deprecate 1.x and will be able to shut down one of the 2 machines.

Thanks in advance for your support. cc @taylorwaggoner @vielmetti @dankohn @jacobsmith928

vielmetti commented 3 years ago

Sounds like a good approach to me, @taylorwaggoner @idvoretskyi can you confirm?

taylorwaggoner commented 3 years ago

Sounds good to me @vielmetti

chavafg commented 3 years ago

@vielmetti @taylorwaggoner thanks for your support. I have deployed the new server.

idvoretskyi commented 2 years ago

This can be closed.

vielmetti commented 1 year ago

@chavafg @jeefy

Reopening this to handle a data center migration task.

There is a single machine currently in use in the Kata Containers project, "kata-metric6", in our SJC1 data center. That data center is closing.

We have capacity in our SV data center (Silicon Valley, same metro) available for you to set up a new system in.

Our hardware options have changed somewhat, and the legacy c1.small system you have is no longer in our current stock. I would recommend one of our m3.small systems as a likely option as an alternative.

thanks!

vielmetti commented 1 year ago

Bringing this to the attention of @gabyCT who hopefully can direct appropriately.

vielmetti commented 10 months ago

The Kata Containers data center migration per above has completed successfully.

There's one more administrative thing to do: move this project from CNCF sponsorship to OpenInfra Foundation sponsorship. No action is necessary on your part at this time; it's fundamentally an accounting issue and not a technical issue. When the time comes I'll coordinate with the OpenInfra team to take over administration of the account. There should be no need to change any of the machines.

idvoretskyi commented 2 months ago

A kind check with @vielmetti if any progress has happened here :)

vielmetti commented 1 month ago

@idvoretskyi Meeting scheduled for later this week to discuss, thanks!