anuket-project / anuket-specifications

Anuket specifications
https://docs.anuket.io

[RA2 Core]: build a consensus on Kubernetes cluster management and document the result #449

Closed CsatariGergely closed 4 years ago

CsatariGergely commented 4 years ago

In the discussions of #383 there was a disagreement on two topics in the area of Kubernetes cluster management:

The resolution of this issue shall clarify these and document the result.

ASawwaf commented 4 years ago

@CsatariGergely, I like the 3 issues that you opened, but we are repeating ourselves (sorry for that). We first need to define CaaS, the CaaS Manager, and the CIM, including their roles, functions, and components, to be able to decide whether each is part of the NFVO or part of MANO.

From my side, we can consider the CIM part of ETSI MANO. For the positioning of CaaS and the CaaS Manager, I see this as just a conceptual stack.

...

CsatariGergely commented 4 years ago

> @CsatariGergely, I like the 3 issues that you opened, but we are repeating ourselves (sorry for that). We first need to define CaaS, the CaaS Manager, and the CIM, including their roles, functions, and components, to be able to decide whether each is part of the NFVO or part of MANO.
>
> From my side, we can consider the CIM part of ETSI MANO. For the positioning of CaaS and the CaaS Manager, I see this as just a conceptual stack.

...

With this issue I just listed the topics where we had disagreements in the discussion in #383. If you would like to work on any of them, please volunteer in a comment.

tomkivlin commented 4 years ago

My thoughts... The original CNTT scope was the interfaces/capabilities from the NFVI that are consumed by VNFs/applications, plus the interfaces/capabilities provided by VIM that are consumed by VNFMs/NFVOs. (Link)

Extending this to Kubernetes suggests, I think, the following:

  1. The interfaces/capabilities that are consumed by VNFs/applications are now provided by the container runtime (CIS in ETSI terms) rather than the NFVI - the VNFs/applications should never be interacting with the Kubernetes cluster level directly (e.g. the control plane, k8s API server, etc.).
  2. The interfaces/capabilities that are consumed by VNFMs and NFVOs (for when indirect mode is used for VNF management) are provided by the Kubernetes control plane (CISM in ETSI terms, I think?) plus maybe some other services such as logging/monitoring??
  3. The interfaces/capabilities that are consumed by NFVOs (for purposes other than indirect VNF management) are only for Kubernetes cluster LCM I reckon - i.e. to the "CaaS Manager" (no ETSI equivalence) or whatever capability manages the lifecycle of the CISM.

I would suggest that number 3 in the list above is not in scope of RA2, but that 1 and 2 are. Which I think follows the proposed chapter 3/4 structure outlined by @CsatariGergely?

This would also mean that chapter 7 takes the approach of documenting the requirements on a CISM/CIS to support the automated LCM that a "CaaS Manager" might provide, rather than documenting how the LCM happens.

tomkivlin commented 4 years ago

I will try and build a similar diagram to the one I linked to above, to see if we can get agreement on the scope of RA2 - I think that will address this issue then?

tomkivlin commented 4 years ago

Here's my first stab - let me know what you think. Don't worry about the interface names, we can change as needed. https://github.com/cntt-n/CNTT/blob/tomkivlin-patch-2/doc/ref_arch/kubernetes/figures/ch01_cntt_scope_k8s.png

FYI if you want to modify, the PowerPoint is here: https://github.com/cntt-n/CNTT/blob/tomkivlin-patch-2/doc/ref_arch/kubernetes/figures/k8s-ref-arch-figures.pptx

pgoyal01 commented 4 years ago

@tomkivlin IMHO, the Container Infrastructure Service Instance (CISI), and the VM/BM, are part of the NFVI. CISIs are abstractions that host containers.

The Container Infrastructure Service (CIS) provides the infrastructure resources managed by the CISM, the resources being exposed using APIs: CRI, CSI, CNI (for Kubernetes). See IFA029 Section 6.2.2.

Alternate Diagram: image

tomkivlin commented 4 years ago

Hi Pankaj, CIS and CISI being part of NFVI yes I understand that, makes sense. I can't disagree enough about containers being part of NFVI though. The definition of container that we're currently using even specifies that it is a running instance of a piece of software with all dependencies. For me a container is part of the application, not the infrastructure. Would you be ok with NFVI including CISI and CIS but not the pods/containers themselves?

pgoyal01 commented 4 years ago

Tom, thanks. The CISI, as you know, is a Kubernetes node that hosts the pods and containers. My attempt was to keep the diagram consistent with ETSI, where only the VNFs and VNFCs are shown separately even though the software is executing in the VM. But looking at it from the perspective where the workload software is shown separately from the machine it is executing on, I would have to agree with you to move the pods/containers out of the NFVI. Sending you the ppt file in an email.

CsatariGergely commented 4 years ago

We have three options: pre-IFA029, and the two selected alternatives of IFA029. Here is the figure that I would use for the pre-IFA029 option: image

And here are the two alternatives from IFA029: 7.2.4.2 image 7.2.4.4 image

More description of this can be found in IFA029.

pgoyal01 commented 4 years ago

@CsatariGergely There are 2 others: 7.2.4.3 and 7.2.4.5. Maybe we need a discussion session on choosing one for CNTT but keeping in mind the decision has impact on VNFM (e.g., ONAP).

7.2.4.3. image

7.2.4.5 image

tomkivlin commented 4 years ago

I don't think we should be trying to answer the question of if CISM is part of the VIM or not, that's a procurement decision not an architectural one, in my opinion. The capability is distinct, so let's draw it as such. We can then clearly delineate what is in scope or not.

CsatariGergely commented 4 years ago

> @CsatariGergely There are 2 others: 7.2.4.3 and 7.2.4.5. Maybe we need a discussion session on choosing one for CNTT but keeping in mind the decision has impact on VNFM (e.g., ONAP).
>
> 7.2.4.3. image
>
> 7.2.4.5 image

"Option 2 and 4 are determined to be excluded from the target architecture." Option 2 is 7.2.4.3 and option 4 is 7.2.4.5.

pgoyal01 commented 4 years ago

@tomkivlin we all agree that the CISM functionality is distinct from, say, a VIM that manages only VMs as virtual resources. But, IMHO, it is a valid architectural discussion how capabilities are distributed across components and whether an existing component's capabilities need to be enhanced, or the component needs to be replaced with another, etc.

tomkivlin commented 4 years ago

@pgoyal01 good point.

I wonder if it would help to list the capabilities I think are in scope of RA2? We can then draw them into a diagram without needing to get into the discussion of whether or not k8s is part of the VIM, VNFM, etc. I've deliberately used non-IFA029 terms in the table so we (or just I) don't get confused, as there seem to be other terms used in some of the diagrams you've provided above.

| Component/Interface | In Scope of RA2? |
| --- | --- |
| Virtual or physical compute, storage and network infrastructure used by Kubernetes nodes (NFVI) | Yes |
| Virtual or physical management of the above infrastructure (VIM / +PIM?) | Yes |
| Kubernetes node OS | Yes |
| Kubernetes container runtime | Yes |
| Kubernetes worker node services (kubelet, kube-proxy) | Yes |
| Kubernetes configuration store (etcd) | Yes |
| Kubernetes master node services (API server, controller-managers, DNS, CNI, etc.) | Yes |
| Kubernetes master nodes (Kubernetes control plane, etcd, DNS, CNI, etc.) | Yes |
| Multi-cluster lifecycle management capability | No\* |
| Kubernetes objects (pods, config maps, volumes, etc.) | No\*\* |

\* The operations and lifecycle management chapter will address the requirements on the cluster of being able to perform lifecycle management, but won't document the management capability itself.
\*\* The pods, associated containers and other Kubernetes objects are considered application constructs for the purposes of RA2 (following the definition of a container).

What do you think?

pgoyal01 commented 4 years ago

@tomkivlin Agree largely with the components that you have listed. The question is w.r.t. the application constructs -- the Kubernetes objects. Others can chime in here, but I think we need these as part of RA-2.

Are we going to discuss multi-cluster support (not LCM)? Architectural support for both CNFs and VNFs?

petorre commented 4 years ago

@tomkivlin Agree. Generally, in scope should be what is in the Platform (other Kubernetes services, or more functionality to build an MVP PaaS), and out of scope what comes from the Applications' function and control (which should not duplicate platform functionality, but beyond that we don't need to be prescriptive on the How).

tomkivlin commented 4 years ago

From 31.10 meeting: continue to describe in Kubernetes terms the scope / architecture before then mapping back to ETSI terms.

tomkivlin commented 4 years ago

Here is my proposal for the scope of RA2.

image

Regarding a couple of the contentious areas: the purpose of CNTT is to aid the consistency of infrastructure platforms, to make the verification and certification of network software simpler and more efficient. So I think we need to be careful about including things that may well belong in some other Kubernetes reference architecture, and be sure each item is right for CNTT RA2.

Other reasons are:

TamasZsiros commented 4 years ago

On the Kubernetes cluster lifecycle management not being part of RA2: My understanding is that the RM describes generic infra LCM: https://cntt-n.github.io/CNTT/doc/ref_model/chapters/chapter09.html

If we view it as CNTT's high level goal to ensure the consistency of infrastructure platforms AND for various reasons (multi-tenancy, separation, edge) we see an increased number of clusters (compared to e.g. OpenStack), then perhaps it would be better to include it. Otherwise vendors will present different NFVI stacks with differing LCM capabilities, and since the CNFs will depend on separation and multi-tenancy capabilities, they will also implicitly depend on infra LCM capabilities.

I understand this is additional complexity, and perhaps we should draw a line to what extent we describe this, but at least I would list basic capabilities expected and also discuss interfaces / APIs to comply with (e.g. Cluster API)
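To make the "interfaces / APIs to comply with (e.g. Cluster API)" point concrete, here is a minimal sketch of the kind of declarative cluster object a Cluster API based LCM capability operates on. The API version, provider kind, and all names are illustrative assumptions, not CNTT requirements.

```yaml
# Illustrative only: a declarative cluster definition as consumed by a
# Cluster API management cluster. API version, provider kind and names
# are assumptions for this sketch.
apiVersion: cluster.x-k8s.io/v1alpha3
kind: Cluster
metadata:
  name: example-workload-cluster
spec:
  clusterNetwork:
    pods:
      cidrBlocks: ["192.168.0.0/16"]
  controlPlaneRef:
    apiVersion: controlplane.cluster.x-k8s.io/v1alpha3
    kind: KubeadmControlPlane
    name: example-control-plane
  infrastructureRef:
    # The infrastructure provider is deployment-specific
    # (e.g. an OpenStack, vSphere or bare-metal provider).
    apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
    kind: OpenStackCluster
    name: example-workload-cluster
```

The relevant point for RA2 would be the shape of this interface (a declarative object reconciled by a management capability), not any particular provider.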

tomkivlin commented 4 years ago

> I would list basic capabilities expected and also discuss interfaces / APIs to comply with (e.g. Cluster API)

I am tending towards this position too - I see the cluster LCM as being increasingly important (I'm aware that's a difference from my above comment!)

TamasZsiros commented 4 years ago

Leaving Helm out would just mean that the VNF Manager needs to contend with a lower abstraction level (K8s API vs. Helm chart) for describing the target state/configuration of the CNF. I would argue that this would make the VNF Manager more complex (compared to what it could be), which does not align well with the general trend of pushing functionality down from the VNFM to K8s (for example scaling).

So in my view having a package manager in the CaaS (or to be maybe more exact: support an entity in the CaaS that operates on a compact, declarative descriptor as opposed to K8s API calls) brings us closer to the "ideal world" where the VNFM is very slim or non-existent.
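As an illustration of the abstraction gap being discussed, compare the compact, declarative descriptor a package manager consumes with one of the underlying Kubernetes objects the VNFM would otherwise have to create and reconcile through direct API calls. All names and values here are hypothetical.

```yaml
# Hypothetical values.yaml: the compact, declarative descriptor a
# Helm-style package manager operates on.
cnf:
  image: registry.example.com/vendor/example-cnf:1.2.0
  replicas: 3
---
# Without the package manager, the VNFM must drive each underlying
# Kubernetes object itself, e.g. this Deployment (plus Services,
# ConfigMaps, etc.) via direct API calls.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-cnf
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-cnf
  template:
    metadata:
      labels:
        app: example-cnf
    spec:
      containers:
        - name: example-cnf
          image: registry.example.com/vendor/example-cnf:1.2.0
```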

peterwoerndle commented 4 years ago

Good figure @tomkivlin. A few comments:

tomkivlin commented 4 years ago

> VNF Manager needs to contend with an abstraction level that is lower (K8s API vs. Helm chart)

This is where I disagree - the VNF Manager can still use Helm charts. A Helm chart is specific to the software (i.e. the thing the vendor, who provides the VNF Manager, is delivering). Just because we don't include Helm in this RA doesn't mean an NF vendor can't create Helm charts and use the Helm client libraries within their VNF Manager to manage the software deployment.

brings us closer to the "ideal world" where the VNFM is very slim or non-existent.

The thing is, "something" (whether it's a VNFM or NFVO, or something else) would need to have the Helm client (libraries) installed in order to translate the Helm Chart into Kubernetes API calls. I think it is this "something" that we're talking about and given the comments about multiple clusters, I think it makes sense that this "something" isn't included within each cluster (and if it is, what calls upon this "something" in the first place, the NFVO?). Perhaps a very slim generic VNFM which is essentially a Helm client?

tomkivlin commented 4 years ago

How's this look @peterwoerndle, in response to your comments?

image

pgoyal01 commented 4 years ago

@tomkivlin Is the intent only to support CNFs or both VNFs and CNFs as is the likely scenario for the foreseeable future?

TamasZsiros commented 4 years ago

"Perhaps a very slim generic VNFM which is essentially a Helm client?" @tomkivlin so how about suggesting an [optional] Helm v3 client in VNFM (which is typically proprietary anyway)? This way the CaaS stays clean, and a vendor can decide for or against using Helm in the VNFM?

tomkivlin commented 4 years ago

@TamasZsiros that would be my preference, yes.

tomkivlin commented 4 years ago

> @tomkivlin Is the intent only to support CNFs or both VNFs and CNFs as is the likely scenario for the foreseeable future?

Within this RA2 it is CNFs only - it's a Kubernetes Reference Architecture. I think there is a discussion to be had within CNTT about how we want to deal with the following scenarios:

But I think that's out of the scope of RA2.

peterwoerndle commented 4 years ago

@tomkivlin the new figure addresses my comments, thanks.

pgoyal01 commented 4 years ago

@tomkivlin Maybe we need a discussion about the scope of RA-2 at the Technical Steering Committee. I see RA-1 supporting VNFs while RA-2 supporting both VNFs and CNFs and migrating to CNFs in the future.

peterwoerndle commented 4 years ago

@pgoyal01 are you referring to a VNF in the sense of a VM-based application? Generally the term "Kubernetes-based application / VNF" would not prevent deploying a VM-based VNF on top of RA2, as long as Kubernetes is used to manage the workload. My preference would be to start with the established container management in Kubernetes and add the support for VMs using kubevirt, Virtlet, RancherVM, ... in a later revision. From a northbound interface point of view it should not make a major difference in the RA.

pgoyal01 commented 4 years ago

@peterwoerndle Agree on "..as long as Kubernetes is used to manage the workload. " Since we would be in the hybrid world (VNFs and CNFs) with VNFs dominating initially, may I suggest that we include "support for VMs using kubevirt, virtlets, RancherVM, .." from the start.

tomkivlin commented 4 years ago

@pgoyal01 @peterwoerndle I'm comfortable including the management of VMs through Kubernetes in RA2, but I worry that there isn't a mature production-ready option available today that we can standardise on. Another option for the future might be the use of the Operator framework and Custom Resources (similar to Cluster API, but not just for managing Kubernetes clusters).

I also think, as I mentioned above, there needs to be a distinction between VMs managed by Kubernetes (for me, that is a CNF that uses VMs) and VNFs that use VMs. If we are suggesting that VMs are managed through the Kubernetes API for VNFs, are we suggesting Kubernetes becomes a VIM in ETSI NFV v3?? That feels like a lot of change, compared to allowing CNFs to use VMs and whatever we suggest becoming part of ETSI NFV v4 (in time)...
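For reference, managing VMs through the Kubernetes API in the kubevirt style means the consumer interacts with a Custom Resource rather than a VIM. A minimal sketch, with illustrative field values and the API version as it stood around this discussion:

```yaml
# Sketch of a kubevirt-style Custom Resource: the VM is declared through
# the Kubernetes API and reconciled by the kubevirt controllers. All
# values are illustrative.
apiVersion: kubevirt.io/v1alpha3
kind: VirtualMachine
metadata:
  name: example-vm
spec:
  running: true
  template:
    spec:
      domain:
        devices:
          disks:
            - name: rootdisk
              disk:
                bus: virtio
        resources:
          requests:
            memory: 1Gi
      volumes:
        - name: rootdisk
          containerDisk:
            image: registry.example.com/vm-images/example:latest
```

This is the sense in which a "CNF that uses VMs" stays behind the Kubernetes API, as distinct from a VNF managed by a VIM.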

CsatariGergely commented 4 years ago

> On the Kubernetes cluster lifecycle management not being part of RA2: My understanding is that the RM describes generic infra LCM: https://cntt-n.github.io/CNTT/doc/ref_model/chapters/chapter09.html
>
> If we view it as CNTT's high level goal to ensure the consistency of infrastructure platforms AND for various reasons (multi-tenancy, separation, edge) we see an increased number of clusters (compared to e.g. OpenStack), then perhaps it would be better to include it. Otherwise vendors will present different NFVI stacks with differing LCM capabilities, and since the CNFs will depend on separation and multi-tenancy capabilities, they will also implicitly depend on infra LCM capabilities.
>
> I understand this is additional complexity, and perhaps we should draw a line to what extent we describe this, but at least I would list basic capabilities expected and also discuss interfaces / APIs to comply with (e.g. Cluster API)

I do not see how the LCM of the infra is visible to a VNF. Somehow I feel that adding the LCM part is too big a problem domain for the first release.

CsatariGergely commented 4 years ago

> "Perhaps a very slim generic VNFM which is essentially a Helm client?" @tomkivlin so how about suggesting an [optional] Helm v3 client in VNFM (which is typically proprietary anyway)? This way the CaaS stays clean, and a vendor can decide for or against using Helm in the VNFM?

I would not include the VNFM in the RA.

tomkivlin commented 4 years ago

> I would not include the VNFM in the RA.

Nor would I. I think the suggestion is that we don't include Helm in this RA and instead have a statement that it is a VNFM component and up to the VNFM vendor to decide whether they include it or not.

tomkivlin commented 4 years ago

> I do not see how the LCM of the infra is visible to a VNF.

Yes you're right, I'm changing my mind again back to my original position. We just need to be sure we address the points Tamas has made about multitenancy etc.

tomkivlin commented 4 years ago

I've added in "Kubernetes-based Application Artefact Storage" to cover:

image

tomkivlin commented 4 years ago

From Technical Steering Meeting 6/11/19: VM management by Kubernetes is in scope. I will clarify in the diagram.

tomkivlin commented 4 years ago

Here's the update following today's steering meeting. If there are no objections I will draft a PR updating chapter 1 based on this diagram and discussions that have been had.

To clarify, I have added an interface between the Kubernetes Master Node Services and the NFVI - this is to cover an example such as kubevirt that uses Custom Resources to interact with libvirt on nodes. I had also added "or custom controller (e.g. CRDs, operators)" in the interface between Kubernetes Master Node Services and the VIM - to cover those examples that would use a provider that communicates with a VIM, rather than a lower level hypervisor service.

image

CsatariGergely commented 4 years ago

> From Technical Steering Meeting 6/11/19: VM management by Kubernetes is in scope. I will clarify in the diagram.

I would not add kubevirt/Virtlet or anything similar to RA2 in this release yet. I think it is enough if we sort out containers first.

CsatariGergely commented 4 years ago

> Here's the update following today's steering meeting. If there are no objections I will draft a PR updating chapter 1 based on this diagram and discussions that have been had.
>
> To clarify, I have added an interface between the Kubernetes Master Node Services and the NFVI - this is to cover an example such as kubevirt that uses Custom Resources to interact with libvirt on nodes. I had also added "or custom controller (e.g. CRDs, operators)" in the interface between Kubernetes Master Node Services and the VIM - to cover those examples that would use a provider that communicates with a VIM, rather than a lower level hypervisor service.

image

Even if we add hypervisors with a CRI interface (what is a good name for these in general?), I think the interface is not from the master node to the NFVI. According to my understanding:

  • The control communication is between the master node and the worker node (which is also needed in the case of containers with a CRI interface)
  • A hypervisor with a CRI interface will communicate with the libvirt of the Kubernetes worker machine

tomkivlin commented 4 years ago

> I would not add kubevirt/Virtlet or anything similar to RA2 in this release yet. I think it is enough if we sort out containers first.

Let's add it as a header and placeholder, but agreed it's not a priority item.

tomkivlin commented 4 years ago

> Even if we add hypervisors with a CRI interface (what is a good name for these in general?), I think the interface is not from the master node to the NFVI. According to my understanding:
>
> • The control communication is between the master node and the worker node (which is also needed in the case of containers with a CRI interface)
> • A hypervisor with a CRI interface will communicate with the libvirt of the Kubernetes worker machine

That's one type, and you're probably right about the communication channels - I will double check and update when I raise a PR for chapter 1 (will start on that today - let's move some of this more detailed discussion to a PR).

peterwoerndle commented 4 years ago

I agree with @CsatariGergely's comments with regards to the CRI. @tomkivlin, having a dedicated PR on this may also help to schedule it properly for a version of the document (if we decide not to take it into the first version of RA2).