[RA2 Core]: build a consensus on Kubernetes cluster management and document the result

CsatariGergely commented 5 years ago

In the discussions of #383 there was a disagreement on two topics in the area of Kubernetes cluster management:

Is CaaS Manager part of NFVO and if we care about the result in RA2
Scope of one Kubernetes cluster. One VNF, group of VNF from the same vendor, group of VNF-s

The resolution of this issue shall clarify these and document the result.

ASawwaf commented 5 years ago

@CsatariGergely , I like the 3 issues that you opened but we repeat our self ( sorry for that ) we need to define first CaaS , CaaS manager and CIM and role of its function, its components to be able to define its position is part from NFVO, part of MANO

From my side, we can consider CIM part of ETSI MANO , for positioning of CaaS , CaaS manager , i see this is just a stack for a conceptual

...

CsatariGergely commented 5 years ago

@CsatariGergely , I like the 3 issues that you opened but we repeat our self ( sorry for that ) we need to define first CaaS , CaaS manager and CIM and role of its function, its components to be able to define its position is part from NFVO, part of MANO

From my side, we can consider CIM part of ETSI MANO , for positioning of CaaS , CaaS manager , i see this is just a stack for a conceptual

...

With this isses I just list the topics where we had disagreements in the discussion in #383 . If you think you would like to work on any of them please volunteer in a comment.

tomkivlin commented 5 years ago

My thoughts... The original CNTT scope was the interfaces/capabilities from the NFVI that are consumed by VNFs/applications, plus the interfaces/capabilities provided by VIM that are consumed by VNFMs/NFVOs. (Link)

Extending this to Kubernetes suggests, I think, the following:

The interfaces/capabilities that are consumed by VNFs/applications are now provided by the container runtime (CIS in ETSI terms) rather than the NFVI - the VNFs/applications should never be interacting with the Kubernetes cluster level directly (e.g. the control plane, k8s API server, etc.).
The interfaces/capabilities that are consumed by VNFMs and NFVOs (for when indirect mode is used for VNF management) are provided by the Kubernetes control plane (CISM in ETSI terms, I think?) plus maybe some other services such as logging/monitoring??
The interfaces/capabilities that are consumed by NFVOs (for purposes other than indirect VNF management) are only for Kubernetes cluster LCM I reckon - i.e. to the "CaaS Manager" (no ETSI equivalence) or whatever capability manages the lifecycle of the CISM.

I would suggest that number 3 in the list above is not in scope of RA2, but that 1 and 2 are. Which I think follows the proposed chapter 3/4 structure outlined by @CsatariGergely?

This would also mean that chapter 7 takes the approach of documenting the requirements on a CISM/CIS to support the automated LCM that a "CaaS Manager" might provide, rather than documenting how the LCM happens.

tomkivlin commented 5 years ago

I will try and build a similar diagram to the one I linked to above, to see if we can get agreement on the scope of RA2 - I think that will address this issue then?

tomkivlin commented 5 years ago

Here's my first stab - let me know what you think. Don't worry about the interface names, we can change as needed. https://github.com/cntt-n/CNTT/blob/tomkivlin-patch-2/doc/ref_arch/kubernetes/figures/ch01_cntt_scope_k8s.png

FYI if you want to modify, the PowerPoint is here: https://github.com/cntt-n/CNTT/blob/tomkivlin-patch-2/doc/ref_arch/kubernetes/figures/k8s-ref-arch-figures.pptx

pgoyal01 commented 5 years ago

@tomkivlin IMHO, the Container Infrastructure Service Instance (CISI), and the VM/BM, are part of the NFVI. CISI are abstractions that host containers.

Container Infrastructure Service (CIS) provides the infrastructure resources managed by CISM; the resources being exposedusing APIs: CRI, CSI, CNI (for Kubernetes). See IFA029 Section 6.2.2.

Alternate Diagram:

tomkivlin commented 5 years ago

Hi Pankaj, CIS and CISI being part of NFVI yes I understand that, makes sense. I can't disagree enough about containers being part of NFVI though. The definition of container that we're currently using even specifies that it is a running instance of a piece of software with all dependencies. For me a container is part of the application, not the infrastructure. Would you be ok with NFVI including CISI and CIS but not the pods/containers themselves?

pgoyal01 commented 5 years ago

Tom, Thanks. CISI, as you know, is a kubernetes node that hosts the pods and containers. My attempt was to keep the diagram consistent with the ETSiI where only the VNFs and VNFCs are shown separately even though the software is executing in the VM. But looking at it from the perspective where the workload software is shown separately form the machine it is executing on, I would have to agree with you to move the pods/containers out of the NFVI. Sending you the ppt file in an email.

CsatariGergely commented 5 years ago

We have three options: pre IFA029, and the two selected alternatives of IFA029. Here is the figure what I would use for the pre IFA029:

And here are the two alternatives from IFA029: 7.2.4.2 7.2.4.4

More descriptions about this is in IFA029

pgoyal01 commented 5 years ago

@CsatariGergely There are 2 others: 7.2.4.3 and 7.2.4.5. Maybe we need a discussion session on choosing one for CNTT but keeping in mind the decision has impact on VNFM (e.g., ONAP).

7.2.4.3.

7.2.4.5

tomkivlin commented 5 years ago

I don't think we should be trying to answer the question of if CISM is part of the VIM or not, that's a procurement decision not an architectural one, in my opinion. The capability is distinct, so let's draw it as such. We can then clearly delineate what is in scope or not.

CsatariGergely commented 5 years ago

@CsatariGergely There are 2 others: 7.2.4.3 and 7.2.4.5. Maybe we need a discussion session on choosing one for CNTT but keeping in mind the decision has impact on VNFM (e.g., ONAP).

7.2.4.3.

7.2.4.5

", Option 2 and 4 are determined to be excluded from the target architecture. " Option 2 if 7.2.4.3 and option 4 is 7.2.4.5

pgoyal01 commented 5 years ago

@tomkivlin we all agree that the CISM functionality is distinct from say a VIM that manages only VMs as virtual resources. But, IMHO, it is a valid architectural discussion on the distribution of capabilities to components and if an existing component capabilities needs to be enhanced or the component needs to be replaced with another, etc.

tomkivlin commented 5 years ago

@pgoyal01 good point.

I wonder if it would help to list the capabilities I think are in scope of RA2? We can then draw them into a diagram without needing to get into the discussion of whether or not k8s is part of VIM, VNFM, etc. I've deliberately used non-IFA029 terms in the table so we / others / (or just I) don't get confused as there seem to be others used in some of the diagrams you've provided above.

Component/Interface	In Scope of RA2?
Virtual or physical compute, storage and network infrastructure used by Kubernetes nodes (NFVI)	Yes
Virtual or physical management of the above infrastructure (VIM / +PIM?)	Yes
Kubernetes node OS	Yes
Kubernetes container runtime	Yes
Kubernetes worker node services (kubelet, kube-proxy)	Yes
Kubernetes configuration store (etcd)	Yes
Kubernetes master node services (API server, controller-managers, DNS, CNI, etc.)	Yes
Kubernetes master nodes (Kubernetes control plane, etcd, DNS, CNI, etc.	Yes
Multi-cluster lifecycle management capability	No*
Kubernetes objects (pods, config maps, volumes, etc.)	No**

* The operations and lifecycle management chapter will address the requirements on the cluster, of being able to perform lifecycle management, but won't document the management capability itself. ** The pods, associated containers and other Kubernetes objects are considered application constructs for the purposes of RA2 (following the definition of a container)

What do you think?

pgoyal01 commented 5 years ago

@tomkivlin Agree largely with the Components that you have listed. The question is w.r.t. the application constructs -- the Kubernetes objects. Other can chime in here but I think we need these as part of the RA-2.

Are we going to discuss multi-cluster support (not LCM)? Architectural support for both CNFs and VNFs?

petorre commented 5 years ago

@tomkivlin Agree. Generally in scope should be what is in Platform (other Kubernetes services or more functionality to build MVP PaaS) and out of scope what comes from Application(s) function and control (which should not duplicate platform functionality but beyond that don't need to be prescriptive on How).

tomkivlin commented 5 years ago

From 31.10 meeting: continue to describe in Kubernetes terms the scope / architecture before then mapping back to ETSI terms.

tomkivlin commented 5 years ago

Here is my proposal for the scope of RA2.

Regarding a couple of the contentious areas, the purpose of CNTT is to aid the consistency of infrastructure platforms to make the verification and certification of network software simpler and more efficient and so I think we need to be careful not to include what may well be in any other Kubernetes Reference Architecture but be sure it's right for CNTT RA2.

Other reasons are:

CNF Components (Pods, Volumes, ConfigMaps, etc.): The terminology in chapter 1 currently states that a container is a "lightweight and portable executable image that contains software and all of its dependencies". I would argue that because this "image" is defined / managed by the software vendor and not by the "infrastructure" provider that it is not in scope of RA2. My thinking is then that these Kubernetes objects are logical constructs and instead we should include the actual infrastructure interfaces in the scope, rather than the API objects that are used to manage them. Note - the API spec (and therefore the spec of the Pod object, for example) is in scope, as that is part of the Kubernetes Master Node Services capability.
Kubernetes Application Package Management: my preference is that this is not included in the RA2 specification, as Helm doesn't consist of any infrastructure elements and I don't believe it is analogous to OpenStack Heat (which I see more like a closed-loop controller-manager in Kubernetes - it takes a desired state Heat Orchestration Template, compares that to the observed state, calculates the delta(s) and then passes instruction to the relevant OpenStack schedulers). Helm however is not a closed loop orchestration engine in of itself - at its simplest, it is a client application that consumes the Kubernetes API and therefore is something that the software vendors might choose to use for the management of their software, or not. For me this is more analogous to a component of the NFVO or VNFM that is combining HOTs to manage complex "packages". I don't believe it aids the CNTT purpose to include this. This bullet links to #451.
Kubernetes cluster lifecycle management: As per my comment above I think that this LCM capability - the creating, updating, upgrading, deleting of Kubernetes clusters, is not a capability that is in scope of the RA2 (why I've made it dotted line) as it is not a capability that would be used in the management or execution of a CNF or any other application. It would only be used by an entity (NFVO, manual, other) to create Kubernetes clusters ready for a VNFM to consume for the running of the software.

TamasZsiros commented 5 years ago

On the Kubernetes cluster lifecycle management not being part of RA2: My understanding is that the RM describes generic infra LCM: https://cntt-n.github.io/CNTT/doc/ref_model/chapters/chapter09.html

If we view it as CNTT's high level goal to ensure the consistency of infrastructure platforms AND for various reasons (multi-tenancy, separation, edge) we see an increased number of clusters (compared to e.g. Open Stack), then perhaps it would be better to include it. Otherwise vendors will present different NFVI stacks with differing LCM capabilities, and since the CNFs will depend on separation and multi-tenancy capabilities, they will also implicitly depend on infra LCM capabilities.

I understand this is additional complexity, and perhaps we should draw a line to what extent we describe this, but at least I would list basic capabilities expected and also discuss interfaces / APIs to comply with (e.g. Cluster API)

tomkivlin commented 5 years ago

I would list basic capabilities expected and also discuss interfaces / APIs to comply with (e.g. Cluster API)

I am tending towards this position too - I see the cluster LCM as being increasingly important (I'm aware that's a difference from my above comment!)

TamasZsiros commented 5 years ago

Leaving Helm out would just mean that the VNF Manager needs to contend with an abstraction level that is lower (K8s API vs. Helm chart) for describing what is the target state/configuration for the CNF. I would argue that this would result in the VNF Manager getting more complex (compared to what it could be), which is not aligning well to the general trend of pushing down functionality from VNFM to K8s (for example scaling).

So in my view having a package manager in the CaaS (or to be maybe more exact: support an entity in the CaaS that operates on a compact, declarative descriptor as opposed to K8s API calls) brings us closer to the "ideal world" where the VNFM is very slim or non-existent.

peterwoerndle commented 5 years ago

Good figure @tomkivlin. A few comments:

For a generic representation of the k8s masters we should include a interface between the k8s masters an the VIM to represent the Kubernetes Cloud Providers in case we run on a VM infrastructure.
I think it would be more appropriate to refer to "Cloud-Native application/VNF management" or "Kubernetes-based application/VNF management" in the top right box to indicate that this is a particular type of application/VNF management
We typically see deployments where multiple k8s clusters are present, therefore, I think in comparison to VM-type infrastructure deployments the LCM of the actual cluster gains importance and should at least be reflected with some basic requirements in RA2

tomkivlin commented 5 years ago

VNF Manager needs to contend with an abstraction level that is lower (K8s API vs. Helm chart)

This is where I disagree - the VNF Manager can still use Helm charts. A Helm chart is specific to the software (i.e. the thing the vendor, who provides the VNF Manager, is delivering). Just because we don't include Helm in this RA, doesn't mean a NF vendor can't create Helm charts and the Helm client libraries within their VNF Manager to manage the software deployment.

brings us closer to the "ideal world" where the VNFM is very slim or non-existent.

The thing is, "something" (whether it's a VNFM or NFVO, or something else) would need to have the Helm client (libraries) installed in order to translate the Helm Chart into Kubernetes API calls. I think it is this "something" that we're talking about and given the comments about multiple clusters, I think it makes sense that this "something" isn't included within each cluster (and if it is, what calls upon this "something" in the first place, the NFVO?). Perhaps a very slim generic VNFM which is essentially a Helm client?

tomkivlin commented 5 years ago

How's this look @peterwoerndle, in response to your comments?

pgoyal01 commented 5 years ago

@tomkivlin Is the intent only to support CNFs or both VNFs and CNFs as is the likely scenario for the foreseeable future?

TamasZsiros commented 5 years ago

"Perhaps a very slim generic VNFM which is essentially a Helm client?" @tomkivlin so how about suggesting an [optional] Helm v3 client in VNFM (which is typically proprietary anyway)? This way the CaaS stays clean, and a vendor can decide for or against using Helm in the VNFM?

tomkivlin commented 5 years ago

@TamasZsiros that would be my preference, yes.

tomkivlin commented 5 years ago

@tomkivlin Is the intent only to support CNFs or both VNFs and CNFs as is the likely scenario for the foreseeable future?

Within this RA2 it is CNFs only - it's a Kubernetes Reference Architecture. I think there is a discussion to be had within CNTT about how we want to deal with the following scenarios:

Reference Architecture to support VNFs and CNFs at the same time
Reference Architecture to support VM-based and Kubernetes-based CNFs at the same time (no particular reason why a CNF couldn't include some VM-based components)
etc.

But I think that's out of the scope of RA2.

peterwoerndle commented 5 years ago

@tomkivlin the new figure addresses my comments, thanks.

pgoyal01 commented 5 years ago

@tomkivlin Maybe we need a discussion about the scope of RA-2 at the Technical Steering Committee. I see RA-1 supporting VNFs while RA-2 supporting both VNFs and CNFs and migrating to CNFs in the future.

peterwoerndle commented 5 years ago

@pgoyal01 are you referring to a VNF in the sense of a VM-based application? Generally the term "Kubernetes-based application / VNF" would not prevent to deploy a VM-based VNF on top of RA2 as long as Kubernetes is used to manage the workload. My preference would be to start with the established container management in Kubernetes and add the support for VMs using kubevirt, virtlets, RancherVM, ... in a later revision. From a northbound interface point of view it should not make a major difference in the RA.

pgoyal01 commented 5 years ago

@peterwoerndle Agree on "..as long as Kubernetes is used to manage the workload. " Since we would be in the hybrid world (VNFs and CNFs) with VNFs dominating initially, may I suggest that we include "support for VMs using kubevirt, virtlets, RancherVM, .." from the start.

tomkivlin commented 5 years ago

@pgoyal01 @peterwoerndle I'm comfortable including the management VMs through Kubernetes in RA2, but I worry that there isn't a mature production-ready option available today that we can standardise on. Another option for the future might be the use of the Operator framework and Custom Resources (similar to Cluster API, but not just for managing Kubernetes clusters).

I also think, as I mentioned above, there needs to be a distinction between VMs managed by Kubernetes (for me, that is a CNF that uses VMs) and VNFs that use VMs. If we are suggesting that VMs are managed through the Kubernetes API for VNFs, are we suggesting Kubernetes becomes a VIM in ETSI NFV v3?? That feels like a lot of change, compared to allowing CNFs to use VMs and whatever we suggest becoming part of ETSI NFV v4 (in time)...

CsatariGergely commented 5 years ago

On the Kubernetes cluster lifecycle management not being part of RA2: My understanding is that the RM describes generic infra LCM: https://cntt-n.github.io/CNTT/doc/ref_model/chapters/chapter09.html

If we view it as CNTT's high level goal to ensure the consistency of infrastructure platforms AND for various reasons (multi-tenancy, separation, edge) we see an increased number of clusters (compared to e.g. Open Stack), then perhaps it would be better to include it. Otherwise vendors will present different NFVI stacks with differing LCM capabilities, and since the CNFs will depend on separation and multi-tenancy capabilities, they will also implicitly depend on infra LCM capabilities.

I understand this is additional complexity, and perhaps we should draw a line to what extent we describe this, but at least I would list basic capabilities expected and also discuss interfaces / APIs to comply with (e.g. Cluster API)

I do not see how the LCM of the infra is visible for a VNF. Somehow I feel that adding the LCM part is a bit too big problem domain for the first release. ..

CsatariGergely commented 5 years ago

"Perhaps a very slim generic VNFM which is essentially a Helm client?" @tomkivlin so how about suggesting an [optional] Helm v3 client in VNFM (which is typically proprietary anyway)? This way the CaaS stays clean, and a vendor can decide for or against using Helm in the VNFM?

I would not include VNFM to the RA.

tomkivlin commented 5 years ago

I would not include VNFM to the RA.

Nor would I. I think the suggestion is that we don't include Helm in this RA and instead have a statement that it is a VNFM component and up to the VNFM vendor to decide whether they include it or not.

tomkivlin commented 5 years ago

I do not see how the LCM of the infra is visible for a VNF.

Yes you're right, I'm changing my mind again back to my original position. We just need to be sure we address the points Tamas has made about multitenancy etc.

tomkivlin commented 5 years ago

I've added in "Kubernetes-based Application Artefact Storage" to cover:

Container Registry
Helm Chart Repository

tomkivlin commented 5 years ago

From Technical Steering Meeting 6/11/19: VM management by Kubernetes is in scope. I will clarify in the diagram.

tomkivlin commented 5 years ago

Here's the update following today's steering meeting. If there are no objections I will draft a PR updating chapter 1 based on this diagram and discussions that have been had.

To clarify, I have added an interface between the Kubernetes Master Node Services and the NFVI - this is to cover an example such as kubevirt that uses Custom Resources to interact with libvirt on nodes. I had also added "or custom controller (e.g. CRDs, operators)" in the interface between Kubernetes Master Node Services and the VIM - to cover those examples that would use a provider that communicates with a VIM, rather than a lower level hypervisor service.

CsatariGergely commented 5 years ago

From Technical Steering Meeting 6/11/19: VM management by Kubernetes is in scope. I will clarify in the diagram.

I would not yet add kubevirt/virlet or any similar to RA2 in this release yet. I think it is enough if we sort out containers first.

CsatariGergely commented 5 years ago

Here's the update following today's steering meeting. If there are no objections I will draft a PR updating chapter 1 based on this diagram and discussions that have been had.

To clarify, I have added an interface between the Kubernetes Master Node Services and the NFVI - this is to cover an example such as kubevirt that uses Custom Resources to interact with libvirt on nodes. I had also added "or custom controller (e.g. CRDs, operators)" in the interface between Kubernetes Master Node Services and the VIM - to cover those examples that would use a provider that communicates with a VIM, rather than a lower level hypervisor service.

Even is we add hypervisors with CRI interface (what is a good name for these in general?) I think the interface is not from the master node to the NFVI. According to my understanding:

The control communication is between the master node and the worker node (what is needed in case of containers with CRI interface also)
hypervisor with CRI interface will communicate with the libvirt of the Kubernetes Worker Machine

tomkivlin commented 5 years ago

I would not yet add kubevirt/virlet or any similar to RA2 in this release yet. I think it is enough if we sort out containers first.

Let's add it as a header and placeholder, but agreed it's not a priority item.

tomkivlin commented 5 years ago

Even is we add hypervisors with CRI interface (what is a good name for these in general?) I think the interface is not from the master node to the NFVI. According to my understanding:

The control communication is between the master node and the worker node (what is needed in case of containers with CRI interface also)

hypervisor with CRI interface will communicate with the libvirt of the Kubernetes Worker Machine

That's one type, and you're probably right about the communication channels - I will double check and update when I raise a PR for chapter 1 (will start on that today - let's move some of this more detailed discussion to a PR).

peterwoerndle commented 5 years ago

I agree with @CsatariGergely comments with regards to the CRI. @tomkivlin having a dedicated PR on this may also help to schedule it properly for a version of the document (if we decide to not take it into the first version of RA2)

anuket-project / anuket-specifications

[RA2 Core]: build a consensus on Kubernetes cluster management and document the result #449