elastic / cloud-on-k8s

Elastic Cloud on Kubernetes

1.7.0 installation doesn't work on k8s 1.16 or 1.17 #4737

Closed thbkrkr closed 3 years ago

thbkrkr commented 3 years ago

The installation of ECK 1.7.0 on Kubernetes 1.16 and 1.17 using Helm or the YAML manifests returns this error:

error: SchemaError(co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates): array should have exactly one sub-item

To reproduce:

> kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.17", GitCommit:"f3abc15296f3a3f54e4ee42e830c61047b13895f", GitTreeState:"clean", BuildDate:"2021-01-13T13:21:12Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.17", GitCommit:"f3abc15296f3a3f54e4ee42e830c61047b13895f", GitTreeState:"clean", BuildDate:"2021-05-18T07:11:14Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}

> helm repo add elastic https://helm.elastic.co && helm repo update && helm install elastic-operator elastic/eck-operator -n elastic-system --create-namespace

"elastic" already exists with the same configuration, skipping
Hang tight while we grab the latest from your chart repositories...
...Successfully got an update from the "elastic" chart repository
Update Complete. ⎈Happy Helming!⎈
NAME: elastic-operator
LAST DEPLOYED: Thu Aug  5 18:05:55 2021
NAMESPACE: elastic-system
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
1. Inspect the operator logs by running the following command:
   kubectl logs -n elastic-system sts/elastic-operator

> kubectl explain es
error: SchemaError(co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates): array should have exactly one sub-item

> cat <<EOF | kubectl apply -f -
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.14.0
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
EOF
error: SchemaError(co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates): array should have exactly one sub-item

> kubectl create -f https://download.elastic.co/downloads/eck/1.7.0/crds.yaml
customresourcedefinition.apiextensions.k8s.io/agents.agent.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/apmservers.apm.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/beats.beat.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/elasticmapsservers.maps.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/elasticsearches.elasticsearch.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/enterprisesearches.enterprisesearch.k8s.elastic.co created
customresourcedefinition.apiextensions.k8s.io/kibanas.kibana.k8s.elastic.co created

> kubectl apply -f https://download.elastic.co/downloads/eck/1.7.0/operator.yaml
error: SchemaError(co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates): array should have exactly one sub-item

Workarounds:

NOTE: We haven't yet found a workaround for Helm and will update the issue when we do.

This appears to be a client-side validation issue, since passing --validate=false works around it.

thbkrkr commented 3 years ago

I did some more tests: although kubectl explain es fails, kubectl apply -f quickstart.yml works if kubectl <= 1.16.

travisghansen commented 3 years ago

I just tried to update from 1.5.0 -> 1.7.0 and ran into this issue on a 1.19 cluster.

In my case I also ended up unable to update all kinds of (totally unrelated) objects because of the validation problem.

idanmo commented 3 years ago

Hi @travisghansen, can you please share your kubectl version (the output of the kubectl version command)?

travisghansen commented 3 years ago

Yeah, v1.21 and then whatever is bundled with argocd 2.0.5.

malcolm061990 commented 3 years ago

The same issue.

kubectl version
Client Version: version.Info{Major:"1", Minor:"22", GitVersion:"v1.22.0", GitCommit:"c2b5237ccd9c0f1d600d3072634ca66cefdf272f", GitTreeState:"clean", BuildDate:"2021-08-04T17:56:19Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"17+", GitVersion:"v1.17.17-eks-087e67", GitCommit:"087e67e479962798594218dc6d99923f410c145e", GitTreeState:"clean", BuildDate:"2021-07-31T01:39:55Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
pebrc commented 3 years ago

@travisghansen I cannot reproduce the issue on a 1.19 cluster with a v1.21 kubectl client.

pebrc commented 3 years ago

Our Elasticsearch CRD has an array type without any sub-schema for its items:

kubectl get --raw /openapi/v2 | jq '.definitions["co.elastic.k8s.elasticsearch.v1.Elasticsearch"].properties.spec.properties.nodeSets.items.properties.volumeClaimTemplates'

{
  "description": "VolumeClaimTemplates is a list of persistent volume claims to be used by each Pod in this NodeSet. Every claim in this list must have a matching volumeMount in one of the containers defined in the PodTemplate. Items defined here take precedence over any default claims added by the operator with the same name.",
  "type": "array",
  "x-kubernetes-preserve-unknown-fields": true
}

This is totally fine according to
https://github.com/kubernetes/kube-openapi/blob/1a6458611d189dc17e98a0824dc92536365efedf/pkg/util/proto/document.go#L214-L218

Note the TODO(wrong) comments. However, kubectl, which uses the kube-openapi library, cannot handle this case because those TODOs have not been addressed:

if len(s.GetItems().GetSchema()) != 1 {
    // TODO(wrong): Items can have multiple elements. We can ignore Items then (would be incomplete), but we cannot return an error.
    // TODO(wrong): "type: array" witohut any items at all is completely valid.
    return nil, newSchemaError(path, "array should have exactly one sub-item")
}
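The Go check quoted above can be sketched in Python (a hypothetical simplification for illustration, not the actual kube-openapi code) to show why a schema with type: array but no items entry trips the error:

```python
# Hypothetical, simplified re-implementation of the kube-openapi check quoted
# above: an "array" schema whose "items" does not contain exactly one
# sub-schema is rejected with "array should have exactly one sub-item".
def parse_array_schema(path: str, schema: dict):
    items = schema.get("items")
    # kube-openapi expects exactly one sub-schema under "items"
    sub_schemas = [] if items is None else [items]
    if len(sub_schemas) != 1:
        raise ValueError(f"SchemaError({path}): array should have exactly one sub-item")
    return sub_schemas[0]

# The volumeClaimTemplates schema served by a k8s 1.16/1.17 apiserver
# (as shown by the jq command above) has no "items" key at all:
vct = {
    "type": "array",
    "x-kubernetes-preserve-unknown-fields": True,
}

err = None
try:
    parse_array_schema("co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates", vct)
except ValueError as e:
    err = str(e)
print(err)
```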

Kubernetes seems to have "fixed" this issue in later versions server-side by changing the data returned by the OpenAPI endpoint that kubectl uses for validation. Specifically they are pruning the type attribute from the response.

For comparison purposes the same command on K8s v1.19: kubectl get --raw /openapi/v2 | jq '.definitions["co.elastic.k8s.elasticsearch.v1.Elasticsearch"].properties.spec.properties.nodeSets.items.properties.volumeClaimTemplates'

{
  "description": "VolumeClaimTemplates is a list of persistent volume claims to be used by each Pod in this NodeSet. Every claim in this list must have a matching volumeMount in one of the containers defined in the PodTemplate. Items defined here take precedence over any default claims added by the operator with the same name.",
  "x-kubernetes-preserve-unknown-fields": true
}
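As a rough illustration of the server-side fix (the exact pruning rule shown here is an assumption inferred from the two outputs above, not the actual apiserver code), the change amounts to something like:

```python
# Hypothetical sketch of the server-side pruning described above: drop the
# "type" attribute from an array schema that has no "items" sub-schema, so
# that older kubectl clients no longer choke on it during validation.
def prune_for_openapi_v2(schema: dict) -> dict:
    s = dict(schema)
    if s.get("type") == "array" and "items" not in s:
        s.pop("type")  # roughly what the 1.18/1.19 apiserver fix does
    return s

# Schema as served by a k8s 1.16/1.17 apiserver (description abbreviated):
served_by_117 = {
    "description": "VolumeClaimTemplates is a list of persistent volume claims ...",
    "type": "array",
    "x-kubernetes-preserve-unknown-fields": True,
}

# After pruning, this matches the shape served by a k8s 1.19 apiserver:
served_by_119 = prune_for_openapi_v2(served_by_117)
print(served_by_119)
```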

I believe these are the two relevant PRs (v1.18 and v1.19) in Kubernetes.

I don't see a way at the moment to tweak our CRDs to avoid the issue other than not using structural schema but we are already offering a non-structural version as *-legacy.yaml manifests.

malcolm061990 commented 3 years ago

@travisghansen I cannot reproduce the issue on a 1.19 cluster with a v1.21 kubectl client.

Try to reproduce it on a 1.17 k8s cluster. It is the minimum supported version, but still supported :)

travisghansen commented 3 years ago

I don’t think it’s an issue with the CRD itself but with the validation webhook.

I’m not sure if it’s relevant, but I think the clusterroles may all have had permissions on CRDs: not specific CRDs per se, just the ability to read/whatever CRDs generally.

pebrc commented 3 years ago

I don’t think it’s an issue with the CRD itself but with the validation webhook.

The root cause is still in the CRD. I updated my original post to add a bit more explanation of my current understanding. Tweaking the clusterroles to be less restrictive might avoid the error at installation time, but you would still run into it every time you deploy an Elasticsearch cluster or run any other kubectl operation involving that CRD (e.g. kubectl explain es).

thbkrkr commented 3 years ago

The bug only occurs on k8s 1.16 and 1.17 with kubectl > 1.16. I couldn't reproduce it with other combinations of versions.

I don't see a way at the moment to tweak our CRDs to avoid the issue other than not using structural schema but we are already offering a non-structural version as *-legacy.yaml manifests.

We could remove type: array from the volumeClaimTemplates definition in the CRDs. This makes the bug disappear. The drawback is that we lose the nice validation that we have starting with k8s 1.18, thanks to the structural schema:

> kubectl apply -f quickstart.yml # where I put an object in the volumeClaimTemplates instead of an array
The Elasticsearch "quickstart" is invalid: spec.nodeSets.volumeClaimTemplates: Invalid value: "object": spec.nodeSets.volumeClaimTemplates in body must be of type array: "object"

Instead we get a more obscure error:

Error from server: error when creating "q.yml": admission webhook "elastic-es-validation-v1.k8s.elastic.co" denied the request: v1.Elasticsearch.Spec: v1.ElasticsearchSpec.NodeSets: []v1.NodeSet: v1.NodeSet.VolumeClaimTemplates: []v1.PersistentVolumeClaim: decode slice: expect [ or n, but found {, error found in #10 byte of ...|mplates":{"metadata"|..., bigger context ...|count":1,"name":"default","volumeClaimTemplates":{"metadata":{"name":"elasticsearch-data"},"spec":{"|...

The above error prevents us from adding our own validation.

For now, I see 3 options:

  • document the bug and the workarounds
  • generate a third version of the CRDs without type: array to target only k8s 1.16 and 1.17
  • generate the CRDs without type: array to target k8s >= 1.16, with the drawback of losing the nice validation for the volumeClaimTemplates field

malcolm061990 commented 3 years ago

Good news, thanks :) I think it's a good idea to document the bug with the workaround and to remove 1.17 from the supported versions.

travisghansen commented 3 years ago

For the record, I hit the issue with different combinations of kube/kubectl versions. I should note that I manage the CRDs with ArgoCD, so when I upgrade/downgrade the chart version I simultaneously upgrade/downgrade the CRDs.

tanxiaoning007 commented 3 years ago

I don’t think we need to add x-kubernetes-preserve-unknown-fields to volumeClaimTemplates, because we sometimes need to use local storage to deploy ECK for testing.

--- crds-orig.yaml  2021-08-12 15:06:03.933995197 +0800
+++ crds.yaml   2021-08-12 15:00:34.561085304 +0800
@@ -2202,7 +2202,6 @@ spec:
                               type: object
                           type: object
                         type: array
-                        x-kubernetes-preserve-unknown-fields: true
                     required:
                       - name
                     type: object
thbkrkr commented 3 years ago

I don’t think we need to add x-kubernetes-preserve-unknown-fields to volumeClaimTemplates

I forgot this one; it might be a good option. It avoids the bug and also the obscure error that we get when removing type: array.

Test with an object instead of an array for the volumeClaimTemplates

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.14.0
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
    volumeClaimTemplates:
      metadata:
        name: elasticsearch-data
      spec:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
        storageClassName: standard

k8s 1.16 with kubectl > 1.16 (where the bug occurs)

> k version --short                                                            
Client Version: v1.17.17
Server Version: v1.16.15

# CRDs => the bug occurs
error: SchemaError(co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates): array should have exactly one sub-item

# CRDs without type:array => obscure error
Error from server: error when creating "q.yml": admission webhook "elastic-es-validation-v1.k8s.elastic.co" denied the request: v1.Elasticsearch.Spec: v1.ElasticsearchSpec.NodeSets: []v1.NodeSet: v1.NodeSet.VolumeClaimTemplates: []v1.PersistentVolumeClaim: decode slice: expect [ or n, but found {, error found in #10 byte of ...|mplates":{"metadata"|..., bigger context ...|count":1,"name":"default","volumeClaimTemplates":{"metadata":{"name":"elasticsearch-data"},"spec":{"|...

# CRDs without x-kubernetes-preserve-unknown-fields:true
error: error validating "q.yml": error validating data: ValidationError(Elasticsearch.spec.nodeSets[0].volumeClaimTemplates): invalid type for co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates: got "map", expected "array"; if you choose to ignore these errors, turn validation off with --validate=false

k8s 1.19

> k version --short
Client Version: v1.20.7
Server Version: v1.19.9-gke.1900

# CRDs
The Elasticsearch "quickstart" is invalid: spec.nodeSets.volumeClaimTemplates: Invalid value: "object": spec.nodeSets.volumeClaimTemplates in body must be of type array: "object"

# CRDs without type:array => obscure error
Error from server: error when creating "q.yml": admission webhook "elastic-es-validation-v1.k8s.elastic.co" denied the request: v1.Elasticsearch.Spec: v1.ElasticsearchSpec.NodeSets: []v1.NodeSet: v1.NodeSet.VolumeClaimTemplates: []v1.PersistentVolumeClaim: decode slice: expect [ or n, but found {, error found in #10 byte of ...|mplates":{"metadata"|..., bigger context ...|count":1,"name":"default","volumeClaimTemplates":{"metadata":{"name":"elasticsearch-data"},"spec":{"|...

# CRDs without x-kubernetes-preserve-unknown-fields:true
error: error validating "q.yml": error validating data: ValidationError(Elasticsearch.spec.nodeSets[0].volumeClaimTemplates): invalid type for co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates: got "map", expected "array"; if you choose to ignore these errors, turn validation off with --validate=false

Test with an invalid field spex instead of spec

apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: quickstart
spec:
  version: 7.14.0
  nodeSets:
  - name: default
    count: 1
    config:
      node.store.allow_mmap: false
    volumeClaimTemplates:
    - metadata:
        name: elasticsearch-data
      spex:
        accessModes:
        - ReadWriteOnce
        resources:
          requests:
            storage: 5Gi
        storageClassName: standard

k8s 1.16 with kubectl > 1.16 (where the bug occurs)

> k version --short                                                            
Client Version: v1.17.17
Server Version: v1.16.15

# CRDs => the bug occurs
error: SchemaError(co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates): array should have exactly one sub-item

# CRDs without type:array
Error from server (Elasticsearch.elasticsearch.k8s.elastic.co "quickstart" is invalid: spex: Invalid value: "spex": spex field found in the kubectl.kubernetes.io/last-applied-configuration annotation is unknown. This is often due to incorrect indentation in the manifest.): error when creating "q.yml": admission webhook "elastic-es-validation-v1.k8s.elastic.co" denied the request: Elasticsearch.elasticsearch.k8s.elastic.co "quickstart" is invalid: spex: Invalid value: "spex": spex field found in the kubectl.kubernetes.io/last-applied-configuration annotation is unknown. This is often due to incorrect indentation in the manifest.

# CRDs without x-kubernetes-preserve-unknown-fields:true
error: error validating "q.yml": error validating data: ValidationError(Elasticsearch.spec.nodeSets[0].volumeClaimTemplates[0]): unknown field "spex" in co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates; if you choose to ignore these errors, turn validation off with --validate=false

k8s 1.19

> k version --short                                                         
Client Version: v1.20.7
Server Version: v1.19.9-gke.1900

# CRDs
Error from server (Elasticsearch.elasticsearch.k8s.elastic.co "quickstart" is invalid: spex: Invalid value: "spex": spex field found in the kubectl.kubernetes.io/last-applied-configuration annotation is unknown. This is often due to incorrect indentation in the manifest.): error when creating "q.yml": admission webhook "elastic-es-validation-v1.k8s.elastic.co" denied the request: Elasticsearch.elasticsearch.k8s.elastic.co "quickstart" is invalid: spex: Invalid value: "spex": spex field found in the kubectl.kubernetes.io/last-applied-configuration annotation is unknown. This is often due to incorrect indentation in the manifest.

# CRDs without type:array
Error from server (Elasticsearch.elasticsearch.k8s.elastic.co "quickstart" is invalid: spex: Invalid value: "spex": spex field found in the kubectl.kubernetes.io/last-applied-configuration annotation is unknown. This is often due to incorrect indentation in the manifest.): error when creating "q.yml": admission webhook "elastic-es-validation-v1.k8s.elastic.co" denied the request: Elasticsearch.elasticsearch.k8s.elastic.co "quickstart" is invalid: spex: Invalid value: "spex": spex field found in the kubectl.kubernetes.io/last-applied-configuration annotation is unknown. This is often due to incorrect indentation in the manifest

# CRDs without x-kubernetes-preserve-unknown-fields:true
error: error validating "q.yml": error validating data: ValidationError(Elasticsearch.spec.nodeSets[0].volumeClaimTemplates[0]): unknown field "spex" in co.elastic.k8s.elasticsearch.v1.Elasticsearch.spec.nodeSets.volumeClaimTemplates; if you choose to ignore these errors, turn validation off with --validate=false
morningspace commented 3 years ago

I did not see this issue on Kubernetes 1.21:

# kubectl version
Client Version: version.Info{Major:"1", Minor:"17", GitVersion:"v1.17.11", GitCommit:"ea5f00d93211b7c80247bf607cfa422ad6fb5347", GitTreeState:"clean", BuildDate:"2020-08-13T15:20:25Z", GoVersion:"go1.13.15", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.2", GitCommit:"092fbfbf53427de67cac1e9fa54aaa09a28371d7", GitTreeState:"clean", BuildDate:"2021-07-12T20:40:20Z", GoVersion:"go1.16.5", Compiler:"gc", Platform:"linux/amd64"}

But on Kubernetes 1.19, it can be reproduced.

barkbay commented 3 years ago

We are still in the process of evaluating a fix and a workaround. A good candidate would be to remove the x-kubernetes-preserve-unknown-fields field from the volumeClaimTemplates node.

From a user perspective it would mean:

One caveat is that once the CRD has been installed, it does not seem possible to upgrade it anymore, either with Helm or by using kubectl replace. A workaround is to manually patch the CRD with the following command before proceeding with the upgrade from 1.7.0:

kubectl patch crd elasticsearches.elasticsearch.k8s.elastic.co --type json -p='[{"op": "remove", "path": "/spec/versions/0/schema/openAPIV3Schema/properties/spec/properties/nodeSets/items/properties/volumeClaimTemplates/x-kubernetes-preserve-unknown-fields"}]'
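To make the effect of that JSON Patch concrete, here is a small Python sketch (the CRD structure is abbreviated to the relevant path only) that performs the same remove operation on a plain dict:

```python
# Illustration of what the JSON Patch above removes, applied to a plain dict
# standing in for the CRD (structure abbreviated; only the relevant path kept).
crd = {
    "spec": {
        "versions": [
            {"schema": {"openAPIV3Schema": {"properties": {
                "spec": {"properties": {
                    "nodeSets": {"items": {"properties": {
                        "volumeClaimTemplates": {
                            "type": "array",
                            "x-kubernetes-preserve-unknown-fields": True,
                        }
                    }}}
                }}
            }}}}
        ]
    }
}

# Walk the JSON Pointer and remove the final key, which is what
# `kubectl patch --type json` with op "remove" does server-side.
path = ("/spec/versions/0/schema/openAPIV3Schema/properties/spec/properties/"
        "nodeSets/items/properties/volumeClaimTemplates/"
        "x-kubernetes-preserve-unknown-fields")
parts = path.strip("/").split("/")
node = crd
for p in parts[:-1]:
    # numeric segments index into lists, everything else into dicts
    node = node[int(p)] if isinstance(node, list) else node[p]
del node[parts[-1]]

vct = (crd["spec"]["versions"][0]["schema"]["openAPIV3Schema"]["properties"]
       ["spec"]["properties"]["nodeSets"]["items"]["properties"]
       ["volumeClaimTemplates"])
print(vct)
```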

barkbay commented 3 years ago

@morningspace @travisghansen The fix for K8s 1.19 is actually available since 1.19.5. Could you please provide the exact version of K8s you're using (via kubectl version)? Thanks!

travisghansen commented 3 years ago

Interesting! Thanks for doing the research on that front. I can try updating to a newer version of 1.19 and let you know how it goes.

kubectl version
Client Version: version.Info{Major:"1", Minor:"21", GitVersion:"v1.21.3", GitCommit:"ca643a4d1f7bfe34773c74f79527be4afd95bf39", GitTreeState:"clean", BuildDate:"2021-07-15T21:04:39Z", GoVersion:"go1.16.6", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"19", GitVersion:"v1.19.4", GitCommit:"d360454c9bcd1634cf4cc52d1867af5491dc9c5f", GitTreeState:"clean", BuildDate:"2020-11-11T13:09:17Z", GoVersion:"go1.15.2", Compiler:"gc", Platform:"linux/amd64"}
travisghansen commented 3 years ago

That seems to have done the trick. Let me know if you want me to run any other tests before I go ahead and move the clusters to 1.20 and 1.21 shortly after that.

As an aside, would it be possible to remove the empty properties: {} blocks from the CRDs? ArgoCD detects it as a diff as the cluster apparently wipes the fields out completely.

barkbay commented 3 years ago

As an aside, would it be possible to remove the empty properties: {} blocks from the CRDs? ArgoCD detects it as a diff as the cluster apparently wipes the fields out completely.

Unfortunately we have to insert those blocks in the Elasticsearch CRD for backward compatibility, see https://github.com/elastic/cloud-on-k8s/pull/4679

barkbay commented 3 years ago

Fixed in 1.7.1, see https://www.elastic.co/guide/en/cloud-on-k8s/current/release-highlights-1.7.1.html