aws-samples / eks-anywhere-addons

https://aws-samples.github.io/eks-anywhere-addons/
MIT No Attribution

adding NEWR helm installation #59

Closed anshrma closed 1 year ago

anshrma commented 1 year ago

Issue #, if available:

Description of changes:

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

anshrma commented 1 year ago

hi @elamaran11 - Adding the TestJob.

elamaran11 commented 1 year ago

@shapirov103 The installation works fine on all 4 variants of EKS-A: vSphere, BareMetal, Snow and Nutanix. The test job works fine on vSphere and BareMetal but fails on Snow and Nutanix. However, I can see data from all 4 clusters in the New Relic console. We should be good to pass this PR.

shapirov103 commented 1 year ago

@elamaran11 If the test job fails in a couple of environments, then approval means New Relic should be marked as FAIL for the environments where it fails and SUCCESS where it works. We have no capacity to manually validate the UI across environments; the whole point is to have the full flow automated. Let's also agree on the continuity of this contribution: who will maintain it, when, and how. Since this is not a point-in-time validation (it must be continuous), we expect partners to support their products, including version upgrades, configuration upgrades, and responses to issues (if any).

elamaran11 commented 1 year ago

Agreed, @shapirov103. @anshrma, please check the feedback from Mikhail. We need this to work on all envs.

elamaran11 commented 1 year ago

@shapirov103 It now works on Snow and Nutanix after introducing a 60s sleep in the test-container. The problem is slowness in the admission webhook on the control plane side. I'm good with this now. Please check from your end.
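
For reference, the fixed 60s sleep described above could also be written as a bounded poll, so the job proceeds as soon as the data is available and still gives up after a deadline. This is only a minimal sketch under stated assumptions, not what this PR's test job does: `wait_until` and its `check` callable are hypothetical names, and `check` stands in for whatever readiness test is used (for example, the New Relic query returning a valid result).

```python
import time
from typing import Callable


def wait_until(
    check: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call check() repeatedly until it returns True or timeout_s elapses.

    `check` is a hypothetical stand-in for the real readiness test.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A caller would pass its own check, e.g. `wait_until(lambda: query_returns_result(), timeout_s=120)`, where `query_returns_result` is again a placeholder. The trade-off versus a fixed sleep is a slightly more complex test script in exchange for shorter runs on fast environments.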

anshrma commented 1 year ago

@elamaran11 -

  1. Clustername: This does not need to be fixed or addressed now, as the tests look for a pod name directly and the intention is to have this in place via automation, not via the UI.
  2. Tightened the SA used by the test job with the latest commit.

elamaran11 commented 1 year ago

  1. Clustername: @anshrma Please remove this field if we don't need it for the conformance test job.

elamaran11 commented 1 year ago

Hi @anshrma The installation now works fine on vSphere and BareMetal but fails on Local Clusters and Snow. Remember that in Local Clusters the CP is NotReady by design. Please see the details from Local Clusters:

ubuntu@ip-10-0-0-178:~/eks-anywhere-conformance-testing$ kubectl get all -n newrelic 
NAME                                                            READY   STATUS              RESTARTS       AGE
pod/newrelic-newrelic-kube-state-metrics-78f4849688-q5j96       1/1     Running             0              15m
pod/newrelic-newrelic-nri-kube-events-6bf76bddf7-7zkcz          2/2     Running             0              15m
pod/newrelic-newrelic-nri-metadata-injection-6b7b47669d-88m8g   0/1     Pending             0              15m
pod/newrelic-newrelic-nrk8s-controlplane-4k9pm                  1/2     CrashLoopBackOff    12 (38s ago)   15m
pod/newrelic-newrelic-nrk8s-controlplane-mqwfq                  1/2     CrashLoopBackOff    12 (62s ago)   15m
pod/newrelic-newrelic-nrk8s-controlplane-qcf78                  1/2     CrashLoopBackOff    12 (56s ago)   15m
pod/newrelic-newrelic-nrk8s-ksm-85d6bfc46b-wqdng                0/2     ContainerCreating   0              15m
pod/newrelic-newrelic-nrk8s-kubelet-2svq7                       0/2     ContainerCreating   0              15m
pod/newrelic-newrelic-nrk8s-kubelet-5k5r5                       0/2     ContainerCreating   0              15m
pod/newrelic-newrelic-nrk8s-kubelet-62kjv                       0/2     Pending             0              15m
pod/newrelic-newrelic-nrk8s-kubelet-cs7nq                       0/2     Pending             0              15m
pod/newrelic-newrelic-nrk8s-kubelet-lg2wd                       0/2     ContainerCreating   0              15m
pod/newrelic-newrelic-nrk8s-kubelet-nhjkk                       0/2     Pending             0              15m
pod/newrelic-testjob-001-9rfcp                                  1/1     Running             0              2m51s
pod/newrelic-testjob-001-cs5m8                                  0/1     Error               0              9m34s

NAME                                               TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/newrelic-newrelic-kube-state-metrics       ClusterIP   172.20.134.13   <none>        8080/TCP   15m
service/newrelic-newrelic-nri-metadata-injection   ClusterIP   172.20.153.12   <none>        443/TCP    15m

NAME                                                  DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/newrelic-newrelic-nrk8s-controlplane   3         3         0       3            0           <none>          15m
daemonset.apps/newrelic-newrelic-nrk8s-kubelet        6         6         0       6            0           <none>          15m

NAME                                                       READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/newrelic-newrelic-kube-state-metrics       1/1     1            1           15m
deployment.apps/newrelic-newrelic-nri-kube-events          1/1     1            1           15m
deployment.apps/newrelic-newrelic-nri-metadata-injection   0/1     1            0           15m
deployment.apps/newrelic-newrelic-nrk8s-ksm                0/1     1            0           15m

NAME                                                                  DESIRED   CURRENT   READY   AGE
replicaset.apps/newrelic-newrelic-kube-state-metrics-78f4849688       1         1         1       15m
replicaset.apps/newrelic-newrelic-nri-kube-events-6bf76bddf7          1         1         1       15m
replicaset.apps/newrelic-newrelic-nri-metadata-injection-6b7b47669d   1         1         0       15m
replicaset.apps/newrelic-newrelic-nrk8s-ksm-85d6bfc46b                1         1         0       15m

NAME                             SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/newrelic-testjob   10 10 * * *   False     0        <none>          11m

NAME                             COMPLETIONS   DURATION   AGE
job.batch/newrelic-testjob-001   0/1           9m34s      9m34s

ubuntu@ip-10-0-0-178:~/eks-anywhere-conformance-testing$ k logs newrelic-newrelic-nrk8s-controlplane-4k9pm -n newrelic 
Defaulted container "controlplane" out of: controlplane, forwarder
time="2023-04-18T21:11:08Z" level=info msg="Waiting for agent container to be ready..."
time="2023-04-18T21:12:39Z" level=error msg="creating integration wrapper: applying option: timeout waiting for agent: probe timed out after 1m30s"
ubuntu@ip-10-0-0-178:~/eks-anywhere-conformance-testing$ k logs newrelic-testjob-001-cs5m8 -n newrelic 
Defaulted container "test-container" out of: test-container, kubectl (init)
Cloning into 'newrelic-integration-e2e-action'...
go: downloading github.com/sirupsen/logrus v1.9.0
go: downloading github.com/newrelic/newrelic-client-go v0.91.0
go: downloading gopkg.in/yaml.v3 v3.0.1
go: downloading golang.org/x/sys v0.0.0-20220715151400-c0bba94af5f8
go: downloading github.com/imdario/mergo v0.3.12
go: downloading github.com/google/go-querystring v1.1.0
go: downloading github.com/hashicorp/go-retryablehttp v0.7.0
go: downloading github.com/tomnomnom/linkheader v0.0.0-20180905144013-02ca5825eb80
go: downloading github.com/valyala/fastjson v1.6.3
go: downloading github.com/hashicorp/go-cleanhttp v0.5.1
go: downloading github.com/stretchr/testify v1.8.0
go: downloading github.com/davecgh/go-spew v1.1.1
go: downloading github.com/pmezard/go-difflib v1.0.0
time="2023-04-18T21:08:59Z" level=info msg="running e2e"
time="2023-04-18T21:08:59Z" level=debug msg="parsing the content of the spec file"
time="2023-04-18T21:08:59Z" level=debug msg="return with settings"
time="2023-04-18T21:08:59Z" level=debug msg="validating the spec definition"
time="2023-04-18T21:08:59Z" level=debug msg="[scenario]: This scenario will verify that metrics from a k8s Cluster are correctly collected without privileges\n, [Tag]: newrelic-testjob-001-cs5m8"
time="2023-04-18T21:09:00Z" level=warning msg="Error detected" iteration=0
time="2023-04-18T21:09:00Z" level=error msg="querying: query did not return a valid result: SELECT latest(k8s.pod.startTime) FROM Metric SINCE 5 MINUTES AGO WHERE k8s.podName = 'newrelic-testjob-001-cs5m8'"
time="2023-04-18T21:09:05Z" level=fatal msg="after 1 attempts, last errors: [querying: query did not return a valid result: SELECT latest(k8s.pod.startTime) FROM Metric SINCE 5 MINUTES AGO WHERE k8s.podName = 'newrelic-testjob-001-cs5m8']"
exit status 1
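
To make output like the `kubectl get all` listing above easier to triage, the pod rows can be filtered down to the ones that are not healthy. The following is a rough sketch, not part of the PR: `unhealthy_pods` is a hypothetical helper that assumes the default column layout (`NAME READY STATUS ...`) and treats a pod as healthy only when it is `Completed`, or `Running` with all containers ready. The sample rows are copied from the listing above.

```python
from typing import List, Tuple


def unhealthy_pods(table: str) -> List[Tuple[str, str]]:
    """Return (pod name, status) for pods that are neither Completed
    nor Running with every container ready."""
    result = []
    for line in table.splitlines():
        fields = line.split()
        # Pod rows look like: pod/<name> <ready>/<total> <status> ...
        if len(fields) < 3 or not fields[0].startswith("pod/"):
            continue
        name = fields[0][len("pod/"):]
        ready_count, _, total = fields[1].partition("/")
        status = fields[2]
        if status == "Completed":
            continue
        if status != "Running" or ready_count != total:
            result.append((name, status))
    return result


sample = """\
NAME                                                            READY   STATUS              RESTARTS       AGE
pod/newrelic-newrelic-kube-state-metrics-78f4849688-q5j96       1/1     Running             0              15m
pod/newrelic-newrelic-nri-metadata-injection-6b7b47669d-88m8g   0/1     Pending             0              15m
pod/newrelic-newrelic-nrk8s-controlplane-4k9pm                  1/2     CrashLoopBackOff    12 (38s ago)   15m
"""

for pod_name, pod_status in unhealthy_pods(sample):
    print(f"{pod_name}: {pod_status}")
```

In practice, structured output such as `kubectl get pods -o json` is more robust than parsing the table text; the text parser is used here only because it matches what was pasted into the thread.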

Please see below for Snow. The installation is not working and the job also fails:

ubuntu@ip-34-223-14-199:~/eks-anywhere-conformance-testing$ kubectl logs newrelic-testjob-001-4pqzl -n newrelic
Error from server (BadRequest): container "test-container" in pod "newrelic-testjob-001-4pqzl" is waiting to start: PodInitializing
ubuntu@ip-34-223-14-199:~/eks-anywhere-conformance-testing$ kubectl get all -n newrelic
NAME                                                                  READY   STATUS        RESTARTS   AGE
pod/newrelic-newrelic-nri-metadata-injection-admission-create-srckw   0/1     Completed     0          59m
pod/newrelic-testjob-001-4pqzl                                        0/1     Terminating   0          30m

NAME                             SCHEDULE      SUSPEND   ACTIVE   LAST SCHEDULE   AGE
cronjob.batch/newrelic-testjob   10 10 * * *   False     0        <none>          33m

NAME                                                                  COMPLETIONS   DURATION   AGE
job.batch/newrelic-newrelic-nri-metadata-injection-admission-create   1/1           34m        60m
job.batch/newrelic-testjob-001                                        0/1           30m        30m

elamaran11 commented 1 year ago

The good news, though, is that the solution is validated on vSphere and BareMetal.

elamaran11 commented 1 year ago

Hi @anshrma After some further work on our environment, we can confirm that the EKS-A Conformance framework works fine for EKS-A on SnowBall. Please see your test job results below.

I would recommend the following next steps:

  1. Submit the NewR product only in the vSphere, BM and SnowBall folders of the addons repo, rather than in the common repo, so we can confirm NewR as validated in the Addons Validated partners folder. This will move to the service team docs soon.
  2. Please have the NewR partner add a comment on your PR to say they will take care of ongoing maintenance.

Once the above is done, we will do another round and we can merge. Looking forward.

[ec2-user@ip-34-223-14-193 eksa-conformance-snow]$ kubectl logs newrelic-testjob-01-f7k5s -n newrelic -f
Cloning into 'newrelic-integration-e2e-action'...
go: downloading github.com/sirupsen/logrus v1.9.0
go: downloading github.com/newrelic/newrelic-client-go v0.91.0
go: downloading gopkg.in/yaml.v3 v3.0.1
go: downloading golang.org/x/sys v0.0.0-20220715151400-c0bba94af5f8
go: downloading github.com/imdario/mergo v0.3.12
go: downloading github.com/hashicorp/go-retryablehttp v0.7.0
go: downloading github.com/google/go-querystring v1.1.0
go: downloading github.com/tomnomnom/linkheader v0.0.0-20180905144013-02ca5825eb80
go: downloading github.com/valyala/fastjson v1.6.3
go: downloading github.com/stretchr/testify v1.8.0
go: downloading github.com/hashicorp/go-cleanhttp v0.5.1
go: downloading github.com/davecgh/go-spew v1.1.1
go: downloading github.com/pmezard/go-difflib v1.0.0
time="2023-05-19T19:44:14Z" level=info msg="running e2e"
time="2023-05-19T19:44:14Z" level=debug msg="parsing the content of the spec file"
time="2023-05-19T19:44:14Z" level=debug msg="return with settings"
time="2023-05-19T19:44:14Z" level=debug msg="validating the spec definition"
time="2023-05-19T19:44:14Z" level=debug msg="[scenario]: This scenario will verify that metrics from a k8s Cluster are correctly collected without privileges\n, [Tag]: newrelic-testjob-01-f7k5s"
time="2023-05-19T19:44:14Z" level=info msg="execution completed successfully!"
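
Since the goal stated earlier in the thread is a fully automated flow, the pass/fail outcome can be read from the test job log itself rather than by eye. The sketch below is an illustration under assumptions, not the framework's actual logic: `job_verdict` is a hypothetical helper that assumes the `level=... msg="..."` line format seen in the logs above, treats any `error` or `fatal` line as a failure, and treats the "execution completed successfully" message as a pass. The sample lines are copied from the logs in this thread.

```python
import re
from typing import Optional

# Matches the level and message fields of lines such as:
#   time="..." level=info msg="running e2e"
LINE_RE = re.compile(r'level=(\w+)\s+msg="((?:[^"\\]|\\.)*)"')


def job_verdict(log_text: str) -> Optional[bool]:
    """Return False if an error/fatal line is present, True if the success
    message is present, and None if neither was seen."""
    verdict = None
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # e.g. "go: downloading ..." lines
        level, msg = match.group(1), match.group(2)
        if level in ("error", "fatal"):
            return False
        if level == "info" and "execution completed successfully" in msg:
            verdict = True
    return verdict


passing_log = """\
Cloning into 'newrelic-integration-e2e-action'...
time="2023-05-19T19:44:14Z" level=info msg="running e2e"
time="2023-05-19T19:44:14Z" level=info msg="execution completed successfully!"
"""

failing_log = """\
time="2023-04-18T21:08:59Z" level=info msg="running e2e"
time="2023-04-18T21:09:00Z" level=warning msg="Error detected" iteration=0
time="2023-04-18T21:09:00Z" level=error msg="querying: query did not return a valid result: SELECT latest(k8s.pod.startTime) FROM Metric SINCE 5 MINUTES AGO WHERE k8s.podName = 'newrelic-testjob-001-cs5m8'"
"""

print("passing log verdict:", job_verdict(passing_log))
print("failing log verdict:", job_verdict(failing_log))
```

The job's exit status (as in the `exit status 1` line of the failing run) is the more direct signal where it is available; parsing the log is mainly useful for reporting which step failed.
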

anshrma commented 1 year ago

@elamaran11 - For outpost (mentioned as local cluster in this thread,

elamaran11 commented 1 year ago

@anshrma The Conformance framework has successfully validated New Relic for three deployment models: EKS-A on BM, EKS-A on Snow and EKS-A on vSphere. One prerequisite for merging the PR is NEWR commenting on the PR to say they can support ongoing maintenance, using the previously shared verbiage.

elamaran11 commented 1 year ago

The GH Action failure can be ignored, as it came from an experiment. We have now disabled GH Actions for future PRs until the experiment is fully working. Approving and merging this.