Closed ShellyXueHan closed 3 years ago
There is no open URL from NetApp that we can make use of directly. @tmorik is going to figure out more details from Trident.
I have confirmed with the NetApp team that there is no external URL to check (only internal).
However, NetApp status metrics such as Aggregate Size or Aggregate_Percentage are collected for each cluster by Nagios and are already being sent to SysDiag. If Nagios fails to collect those metrics, Platform-ops will get a notification.
As per the conversation with Shelly, the up/down status of NetApp will be added to the metrics collected by Nagios (and its notifications).
Thanks @tmorik! I would also need an open endpoint (maybe from Nagios API?) where I can query the status of the NetApp storage service directly.
I'll ask the Nagios monitoring team about an endpoint and API key, as I'm not sure whether they open up API access to external clients.
@ShellyXueHan, I talked with the Monitoring team; unfortunately, the Nagios API is not open to external clients yet. They are preparing a gateway app called "Nagios Fusion" for this purpose, but it's still in the testing phase and not in production yet.
Anyhow, for now we need to look for another option...
Does it work if we create a metrics CSV file somewhere, or push metrics to your app (Cerberus?) or some kind of a DB? I'm just throwing out ideas without knowledge of your app.
@tmorik sorry I missed your message here... but thanks for the effort and info!
So Cerberus has a feature for creating custom Python code to do tasks, such as making HTTP requests, running oc commands, etc. It doesn't really need to know the whole set of NetApp metrics; as long as there's somewhere that can host a signal for the storage service status, then all good! Let's have a chat on this when you are back!
So far what we've found out is that there is no accessible status endpoint for netapp storage service that Cerberus can use.
And here's another idea: create a custom check from Cerberus that spins up a pod and mounts a PVC to it. What do you think @wmhutchison @mitovskaol ? Chatted with @tmorik about this just now. If you think it's reasonable, we'll go ahead and try it out :)
Here are some detailed steps for the Cerberus custom check:
@tmorik mentioned there's a similar process that we run in OCP3. He'll find the code base and one of us can test it on a lab cluster!
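Since Cerberus custom checks are Python, the pod+PVC idea could be sketched roughly like this. Everything here is a hypothetical illustration: the namespace, resource names, image, and test command are placeholders, and actually applying the manifests requires a logged-in oc session on the cluster.

```python
# Hypothetical sketch of a Cerberus-style check that provisions a PVC and a
# pod that mounts it. Namespace, names, and image are placeholders.
import json
import subprocess

NAMESPACE = "storage-health-check"  # placeholder namespace

def build_pvc_manifest(name, storage_class, size="1Gi"):
    """Build a minimal PVC manifest for the given storage class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }

def build_test_pod_manifest(name, pvc_name):
    """Build a pod that mounts the PVC and writes/reads a test file."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "writer",
                "image": "registry.access.redhat.com/ubi8/ubi-minimal",
                "command": ["sh", "-c",
                            "echo ok > /data/test.txt && cat /data/test.txt"],
                "volumeMounts": [{"name": "data", "mountPath": "/data"}],
            }],
            "volumes": [{"name": "data",
                         "persistentVolumeClaim": {"claimName": pvc_name}}],
        },
    }

def apply_manifest(manifest):
    """Apply a manifest via oc; requires a logged-in cluster session."""
    subprocess.run(["oc", "apply", "-f", "-"],
                   input=json.dumps(manifest).encode(), check=True)
```

The check would then apply one PVC per storage class, run the pod, and report failure if the pod doesn't complete successfully.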
I have checked some of the docs and asked questions internally.
Checking NetApp/Trident status for each cluster would simply mean checking pod status in the openshift-bcgov-trident namespace, like below:
[root@mcs-klab-util]# oc -n openshift-bcgov-trident get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
trident-csi-24thc 2/2 Running 0 81d 142.34.194.5 mcs-klab-master-01.dmz <none> <none>
trident-csi-4djvn 2/2 Running 0 81d 142.34.194.9 mcs-klab-infra-02.dmz <none> <none>
trident-csi-676c588cfc-n4prk 6/6 Running 0 74d 10.97.8.8 mcs-klab-infra-01.dmz <none> <none>
trident-csi-7xm2w 2/2 Running 0 81d 142.34.194.14 mcs-klab-app-04.dmz <none> <none>
trident-csi-df8s5 2/2 Running 0 81d 142.34.194.13 mcs-klab-app-03.dmz <none> <none>
trident-csi-h6lzn 2/2 Running 0 81d 142.34.194.10 mcs-klab-infra-03.dmz <none> <none>
trident-csi-hcwzg 2/2 Running 0 81d 142.34.194.7 mcs-klab-master-03.dmz <none> <none>
trident-csi-kn75w 2/2 Running 0 81d 142.34.194.8 mcs-klab-infra-01.dmz <none> <none>
trident-csi-p52jw 2/2 Running 0 81d 142.34.194.12 mcs-klab-app-02.dmz <none> <none>
trident-csi-xgfpt 2/2 Running 0 81d 142.34.194.11 mcs-klab-app-01.dmz <none> <none>
trident-csi-zrwmw 2/2 Running 0 81d 142.34.194.6 mcs-klab-master-02.dmz <none> <none>
trident-operator-6cdc875d88-h84xp 1/1 Running 0 74d 10.97.6.22 mcs-klab-infra-03.dmz <none> <none>
The DaemonSet runs one pod on each host and manages the mounting and unmounting of volumes on that host; each pod should be in the Running state. I think this check would be a simple up/down check for NetApp health status.
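As a rough illustration of that up/down check, a small helper could parse `oc get pods -o json` output and require every pod in the namespace to be Running. This is a sketch, not an existing Cerberus check; the helper takes already-parsed JSON so the logic can be exercised without a cluster.

```python
# Sketch: decide up/down from `oc -n openshift-bcgov-trident get pods -o json`.
import json
import subprocess

def all_trident_pods_running(pods):
    """Given parsed `oc get pods -o json` output, return True only if
    every pod reports phase Running."""
    items = pods.get("items", [])
    if not items:
        return False  # no pods at all also counts as down
    return all(p.get("status", {}).get("phase") == "Running" for p in items)

def check_cluster():
    """Fetch live pod status; requires a logged-in oc session."""
    out = subprocess.check_output(
        ["oc", "-n", "openshift-bcgov-trident", "get", "pods", "-o", "json"])
    return all_trident_pods_running(json.loads(out))
```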
And, as we discussed yesterday, creating a pod/job with PVCs would be a more thorough, user-side usability check, I think.
These are the notes from our internal discussion: both the netapp-block-standard and netapp-file-standard storage classes should be checked, as they use different access methods. Checking whether the mount is successful could be done by verifying a checksum with a simple shell script. Checksum example:
#for file storage
echo "Hello world" > /test-file/test.txt
sha256sum /test-file/test.txt
1894a19c85ba153acbf743ac4e43fc004c891604b26f8c69e1e83ea2afc7c48f /test-file/test.txt
echo "1894a19c85ba153acbf743ac4e43fc004c891604b26f8c69e1e83ea2afc7c48f /test-file/test.txt" | sha256sum --check --strict -
test.txt: OK
#for block storage
echo "Hello world" > /test-block/test.txt
sha256sum /test-block/test.txt
1894a19c85ba153acbf743ac4e43fc004c891604b26f8c69e1e83ea2afc7c48f /test-block/test.txt
echo "1894a19c85ba153acbf743ac4e43fc004c891604b26f8c69e1e83ea2afc7c48f /test-block/test.txt" | sha256sum --check --strict -
test.txt: OK
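The same round-trip could be done from a Python check instead of a shell script. A minimal sketch (the mount path is a placeholder; the expected digest for "Hello world" plus a trailing newline is the same 1894a19c... value as in the shell example above):

```python
# Python equivalent of the shell checksum check: write a known payload to the
# mounted volume, read it back, and compare SHA-256 digests.
import hashlib
from pathlib import Path

def write_and_verify(mount_path):
    """Write a test file under the mount and verify its checksum round-trips."""
    payload = b"Hello world\n"  # matches `echo "Hello world"` in the shell example
    expected = hashlib.sha256(payload).hexdigest()
    test_file = Path(mount_path) / "test.txt"
    test_file.write_bytes(payload)
    actual = hashlib.sha256(test_file.read_bytes()).hexdigest()
    return actual == expected
```

Running it once against each mount (e.g. /test-file and /test-block) covers both storage classes.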
Also, this document describes how to run a check when we add/patch nodes in a cluster (create a project, add an app with a PVC, and do a write test on the newly created volume):
https://github.com/bcgov-c/rhcos-ignition-builder/blob/master/README.md#add-more-nodes-later
(see the part after the "To test out the node the following example..." sentence.)
While checking other docs, I found this one:
[root@mcs-klab-util ~]# tridentctl -n openshift-bcgov-trident get backend
+-----------------------+-------------------+--------------------------------------+--------+---------+
| NAME | STORAGE DRIVER | UUID | STATE | VOLUMES |
+-----------------------+-------------------+--------------------------------------+--------+---------+
| netapp-file-standard | ontap-nas-economy | 1e04cba6-0d13-472f-8580-4c1a32f68a29 | online | 34 |
| netapp-block-standard | ontap-san-economy | 27729401-2fc2-4881-ac8c-08cc76774652 | online | 10 |
| netapp-block-extended | ontap-san | a091cb8c-48ce-4f16-a78a-d86a2c493b8b | online | 9 |
| netapp-file-backup | ontap-nas-economy | 0f0bc538-63ac-4197-b46c-7aba9e74ba0e | online | 3 |
| netapp-file-extended | ontap-nas | f398925a-3648-45d3-97a4-bbe6d73c9bb7 | online | 1 |
+-----------------------+-------------------+--------------------------------------+--------+---------+
Maybe this is a more useful way to check NetApp status? We can just check whether each storage type is "online".
https://netapp-trident.readthedocs.io/en/stable-v18.07/reference/tridentctl.html
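If we go the tridentctl route, the "online" check could look something like the sketch below. I'm assuming tridentctl can emit the backend list as JSON with per-backend "name" and "state" fields (matching the table columns above); the exact output shape should be verified against the installed Trident version.

```python
# Sketch: flag any Trident backend whose state is not "online".
# Assumes parsed JSON shaped like {"items": [{"name": ..., "state": ...}, ...]};
# verify this against the installed Trident version before relying on it.
def offline_backends(backends):
    """Return the names of backends whose state is anything but 'online'."""
    return [b.get("name", "<unknown>")
            for b in backends.get("items", [])
            if b.get("state") != "online"]
```

An empty return list would mean all storage types are healthy by this measure.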
When looking at docs, it's best to use the ones for the currently installed version; v18.07 is pretty old. Trident is moving away from tridentctl in favour of objects in the namespace, i.e. oc -n openshift-bcgov-trident get TridentBackend.
The state of a backend is just based on if the Trident controller pod can access the HTTP API on the NetApp, and may not fully indicate the status of the backend. Though an offline backend would prevent new PVCs from creating volumes.
I still think an end-to-end test with a job would be best.
Okay, good to know! So how about this: for this sprint, let's simply use oc -n openshift-bcgov-trident get TridentBackend; we just need to set up the RBAC for now.
And for the next round, we can start looking into the monitoring job: https://app.zenhub.com/workspaces/platform-experience-5bb7c5ab4b5806bc2beb9d15/issues/bcdevops/developer-experience/1084
OK, we have made a custom role lr-bcdevops-trident-view in the openshift-bcgov-trident namespace. Your service account system:serviceaccount:openshift:bcdevops-admin is bound to it.
Testing from my side is working ok:
# oc --as=system:serviceaccount:openshift:bcdevops-admin -n openshift-bcgov-trident get TridentBackends
NAME BACKEND BACKEND UUID
tbe-8fxpw netapp-file-extended eda3f4dd-65a3-43c9-8a51-c6f9753f5f57
tbe-hz8sh netapp-file-backup b4671c90-5869-43c7-ad6e-d1e3483f48d1
tbe-jg5rz netapp-file-standard 8553e386-7ad5-4299-b282-92bbca888bf9
tbe-jjw8v netapp-block-standard cb8b5c3f-22b4-4f7e-a7e6-2c6819ea8882
tbe-tz672 netapp-block-extended c4382864-3f38-459c-9395-10ca84bf9c42
Can you please run oc -n openshift-bcgov-trident get TridentBackend from your end also? It's been set up on KLAB and CLAB.
Describe the issue: We want to monitor NetApp access according to https://miro.com/app/board/o9J_kgyjm_k=/
Cerberus can have custom checks set up, so we need to know what endpoint is available from NetApp, and what the SLIs are.
Definition of done