Closed ShellyXueHan closed 3 years ago
There is no open URL from NetApp that we can make use of directly. @tmorik is going to figure out more details from Trident.
I have confirmed with the NetApp team that there is no external URL to check (only internal).
However, NetApp status metrics such as Aggregate Size or Aggregate_Percentage are collected for each cluster by Nagios and are already being sent to SysDiag. If Nagios fails to collect those metrics, Platform-ops will get a notification.
As per the conversation with Shelly, the up/down status of NetApp will be added to the metrics collected by Nagios (and its notifications).
Thanks @tmorik! I would also need an open endpoint (maybe from Nagios API?) where I can query the status of the NetApp storage service directly.
I'll ask the Nagios monitoring team about an endpoint and API key, as I'm not sure whether they open up API access to external clients.
@ShellyXueHan, I talked with the Monitoring team; unfortunately, the Nagios API is not open to external clients yet. They are preparing a gateway app called "Nagios Fusion" for this purpose, but it's still in the testing phase and not in production yet.
Anyhow, for now we need to look for another option...
Does it work if we create a metrics CSV file somewhere, or push metrics to your app (Cerberus?) or some kind of a DB? I'm just throwing out ideas without knowledge of your app.
@tmorik sorry I missed your message here... but thanks for the effort and info!
So Cerberus has a feature for creating custom Python code to do tasks, such as making HTTP requests, running oc commands, etc. It doesn't really need to know the whole set of NetApp metrics; as long as there's somewhere that can host a signal for the storage service status, then all good! Let's have a chat on this when you are back!
So far what we've found out is that there is no accessible status endpoint for netapp storage service that Cerberus can use.
And here's another idea: create a custom check from Cerberus that spins up a pod and mounts a PVC to it. What do you think @wmhutchison @mitovskaol ? Chatted with @tmorik about this just now. If you think it's reasonable, we'll go ahead and try it out :)
Here are some detailed steps for the Cerberus custom check:
@tmorik mentioned there's a similar process that we run in OCP3. He'll find the code base and one of us can test it on a lab cluster!
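Since Cerberus custom checks are Python, the pod+PVC idea could be sketched roughly like this. Everything here is a hypothetical illustration: the namespace, resource names, image, and test command are placeholders, and actually applying the manifests requires a logged-in oc session on the cluster.

```python
# Hypothetical sketch of a Cerberus-style check that provisions a PVC and a
# pod that mounts it. Namespace, names, and image are placeholders.
import json
import subprocess

NAMESPACE = "storage-health-check"  # placeholder namespace

def build_pvc_manifest(name, storage_class, size="1Gi"):
    """Build a minimal PVC manifest for the given storage class."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolumeClaim",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "spec": {
            "accessModes": ["ReadWriteOnce"],
            "storageClassName": storage_class,
            "resources": {"requests": {"storage": size}},
        },
    }

def build_test_pod_manifest(name, pvc_name):
    """Build a pod that mounts the PVC and writes/reads a test file."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": NAMESPACE},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{
                "name": "writer",
                "image": "registry.access.redhat.com/ubi8/ubi-minimal",
                "command": ["sh", "-c",
                            "echo ok > /data/test.txt && cat /data/test.txt"],
                "volumeMounts": [{"name": "data", "mountPath": "/data"}],
            }],
            "volumes": [{"name": "data",
                         "persistentVolumeClaim": {"claimName": pvc_name}}],
        },
    }

def apply_manifest(manifest):
    """Apply a manifest via oc; requires a logged-in cluster session."""
    subprocess.run(["oc", "apply", "-f", "-"],
                   input=json.dumps(manifest).encode(), check=True)
```

The check would then apply one PVC per storage class, run the pod, and report failure if the pod doesn't complete successfully.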
I have checked some of the docs and asked questions internally.
Checking NetApp/Trident status for each cluster would simply mean checking pod status in the openshift-bcgov-trident namespace, like below:
[root@mcs-klab-util]# oc -n openshift-bcgov-trident get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
trident-csi-24thc 2/2 Running 0 81d 142.34.194.5 mcs-klab-master-01.dmz <none> <none>
trident-csi-4djvn 2/2 Running 0 81d 142.34.194.9 mcs-klab-infra-02.dmz <none> <none>
trident-csi-676c588cfc-n4prk 6/6 Running 0 74d 10.97.8.8 mcs-klab-infra-01.dmz <none> <none>
trident-csi-7xm2w 2/2 Running 0 81d 142.34.194.14 mcs-klab-app-04.dmz <none> <none>
trident-csi-df8s5 2/2 Running 0 81d 142.34.194.13 mcs-klab-app-03.dmz <none> <none>
trident-csi-h6lzn 2/2 Running 0 81d 142.34.194.10 mcs-klab-infra-03.dmz <none> <none>
trident-csi-hcwzg 2/2 Running 0 81d 142.34.194.7 mcs-klab-master-03.dmz <none> <none>
trident-csi-kn75w 2/2 Running 0 81d 142.34.194.8 mcs-klab-infra-01.dmz <none> <none>
trident-csi-p52jw 2/2 Running 0 81d 142.34.194.12 mcs-klab-app-02.dmz <none> <none>
trident-csi-xgfpt 2/2 Running 0 81d 142.34.194.11 mcs-klab-app-01.dmz <none> <none>
trident-csi-zrwmw 2/2 Running 0 81d 142.34.194.6 mcs-klab-master-02.dmz <none> <none>
trident-operator-6cdc875d88-h84xp 1/1 Running 0 74d 10.97.6.22 mcs-klab-infra-03.dmz <none> <none>
The DaemonSet runs one pod on each host and manages the mounting and unmounting of volumes on that host; each pod should be in the Running state. I think this check would be a simple up/down check for NetApp health status.
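As a rough illustration of that up/down check, a small helper could parse `oc get pods -o json` output and require every pod in the namespace to be Running. This is a sketch, not an existing Cerberus check; the helper takes already-parsed JSON so the logic can be exercised without a cluster.

```python
# Sketch: decide up/down from `oc -n openshift-bcgov-trident get pods -o json`.
import json
import subprocess

def all_trident_pods_running(pods):
    """Given parsed `oc get pods -o json` output, return True only if
    every pod reports phase Running."""
    items = pods.get("items", [])
    if not items:
        return False  # no pods at all also counts as down
    return all(p.get("status", {}).get("phase") == "Running" for p in items)

def check_cluster():
    """Fetch live pod status; requires a logged-in oc session."""
    out = subprocess.check_output(
        ["oc", "-n", "openshift-bcgov-trident", "get", "pods", "-o", "json"])
    return all_trident_pods_running(json.loads(out))
```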
And, as we discussed yesterday, creating a pod/job with PVCs would be a more thorough, user-side usability check, I think.
These are the notes from our internal discussion: both the netapp-block-standard and netapp-file-standard storage classes should be checked, as they use different access methods. Checking whether the mount is successful could be done by verifying a checksum with a simple shell script. Checksum example:
#for file storage
echo "Hello world" > /test-file/test.txt
sha256sum /test-file/test.txt
1894a19c85ba153acbf743ac4e43fc004c891604b26f8c69e1e83ea2afc7c48f /test-file/test.txt
echo "1894a19c85ba153acbf743ac4e43fc004c891604b26f8c69e1e83ea2afc7c48f /test-file/test.txt" | sha256sum --check --strict -
test.txt: OK
#for block storage
echo "Hello world" > /test-block/test.txt
sha256sum /test-block/test.txt
1894a19c85ba153acbf743ac4e43fc004c891604b26f8c69e1e83ea2afc7c48f /test-block/test.txt
echo "1894a19c85ba153acbf743ac4e43fc004c891604b26f8c69e1e83ea2afc7c48f /test-block/test.txt" | sha256sum --check --strict -
test.txt: OK
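The same round-trip could be done from a Python check instead of a shell script. A minimal sketch (the mount path is a placeholder; the expected digest for "Hello world" plus a trailing newline is the same 1894a19c... value as in the shell example above):

```python
# Python equivalent of the shell checksum check: write a known payload to the
# mounted volume, read it back, and compare SHA-256 digests.
import hashlib
from pathlib import Path

def write_and_verify(mount_path):
    """Write a test file under the mount and verify its checksum round-trips."""
    payload = b"Hello world\n"  # matches `echo "Hello world"` in the shell example
    expected = hashlib.sha256(payload).hexdigest()
    test_file = Path(mount_path) / "test.txt"
    test_file.write_bytes(payload)
    actual = hashlib.sha256(test_file.read_bytes()).hexdigest()
    return actual == expected
```

Running it once against each mount (e.g. /test-file and /test-block) covers both storage classes.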
Also, this document describes how to run a check when we add/patch nodes in a cluster (create a project, add an app with a PVC, and do a write test on the newly created volume):
https://github.com/bcgov-c/rhcos-ignition-builder/blob/master/README.md#add-more-nodes-later
(see the part after the "To test out the node the following example..." sentence.)
While checking other docs, I found this one:
[root@mcs-klab-util ~]# tridentctl -n openshift-bcgov-trident get backend
+-----------------------+-------------------+--------------------------------------+--------+---------+
| NAME | STORAGE DRIVER | UUID | STATE | VOLUMES |
+-----------------------+-------------------+--------------------------------------+--------+---------+
| netapp-file-standard | ontap-nas-economy | 1e04cba6-0d13-472f-8580-4c1a32f68a29 | online | 34 |
| netapp-block-standard | ontap-san-economy | 27729401-2fc2-4881-ac8c-08cc76774652 | online | 10 |
| netapp-block-extended | ontap-san | a091cb8c-48ce-4f16-a78a-d86a2c493b8b | online | 9 |
| netapp-file-backup | ontap-nas-economy | 0f0bc538-63ac-4197-b46c-7aba9e74ba0e | online | 3 |
| netapp-file-extended | ontap-nas | f398925a-3648-45d3-97a4-bbe6d73c9bb7 | online | 1 |
+-----------------------+-------------------+--------------------------------------+--------+---------+
Maybe this is a more useful way to check NetApp status? We can just check whether each storage type is "online".
https://netapp-trident.readthedocs.io/en/stable-v18.07/reference/tridentctl.html
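If we go the tridentctl route, the "online" check could look something like the sketch below. I'm assuming tridentctl can emit the backend list as JSON with per-backend "name" and "state" fields (matching the table columns above); the exact output shape should be verified against the installed Trident version.

```python
# Sketch: flag any Trident backend whose state is not "online".
# Assumes parsed JSON shaped like {"items": [{"name": ..., "state": ...}, ...]};
# verify this against the installed Trident version before relying on it.
def offline_backends(backends):
    """Return the names of backends whose state is anything but 'online'."""
    return [b.get("name", "<unknown>")
            for b in backends.get("items", [])
            if b.get("state") != "online"]
```

An empty return list would mean all storage types are healthy by this measure.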
When looking at docs, it's best to use the ones for the currently installed version; v18.07 is pretty old. Trident is moving away from tridentctl in favour of objects in the namespace, i.e. oc -n openshift-bcgov-trident get TridentBackend.
The state of a backend is just based on if the Trident controller pod can access the HTTP API on the NetApp, and may not fully indicate the status of the backend. Though an offline backend would prevent new PVCs from creating volumes.
I still think an end-to-end test with a job would be best.
Okay, good to know! So how about this: for this sprint, let's simply use oc -n openshift-bcgov-trident get TridentBackend; we just need to set up the RBAC for now.
And for the next round, we can start looking into the monitoring job: https://app.zenhub.com/workspaces/platform-experience-5bb7c5ab4b5806bc2beb9d15/issues/bcdevops/developer-experience/1084
OK, we have made a custom role lr-bcdevops-trident-view in the openshift-bcgov-trident namespace. Your service account system:serviceaccount:openshift:bcdevops-admin is bound to it.
Testing from my side is working ok:
# oc --as=system:serviceaccount:openshift:bcdevops-admin -n openshift-bcgov-trident get TridentBackends
NAME BACKEND BACKEND UUID
tbe-8fxpw netapp-file-extended eda3f4dd-65a3-43c9-8a51-c6f9753f5f57
tbe-hz8sh netapp-file-backup b4671c90-5869-43c7-ad6e-d1e3483f48d1
tbe-jg5rz netapp-file-standard 8553e386-7ad5-4299-b282-92bbca888bf9
tbe-jjw8v netapp-block-standard cb8b5c3f-22b4-4f7e-a7e6-2c6819ea8882
tbe-tz672 netapp-block-extended c4382864-3f38-459c-9395-10ca84bf9c42
Can you please run oc -n openshift-bcgov-trident get TridentBackend from your end also? It's been set up on KLAB and CLAB.
Describe the issue: We want to monitor NetApp access according to https://miro.com/app/board/o9J_kgyjm_k=/
Cerberus can have custom checks set up, so we need to know what endpoint is available from NetApp, and what the SLIs are.
Definition of done