bcgov / DITP-DevOps

Digital Identity and Trust Program Team's DevOps Documentation Repository
Apache License 2.0

Set up Sysdig monitoring for the Digital Trust Shared Service (4a9599) namespaces #13

Closed WadeBarnes closed 1 year ago

WadeBarnes commented 2 years ago

While working with Certbot yesterday I ran into some resource issues in the Digital Trust Shared Service (prod) (4a9599-prod) environment. I'm looking at submitting a quota increase, but we need to do some homework first and back the request with data. Please review the documents at https://docs.developer.gov.bc.ca/request-quota-increase-for-openshift-project-set/ and set up Sysdig monitoring on the Digital Trust Shared Service (4a9599) projects so we can collect resource usage data. The first use of that data will be to tune the resources of the pod instances in those environments. Please also read the linked Resource management guidelines and Application resource tuning documents, because I want to get you involved in the tuning process.
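As a rough illustration of how collected usage data could feed the tuning step, the sketch below compares usage samples for one container against its configured request and limit. The function names, the sample values, and the choice of the 95th percentile are assumptions made for illustration; they are not taken from the linked guidelines.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize_usage(samples, request, limit):
    """Compare usage samples (e.g. CPU millicores) with a request and limit.

    All inputs must use the same unit. The ratios are only indicators:
    a p95 far below the request suggests the request could be lowered,
    and a peak close to the limit suggests the limit is tight.
    """
    p95 = percentile(samples, 95)
    peak = max(samples)
    return {
        "p95": p95,
        "peak": peak,
        "p95_over_request": p95 / request,
        "peak_over_limit": peak / limit,
    }


if __name__ == "__main__":
    # Hypothetical CPU samples in millicores for a single container.
    cpu_samples = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
    print(summarize_usage(cpu_samples, request=200, limit=400))
```

The same summary can be run per container and per environment once the dashboards export stable data.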

WadeBarnes commented 1 year ago

Review

I used the links from this resource card to access the dashboards: https://trello.com/c/Mj9EqIzq/88-sysdig-dashboards

The links to the test and prod dashboards are broken.

The links to all of the Backup dashboards for all environments are missing from the card.

PVC Dashboard:

The graph for the DTS Service Test environment is labeled BC Reg Test PVC Usage

The aries-endorser-db display for the DTS Service Prod environment is labeled Event-Processor-Log-DB-Primary PVC usage

Dev

There are 7 PVCs in the 4a9599-dev environment, for which only the details of 5 are listed. All 7 are listed in the related graph.

Test

There are 7 PVCs in the 4a9599-test environment, for which only the details of 5 are listed. All 7 are listed in the related graph.

Prod

There are 14 PVCs in the 4a9599-prod environment, for which only the details of 5 are listed. 13 of the 14 are listed in the related graph.
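The counts above come from comparing the PVCs that exist in a namespace with the ones shown on the dashboard. A minimal sketch of that comparison follows, assuming the PVC names are available as `oc get pvc -n <namespace> -o name` output and that the dashboard's names have been copied out by hand. The function names and the example PVC names are hypothetical.

```python
def parse_pvc_names(oc_output):
    """Extract PVC names from `oc get pvc -o name` style output.

    Each non-empty line looks like "persistentvolumeclaim/<name>".
    """
    names = []
    for line in oc_output.splitlines():
        line = line.strip()
        if line:
            names.append(line.split("/", 1)[-1])
    return names


def missing_pvcs(in_namespace, on_dashboard):
    """Return PVC names that exist in the namespace but not on the dashboard."""
    return sorted(set(in_namespace) - set(on_dashboard))


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    oc_output = """persistentvolumeclaim/example-db
persistentvolumeclaim/example-backup
persistentvolumeclaim/example-wallet
"""
    dashboard = ["example-db", "example-wallet"]
    print(missing_pvcs(parse_pvc_names(oc_output), dashboard))
```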

Metrics Dashboard:

Dev

aries-endorser-wallet is missing from the dashboard. Please group the mediator and endorser metrics together. For example, the aries-mediator-db metrics are not grouped with the other mediator metrics.

Test

Same comments as dev.

Prod

aries-endorser-wallet is missing from the dashboard too. Metrics for the allure and matomo services are missing from the dashboard.

WadeBarnes commented 1 year ago

The backup dashboards should either display the backup and backup verification PVC use, or contain a link to the associated PVC dashboard.

WadeBarnes commented 1 year ago

The metrics seem to be tracking specific container/pod instances rather than the more generic container/pod names. Using the dev environment as an example, I performed a rollout of all of the pods and watched the metrics on the corresponding dashboard. Many of the metrics were lost after the new containers started, indicating that the metrics are watching specific instances. The moments of data loss can be seen over this time range: https://app.sysdigcloud.com/#/dashboards/336843?from=1670259120&to=1670259450&scope=kubernetes.namespace.name%20in%20%28%224a9599-dev%22%29

I suspect the test and prod dashboards will be affected by the same issue.
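To illustrate the difference between tracking instances and tracking names, the sketch below groups samples by a workload name derived from each pod name, so that data recorded before and after a rollout ends up in the same series. It assumes pod names of the form `<workload>-<revision or hash>-<5-character suffix>`; the regular expression is a heuristic, and the pod names in the example are hypothetical. In the dashboards themselves the equivalent fix would be to scope or segment panels by a workload-level label rather than by pod name.

```python
import re
from collections import defaultdict

# Heuristic for "<workload>-<revision or hash>-<5-char suffix>".
# Apply it only to pod instance names: a bare workload name whose last
# segment happens to be five characters long would be shortened too.
_POD_SUFFIX = re.compile(r"-[a-z0-9]+-[a-z0-9]{5}$")


def workload_name(pod_name):
    """Strip the per-instance suffix from a pod name."""
    return _POD_SUFFIX.sub("", pod_name)


def group_by_workload(samples):
    """Group (pod_name, value) samples by the derived workload name."""
    grouped = defaultdict(list)
    for pod_name, value in samples:
        grouped[workload_name(pod_name)].append(value)
    return dict(grouped)


if __name__ == "__main__":
    # Hypothetical pod instances before and after a rollout.
    samples = [
        ("aries-mediator-db-3-abcde", 1.0),
        ("aries-mediator-db-4-fghij", 2.0),
    ]
    print(group_by_workload(samples))
```

Keyed by pod name, the two samples would form two separate series, with the first ending at the rollout. Keyed by workload name, they form one continuous series.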

WadeBarnes commented 1 year ago

@rajpalc7, The following items still need to be addressed.

https://trello.com/c/Mj9EqIzq/88-sysdig-dashboards

PVC Dashboard:

Prod

There are 14 PVCs in the 4a9599-prod environment, for which only the details of 13 are listed. Also only 13 of the 14 are listed in the related graph.

Metrics Dashboard:

Dev

aries-endorser-wallet is listed now, but only the pod count is being graphed.

DTS Service - Prod Allure - 4a9599 Dashboard

Allure and Matomo are different services. If you are going to split them up from the other DTS Service Dashboards, they should each have their own dashboard. The metrics for backup-mariadb should also go onto its own dashboard, to be consistent with what you did with the other backup dashboards.

rajpalc7 commented 1 year ago

@WadeBarnes

https://trello.com/c/Mj9EqIzq/88-sysdig-dashboards

Links should be working correctly now:

PVC Dashboard:

Prod

The one missing PVC is certbot. I tried to include its data in the dashboard, but it produces no result; it looks like this is again a Sysdig issue. (It could also be because certbot is not currently used by any of the pods.)

Metrics Dashboard:

Dev

"aries-endorser-wallet is listed now, but only the pod count is being graphed." This is again a Sysdig issue (no data found; I raised a ticket with Dustin regarding this).

DTS Service - Prod Allure - 4a9599 Dashboard

This is fixed now.

WadeBarnes commented 1 year ago

https://trello.com/c/Mj9EqIzq/88-sysdig-dashboards

DTS Service - Test

DTS Service - Prod Allure

DTS Service - Persistent Volume Claims

DTS Service - Prod Backup-Mariadb

WadeBarnes commented 1 year ago

DTS Service - Persistent Volume Claims

  • Please create a separate ticket to track this issue and bring it up with Dustin.

@rajpalc7, Where is the ticket to track this issue?

rajpalc7 commented 1 year ago

@WadeBarnes - https://github.com/bcgov/DITP-DevOps/issues/29

New ticket created. I will follow up on this once the Sysdig metrics tuning is completed by Shelly from the Platform team.