Closed WadeBarnes closed 1 year ago
Using the links from this resource card to access the dashboards: https://trello.com/c/Mj9EqIzq/88-sysdig-dashboards
The links to the test and prod dashboards are broken.
The links to all of the Backup dashboards for all environments are missing from the card.
The graph for the DTS Service Test environment is labeled BC Reg Test PVC Usage.
The aries-endorser-db display for the DTS Service Prod environment is labeled Event-Processor-Log-DB-Primary PVC usage.
There are 7 PVCs in the 4a9599-dev environment, for which only the details of 5 are listed. All 7 are listed in the related graph.
There are 7 PVCs in the 4a9599-test environment, for which only the details of 5 are listed. All 7 are listed in the related graph.
There are 14 PVCs in the 4a9599-prod environment, for which only the details of 5 are listed. 13 of the 14 are listed in the related graph.
aries-endorser-wallet is missing from the dashboard.
Please group the mediator and endorser metrics together. For example, the aries-mediator-db metrics are not grouped with the other mediator metrics.
Same comments as dev.
aries-endorser-wallet is missing from the dashboard too.
Metrics for the allure and matomo services are missing from the dashboard.
The backup dashboards should either display the backup and backup verification PVC use, or contain a link to the associated PVC dashboard.
The metrics seem to be tracking container/pod instances rather than the more generic container/pod names. Using the dev environment as an example, I performed a rollout of all of the pods and watched the metrics on the corresponding dashboard. Many of the metrics were lost after the new containers started, indicating that the metrics are watching specific instances. The moments of data loss can be seen over this time range: https://app.sysdigcloud.com/#/dashboards/336843?from=1670259120&to=1670259450&scope=kubernetes.namespace.name%20in%20%28%224a9599-dev%22%29
I suspect the test and prod dashboards will be affected by the same issue.
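To illustrate the instance-versus-name point above, here is a minimal Python sketch. The sample data, pod instance names, and the `series_by` helper are all hypothetical and are not taken from the actual dashboards or from any Sysdig API. It only shows why a panel keyed on a pod instance loses its series after a rollout, while one keyed on the stable workload name keeps a continuous series.

```python
from collections import defaultdict

# Hypothetical samples: (timestamp, pod instance name, workload name, CPU cores used).
# The pod instance names are made up for illustration.
samples = [
    (100, "aries-mediator-db-1-abcde", "aries-mediator-db", 0.20),
    (160, "aries-mediator-db-1-abcde", "aries-mediator-db", 0.25),
    # A rollout happens here: the old pod is replaced by a new instance.
    (220, "aries-mediator-db-2-fghij", "aries-mediator-db", 0.22),
    (280, "aries-mediator-db-2-fghij", "aries-mediator-db", 0.21),
]


def series_by(samples, key_index):
    """Group (timestamp, value) points by the label found at key_index."""
    series = defaultdict(list)
    for sample in samples:
        series[sample[key_index]].append((sample[0], sample[3]))
    return dict(series)


# Keyed on the pod instance: the series is split in two at the rollout,
# so a panel pinned to the old instance stops receiving data.
by_instance = series_by(samples, 1)

# Keyed on the workload name: one continuous series across the rollout.
by_workload = series_by(samples, 2)
```

In dashboard terms, this suggests scoping or grouping each panel by a label that survives a rollout (for example the workload or container name) rather than by the pod instance name.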
@rajpalc7, the following items still need to be addressed:
https://trello.com/c/Mj9EqIzq/88-sysdig-dashboards
[DTS Service-Prod Allure] goes to the DTS Service - Persistent Volume Claims - 4a9599 dashboard.
There are 14 PVCs in the 4a9599-prod environment, for which only the details of 13 are listed. Also, only 13 of the 14 are listed in the related graph.
aries-endorser-wallet is listed now, but only the pod count is being graphed.
Allure and Matomo are different services. If you are going to split them up from the other DTS Service Dashboards they should each have their own dashboard.
The metrics for backup-mariadb should also go onto its own dashboard to be consistent with what you did with the other backup dashboards.
@WadeBarnes
https://trello.com/c/Mj9EqIzq/88-sysdig-dashboards
Links should be working correctly now:
PVC Dashboard: Prod - The one missing PVC is certbot. I tried to include its data in the dashboard, but it produces no result; this looks like another sysdig issue. (It could also be because certbot is not currently used by any of the pods.)
Metrics Dashboard: Dev - "aries-endorser-wallet is listed now, but only the pod count is being graphed." This is again a sysdig issue (no data found; I raised a ticket with Dustin regarding this).
DTS Service - Prod Allure - 4a9599 Dashboard - This is fixed now.
https://trello.com/c/Mj9EqIzq/88-sysdig-dashboards
BCReg-Dev link goes to the wrong dashboard; it is linked to the BCReg-Test dashboard.
BCReg PVCs link goes to a dashboard that does not exist.
allure-service label from the bottom of the page.
backup-mariadb - since you moved it to its own dashboard.
matomo - since you moved it to its own dashboard.
matomo-db - since you moved it to the new matomo dashboard.
DTS Service - Persistent Volume Claims
DTS Service - Prod Backup-Mariadb
CPU Core % are due to a divide-by-zero error. The query is using sysdig_container_cpu_cores_quota_limit, which for backup containers is purposely going to be zero. Look at how the other backup dashboards display these metrics.
DTS Service - Persistent Volume Claims - Please create a separate ticket to track this issue and bring it up with Dustin.
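The divide-by-zero problem described for the CPU Core % panel can be sketched in Python as follows. This is an illustrative guard only, not the dashboard's actual query: the `cpu_quota_percent` function and its parameter names are hypothetical, and `quota_limit` simply stands in for the value reported by sysdig_container_cpu_cores_quota_limit.

```python
from typing import Optional


def cpu_quota_percent(cores_used: float, quota_limit: float) -> Optional[float]:
    """Return CPU usage as a percentage of the quota limit.

    Returns None when no limit is set (limit of zero), so the caller can
    show "no limit" or fall back to plain core usage instead of dividing by zero.
    """
    if quota_limit <= 0:
        return None
    return 100.0 * cores_used / quota_limit


# A container with a limit of 0.5 cores that is using 0.25 cores.
with_limit = cpu_quota_percent(0.25, 0.5)

# A backup container whose limit is purposely zero.
without_limit = cpu_quota_percent(0.1, 0.0)
```

For containers without a limit, graphing the raw cores used instead of a percentage of the limit is one possible fallback; whether that matches how the other backup dashboards display these metrics would need to be checked against those dashboards.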
@rajpalc7, Where is the ticket to track this issue?
@WadeBarnes - https://github.com/bcgov/DITP-DevOps/issues/29
New ticket created. I will follow up on this once the sysdig metrics tuning is completed by Shelly from the Platform team.
While working with Certbot yesterday I ran into some resource issues in the Digital Trust Shared Service (prod) (4a9599-prod) environment. I'm looking at submitting a quota increase, but we need to back the request with some data and do some homework first. I'd like you to review the documents here, https://docs.developer.gov.bc.ca/request-quota-increase-for-openshift-project-set/, and set up sysdig monitoring on the Digital Trust Shared Service (4a9599) projects so we can collect some resource usage data, and first of all use that to perform some resource tuning on the pod instances in those environments. Please read the linked Resource management guidelines and Application resource tuning documents too, because I want to get you involved in the tuning process.