Complete the metrics exposed by Dependency Track for better monitoring

DependencyTrack / dependency-track

Dependency-Track is an intelligent Component Analysis platform that allows organizations to identify and reduce risk in the software supply chain.

Apache License 2.0


Current Behavior:

Since v4.6, Dependency-Track exposes some metrics using Micrometer and Prometheus. Most of the metrics (if not all) are provided directly by the Alpine framework and relate to technical components of DT:

JVM
ExecutorService
HikariCP pool
DataNucleus
Alpine Event subsystem

Adding more metrics would make monitoring more useful.

Proposed Behavior:

I propose the following non-exhaustive list of metrics (feel free to add to it or remove from it). The provided Grafana dashboard should be updated accordingly.

Technical

http_server_requests_seconds to track inbound API requests
http_client_requests_seconds to track outbound API requests
resilience4j_* metrics to monitor retry and ratelimiter (i.e. for Snyk) features
cache_* metrics to monitor cache efficiency (size, hit/miss ratio)
task_execution to track the performance of background tasks (time taken)

Note

Some rework will be needed on the way client request URIs are built, so that metrics are tagged with a URI pattern rather than the concrete URI. Otherwise the metrics stream gets clogged with too many distinct URIs.
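To illustrate the URI pattern idea, here is a minimal sketch that collapses variable path segments into placeholders before the value is used as a metric tag. The class name `UriTemplates`, the method `toTemplate`, and the `{uuid}` / `{id}` placeholders are hypothetical and not taken from Dependency-Track or Alpine. It only recognises UUIDs and purely numeric segments, so it is a fallback at best; tagging with the template the client was built from would be more reliable.

```java
import java.util.regex.Pattern;

public class UriTemplates {
    // Assumption: variable segments are either UUIDs or plain numbers.
    private static final Pattern UUID = Pattern.compile(
        "[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}");
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    /** Collapses variable path segments so the result has low cardinality as a tag value. */
    public static String toTemplate(String uri) {
        // Drop the query string and fragment; they vary per request.
        String path = uri.split("[?#]", 2)[0];
        String[] segments = path.split("/", -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < segments.length; i++) {
            if (i > 0) {
                out.append('/');
            }
            String segment = segments[i];
            if (UUID.matcher(segment).matches()) {
                out.append("{uuid}");
            } else if (NUMBER.matcher(segment).matches()) {
                out.append("{id}");
            } else {
                out.append(segment);
            }
        }
        return out.toString();
    }
}
```

With this, requests for different projects or components end up under the same tag value, so the number of time series stays bounded by the number of endpoints rather than the number of entities.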

Notification metrics are recorded after the subscription check (in https://github.com/stevespringett/Alpine/blob/master/alpine-infra/src/main/java/alpine/notification/NotificationService.java#L103), so no metrics are produced when there are no subscribers. IMHO, metric publication should not depend on subscriptions.

Some of the metrics above could be implemented globally in the Alpine framework.

Functional (or DT specifics)

I can't think of any; or rather, they are already implemented in the frontend dashboard (# projects, # components, ...)

Raising to p2 as this becomes increasingly important for debugging performance issues users may run into.

We can currently only roughly pinpoint bottlenecks or blocking tasks, using the event system metrics. However, that's not all that useful when being confronted with "BOM upload processing takes too long, what is the blocker?".

A few metrics that come to mind that would be good to collect:

BOM / VEX processing

BOM upload processing duration
BOM upload processing success and failure rates
VEX upload processing duration
VEX upload processing success and failure rates
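A duration metric and a success/failure rate can come from the same measurement if each observation carries an outcome tag. The following is a library-agnostic sketch of that shape. `TaskTiming`, `Recorder`, `timed`, and the task name `bom_upload_processing` are hypothetical names for illustration; in Dependency-Track the recorder would presumably be backed by a Micrometer timer rather than an in-memory list.

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;

public class TaskTiming {
    /** One observation: which task ran, how it ended, and how long it took. */
    public static final class Sample {
        public final String task;
        public final String outcome;
        public final Duration duration;

        Sample(String task, String outcome, Duration duration) {
            this.task = task;
            this.outcome = outcome;
            this.duration = duration;
        }
    }

    /** Hypothetical stand-in for a metrics registry; it only keeps samples in memory. */
    public static final class Recorder {
        private final List<Sample> samples = new ArrayList<>();

        public synchronized void record(String task, String outcome, Duration duration) {
            samples.add(new Sample(task, outcome, duration));
        }

        public synchronized List<Sample> samples() {
            return new ArrayList<>(samples);
        }
    }

    /** Runs the work and records its duration, tagged "success" or "failure". */
    public static <T> T timed(Recorder recorder, String task, Callable<T> work) throws Exception {
        long start = System.nanoTime();
        String outcome = "success";
        try {
            return work.call();
        } catch (Exception e) {
            outcome = "failure";
            throw e; // the caller still sees the original exception
        } finally {
            // Recorded on both paths, so failures are timed as well.
            recorder.record(task, outcome, Duration.ofNanos(System.nanoTime() - start));
        }
    }
}
```

Wrapping the processing step as `TaskTiming.timed(recorder, "bom_upload_processing", () -> process(bom))` (where `process` is a placeholder) yields one sample per upload. Failure rate is then the share of samples tagged `failure`, and the same pattern would apply to VEX uploads.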

Vulnerability analysis

Vulnerability analysis task duration
- A Tag should denote the target of the analysis (portfolio, project, component)
Internal vulnerability analysis task duration
OSS Index vulnerability analysis task duration
OSS Index average batch size
Snyk vulnerability analysis task duration

Policy Evaluation

Policy evaluation task duration
- A Tag should denote the target of the evaluation (project, component)

Metrics Updates

Metrics update task duration
- A Tag should denote the target of the update (portfolio, project, component, vulnerabilities)

Mirroring

EPSS mirroring task duration
GitHub Advisories mirroring task duration
NVD mirroring task duration
OSV mirroring task duration

Repository Meta Analysis

Repository meta analysis task duration
Repository meta analysis duration
- This would be per analyzer (e.g. Gem, Maven, NPM)
- Potentially with a Tag that denotes the repository by name (e.g. central)
