fleetdm / fleet

Open-source platform for IT, security, and infrastructure teams. (Linux, macOS, Chrome, Windows, cloud, data center)
https://fleetdm.com

Collect metrics from the Vulnerability repo #18275

Open · sharon-fdm opened 4 months ago

sharon-fdm commented 4 months ago

Goal

User story
As an engineer at FleetDM,
I want to know about any malfunction in the vuln repo as soon as possible
so that I can fix it quickly and customers don't stop receiving vulnerability data from it.

Context

We currently collect metrics and send them to Datadog. Use the same mechanism to send info from the GitHub Action on the vuln repo directly to Datadog. Collect the following (see the sketch after this list):

  1. Time elapsed between the previous release and the current one.
  2. Number of CVEs published.
  3. Average number of CPEs per CVE.
  4. Number of times the release was downloaded in the last 24 hours.
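A minimal sketch of what the GitHub Action step could run, assuming Datadog's v1 `series` submission endpoint and hypothetical metric names, env var names, and values (in the real action these would be computed from the release artifacts and the GitHub API):

```go
// Sketch: submit gauge metrics from the vuln repo's GitHub Action to Datadog.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"time"
)

type payload struct {
	Series []metric `json:"series"`
}

type metric struct {
	Metric string       `json:"metric"`
	Points [][2]float64 `json:"points"` // each point is [timestamp, value]
	Type   string       `json:"type"`
	Tags   []string     `json:"tags,omitempty"`
}

func gauge(name string, value float64) metric {
	now := float64(time.Now().Unix())
	return metric{
		Metric: name,
		Points: [][2]float64{{now, value}},
		Type:   "gauge",
		Tags:   []string{"source:vuln-repo-action"},
	}
}

func main() {
	// Hypothetical metric names and values.
	p := payload{Series: []metric{
		gauge("vuln.hours_since_previous_release", 26),
		gauge("vuln.cves_published", 1234),
		gauge("vuln.avg_cpes_per_cve", 3.2),
		gauge("vuln.release_downloads_24h", 587),
	}}

	body, err := json.Marshal(p)
	if err != nil {
		panic(err)
	}

	req, err := http.NewRequest(http.MethodPost,
		"https://api.datadoghq.com/api/v1/series", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("DD-API-KEY", os.Getenv("DATADOG_API_KEY"))

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("Datadog response:", resp.Status)
}
```

The API key would come from a repository secret exposed to the workflow step as an environment variable.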

Changes

Product

Engineering

ℹ️  Please read this issue carefully and understand it. Pay special attention to UI wireframes, especially "dev notes".

QA

Risk assessment

Manual testing steps

  1. Step 1
  2. Step 2
  3. Step 3

Testing notes

Confirmation

  1. [ ] Engineer (@____): Added comment to user story confirming successful completion of QA.
  2. [ ] QA (@____): Added comment to user story confirming successful completion of QA.
sharon-fdm commented 4 months ago

cc: @lukeheath @noahtalerman @mostlikelee. Moved to "specified" per this:

[screenshot]
lukeheath commented 4 months ago

@sharon-fdm This seems valuable because it will give us visibility into what's happening inside the vulnerabilities GitHub workflow. Could we also report how many times the release has been downloaded?

Our ability to work on this will depend on the estimate.
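For reference, GitHub's REST API exposes a cumulative `download_count` per release asset, so a "last 24 hours" figure would come from diffing successive readings. A minimal sketch, with the repo name assumed for illustration:

```go
// Sketch: sum download_count across the latest release's assets.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

type release struct {
	TagName string `json:"tag_name"`
	Assets  []struct {
		Name          string `json:"name"`
		DownloadCount int    `json:"download_count"`
	} `json:"assets"`
}

func main() {
	// Repo path is illustrative; point it at wherever the vuln artifacts are published.
	resp, err := http.Get("https://api.github.com/repos/fleetdm/nvd/releases/latest")
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var rel release
	if err := json.NewDecoder(resp.Body).Decode(&rel); err != nil {
		panic(err)
	}

	total := 0
	for _, a := range rel.Assets {
		total += a.DownloadCount
	}
	fmt.Printf("release %s: %d total downloads\n", rel.TagName, total)
}
```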

lukeheath commented 4 months ago

@sharon-fdm I'm assigning back to you to take to estimation.

sharon-fdm commented 4 months ago

@lukeheath

> how many times the release has been downloaded

Good metric. Added.

sharon-fdm commented 4 months ago

Vuln repo: 5 points
Heroku + Datadog: 1 point

lukeheath commented 4 months ago

@sharon-fdm It won't be easy to prioritize this soon at a 5-point estimate. Can we reduce scope and not include Datadog at all? What if we just fire a Slack notification to #help-engineering if the job fails? Seems like we could do that in 1-2 hours.
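For reference, a failure alert like this can be posted with a Slack incoming webhook from the workflow's failure step. A minimal sketch, assuming the webhook URL is stored as a secret and using GitHub Actions' standard run environment variables:

```go
// Sketch: post a failure alert to #help-engineering via a Slack incoming webhook.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

func main() {
	// Link back to the failing workflow run using GitHub Actions' built-in env vars.
	runURL := fmt.Sprintf("%s/%s/actions/runs/%s",
		os.Getenv("GITHUB_SERVER_URL"),
		os.Getenv("GITHUB_REPOSITORY"),
		os.Getenv("GITHUB_RUN_ID"))

	body, err := json.Marshal(map[string]string{
		"text": "Vulnerability release job failed: " + runURL,
	})
	if err != nil {
		panic(err)
	}

	resp, err := http.Post(os.Getenv("SLACK_WEBHOOK_URL"), "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("Slack response:", resp.Status)
}
```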

sharon-fdm commented 4 months ago

@lukeheath Makes sense to send critical alerts only. @mostlikelee, two questions:

  1. Can we reduce the scope to lower the effort of shooting info to Heroku?
  2. If not, we can create a Slack app to easily send events. (I just created such an app a month ago and can help with that.)

TMWYH

mostlikelee commented 4 months ago

@sharon-fdm we already have failures posting to the P1-Help channel. We could timebox the metric effort to 2-3 points.