Open msheiny opened 6 years ago
This ticket appears crazily ambitious, and I don't know if it's current. It sounds like you want a full-service DevOps/SRE solution.
The alerting features you were seeking are all covered by Prometheus Alertmanager. Alerts can be grouped and filtered, and the rules are written with easy-to-understand conditional logic. For example:
- alert: SecureDropInstanceDown
  expr: probe_http_status_code != 200
  for: 10m
  labels:
    severity: critical
  annotations:
    description: '{{ $labels.instance }} at {{ $labels.address }} might be down'
    summary: '{{ $labels.instance }} is not returning HTTP 200/OK'
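To make the grouping and filtering point concrete, here is a rough sketch of what an Alertmanager routing config could look like. The receiver names and webhook URLs are made up for illustration, and the timings are just plausible starting values, not a recommendation for SecureDrop.

```yaml
# alertmanager.yml (sketch only; receiver names and URLs are hypothetical)
route:
  # Fallback receiver for anything not matched by a child route
  receiver: admin-default
  # Alerts sharing these labels are bundled into one notification
  group_by: ['alertname', 'instance']
  group_wait: 30s       # wait for related alerts before the first notification
  group_interval: 5m    # wait before notifying about new alerts in a group
  repeat_interval: 4h   # wait before re-sending a still-firing alert
  routes:
    # Only alerts labeled severity=critical go to the urgent channel
    - matchers:
        - severity="critical"
      receiver: admin-urgent

receivers:
  - name: admin-default
    webhook_configs:
      - url: 'http://127.0.0.1:9000/low-priority'   # placeholder endpoint
  - name: admin-urgent
    webhook_configs:
      - url: 'http://127.0.0.1:9000/urgent'         # placeholder endpoint
```

A webhook receiver is used here only because it is generic; it could front whatever non-email channel ends up being chosen.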
Addressing a couple of other things you mentioned...
For monitoring the Tor network, I highly recommend: https://github.com/atx/prometheus-tor_exporter which I've used at Calyx.
For monitoring containers, see cAdvisor, or secondarily a product like Sysdig.
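As a sketch of how those two exporters might be wired in, a Prometheus scrape config could look roughly like this. The hostname is a placeholder, and the ports are assumptions (9099 for the Tor exporter, 8080 for cAdvisor); check each project's docs for the actual listen address in your setup.

```yaml
# prometheus.yml fragment (sketch only; targets are hypothetical)
scrape_configs:
  # Tor daemon metrics via prometheus-tor_exporter (port is an assumption)
  - job_name: tor
    static_configs:
      - targets: ['app-server.example:9099']
  # Per-container resource metrics via cAdvisor (port is an assumption)
  - job_name: cadvisor
    static_configs:
      - targets: ['app-server.example:8080']
```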
With regard to aggregated statistics about uploads, I did have that in https://github.com/freedomofpress/securedrop/pull/4414 which was understandably declined.
Container based OS/green-field -- logging/alerting discussion
Feature request
Description
Since we effectively get a clean slate here to redesign the logging and alerting story, let's first break down what we are trying to collect and when we think admins should be alerted.
A big problem with the current OSSEC design is that alerts are NOT actionable and are therefore easily dismissed or neglected. We need to keep in mind that alerts should only be sent when we expect an admin to look at the info and make a quick assessment of whether they need to take action. Many discussions on GitHub also indicate we need to move away from email (lots of "hey, let's move to Signal", like #1124).
Might want to break this ticket up when implementation time comes to further discuss specific issues, but here we go...
Metric data we want to collect:
Aggregated stats on uploads
- want to avoid collecting metadata that gives away source post/login times/dates
Host disk/CPU/memory usage
Tor network status
Administrative/journalist login events
- ssh/console/web
System config change events
- new users added/removed, securedrop settings modified
Container service status
- when containers bounce, versions are updated, container host-level issues
File manipulation
- in containers and on host (currently using ossec to handle this)
All other logs
- from all containers, hosts
Basic features we want:
Nice to have (optional/reach-goals):
When admin should be alerted:
User Stories
As a SecureDrop administrator, I would like sane, actionable alerts.