AndreZiviani / aws-health-exporter

Prometheus exporter for AWS Health events that also sends them to Slack
Apache License 2.0

Start without Slack #3

Open sureshoss opened 5 months ago

sureshoss commented 5 months ago

This is impressive. Is there an option to start the application without the Slack integration? We don't have a way to connect to Slack in our org.

AndreZiviani commented 5 months ago

Hi @sureshoss, I can make the Slack integration optional, but as things stand that would not make much sense because it is the only integration available. I ended up not implementing the Prometheus metrics (Prometheus is only used as a timer to check AWS Health for new events), but I can look into it again. What is your use case?

sureshsubramaniam commented 5 months ago

Thanks for your response. My use case is to export the AWS Health data across regions and store it in Prometheus, so it can be visualized in a Grafana map panel with traffic lights. I was also looking to see whether we can scrape the account-level and resource-level stats in the same way, so we can build drill-down dashboards from the region to the accounts and to the resources.

AndreZiviani commented 5 months ago

@sureshsubramaniam @sureshoss Do you think a metric like this would solve your needs?

aws_health_exporter_event{accountid=<>, region=<>, service=<>}

I'm not sure adding the affected resources as labels is a good idea due to cardinality issues, but maybe I can create a flag to enable it. The value of the metric could be the number of updates on that event, going back to zero when the event is closed/resolved.
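To make the cardinality trade-off concrete, here is a minimal Go sketch of the proposed metric shape. It is not code from the exporter: the `Event` type, the `labelSets`, `value`, and `render` helpers, and the `resource` label name are all hypothetical, and it uses only the standard library rather than a Prometheus client. With the opt-in switch off, an event yields one series; with it on, it yields one series per affected resource.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Event is a hypothetical, simplified view of an AWS Health event.
type Event struct {
	AccountID string
	Region    string
	Service   string
	Resources []string // affected resources; can be many per event
	Updates   int      // number of updates seen so far
	Closed    bool
}

// labelSets returns one label set per series. Without resources there is a
// single series per event; with resources there is one per affected
// resource, which is where the cardinality concern comes from.
func labelSets(ev Event, withResources bool) []map[string]string {
	base := map[string]string{
		"accountid": ev.AccountID,
		"region":    ev.Region,
		"service":   ev.Service,
	}
	if !withResources || len(ev.Resources) == 0 {
		return []map[string]string{base}
	}
	sets := make([]map[string]string, 0, len(ev.Resources))
	for _, r := range ev.Resources {
		ls := map[string]string{"resource": r} // hypothetical label name
		for k, v := range base {
			ls[k] = v
		}
		sets = append(sets, ls)
	}
	return sets
}

// value follows the proposal: the number of updates, zero once closed.
func value(ev Event) int {
	if ev.Closed {
		return 0
	}
	return ev.Updates
}

// render prints one sample in a Prometheus-like text form with sorted
// label names. %q is only an approximation of Prometheus label escaping.
func render(name string, labels map[string]string, v int) string {
	keys := make([]string, 0, len(labels))
	for k := range labels {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	parts := make([]string, 0, len(keys))
	for _, k := range keys {
		parts = append(parts, fmt.Sprintf("%s=%q", k, labels[k]))
	}
	return fmt.Sprintf("%s{%s} %d", name, strings.Join(parts, ","), v)
}

func main() {
	ev := Event{
		AccountID: "123456789012",
		Region:    "us-east-1",
		Service:   "EC2",
		Resources: []string{"i-0abc", "i-0def"},
		Updates:   3,
	}
	for _, withResources := range []bool{false, true} {
		for _, ls := range labelSets(ev, withResources) {
			fmt.Println(render("aws_health_exporter_event", ls, value(ev)))
		}
	}
}
```

In this sketch the number of series grows with the number of affected resources only when the switch is on, which is the reason to keep it opt-in.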

AndreZiviani commented 5 months ago

I tried implementing metrics support but found a few issues with AWS API:

The official AWS AHA implementation also does not have this concept of state, where it does something when an event is opened or closed. It only notifies that something changed, so I assume it is not possible (or practical) to implement something like that.

These are some example metrics from what I managed to implement. I think the best route is a single counter that increments on each update and resets on exporter restart. Any suggestions?

aws_health_event{category="issue",code="AWS_EC2_OPERATIONAL_ISSUE",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-1",scope="PUBLIC",service="EC2"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-1",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_ELASTICACHE_UPDATE_AVAILABLE",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="ELASTICACHE"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_RDS_OPERATIONAL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-1",scope="ACCOUNT_SPECIFIC",service="RDS"} 1
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_REDUNDANCY_LOSS",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 0
aws_health_event{account="<redacted>",category="accountNotification",code="AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",otel_scope_name="aws-health-exporter",otel_scope_version="",region="us-east-2",scope="ACCOUNT_SPECIFIC",service="VPN"} 1
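
The counter idea described above (increment on each update, reset on exporter restart) could look roughly like the following Go sketch. It is an illustration only, not the exporter's code: `seriesKey` and `updateCounter` are hypothetical names, the key fields simply mirror the labels in the sample output, and state lives in memory so a restart naturally starts every series from zero again.

```go
package main

import (
	"fmt"
	"sync"
)

// seriesKey mirrors the labels visible in the sample output.
// The type and its field names are hypothetical.
type seriesKey struct {
	Account, Category, Code, Region, Scope, Service string
}

// updateCounter keeps per-series update counts in memory only, so all
// counts start from zero again whenever the process restarts.
type updateCounter struct {
	mu     sync.Mutex
	counts map[seriesKey]uint64
}

func newUpdateCounter() *updateCounter {
	return &updateCounter{counts: make(map[seriesKey]uint64)}
}

// Register makes a series visible with value zero if it is not known yet.
func (c *updateCounter) Register(k seriesKey) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if _, ok := c.counts[k]; !ok {
		c.counts[k] = 0
	}
}

// Inc records one update for the series and returns the new count.
func (c *updateCounter) Inc(k seriesKey) uint64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.counts[k]++
	return c.counts[k]
}

// Value returns the current count and whether the series is known.
func (c *updateCounter) Value(k seriesKey) (uint64, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	v, ok := c.counts[k]
	return v, ok
}

func main() {
	c := newUpdateCounter()
	vpn := seriesKey{
		Category: "accountNotification",
		Code:     "AWS_VPN_SINGLE_TUNNEL_NOTIFICATION",
		Region:   "us-east-2",
		Scope:    "ACCOUNT_SPECIFIC",
		Service:  "VPN",
	}
	c.Register(vpn) // series exists with value 0
	c.Inc(vpn)      // first update
	c.Inc(vpn)      // second update
	v, _ := c.Value(vpn)
	fmt.Println(v)
}
```

A counter that resets on restart is the usual Prometheus counter behaviour, so functions such as rate() and increase() on the query side already account for it.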

AndreZiviani commented 5 months ago

If you want to give it a shot, here is a release, but keep in mind it is untested: https://github.com/AndreZiviani/aws-health-exporter/releases/tag/v0.1.0

sureshoss commented 5 months ago

Thanks @AndreZiviani, I will test it and update you with some more comments.

sureshoss commented 5 months ago

Initial testing: Dependency on the GLIBC from the compiled binary

-> ./aws-health-exporter --help

./aws-health-exporter: /lib64/libc.so.6: version `GLIBC_2.32' not found (required by ./aws-health-exporter)
./aws-health-exporter: /lib64/libc.so.6: version `GLIBC_2.34' not found (required by ./aws-health-exporter)

I will compile it against the GLIBC version that I have on my system and update you.

sureshoss commented 5 months ago

I compiled it and started it on my Linux machine. The exporter starts without issue, but I am unable to see any of the health metrics for the account or for the org. I am running it on an EC2 instance with Red Hat Linux.

-> ./aws-health-exporter --log-level debug --log-events true

DEBU[0000] Set log level to debug
INFO[0000] Starting AWS Health Exporter. [log-level=debug,log-events=true]
INFO[0000] Starting metric http endpoint [address=:8080, path=/metrics, regions=all-regions]

No debug logs are printed to help identify the issue. I see only metrics such as aws_health_process_runtime_go_gc_pause_ns_bucket and aws_health_process_runtime_go_mem_live_objects, most of them related to the exporter itself, not the actual health metrics like the ones you showed.

AndreZiviani commented 5 months ago

however the exporter starts without issue but i am unable to see any of the health metrics for the account or for the org.

That is expected: because the exporter is stateless, it only checks for new updates since the last time it was scraped (or started).

I've added a hidden option to shift the time window on the first scrape. Try running with --time-shift -240h to force it to look at all events from the last 10 days.

AndreZiviani commented 5 months ago

Initial testing: Dependency on the GLIBC from the compiled binary

I forgot to disable CGO on the release binaries; the latest version should work for you: https://github.com/AndreZiviani/aws-health-exporter/releases/tag/v0.1.1

sureshsubramaniam commented 5 months ago

Awesome, let me give it a try today and update you.

sureshoss commented 4 months ago

@AndreZiviani I took a shot at running the latest build and it seems there is a panic in the code. However, I checked using the AWS CLI and was able to get the events without being throttled.

-> ./aws-health-exporter -v debug -r us-east-1 --time-shift -240h

DEBU[0000] Set log level to debug
INFO[0000] Starting AWS Health Exporter. [log-level=debug,log-events=false]
INFO[0017] Starting metric http endpoint [address=:8080, path=/metrics, regions=us-east-1]
panic: operation error Health: DescribeAffectedAccountsForOrganization, exceeded maximum number of attempts, 3, https response error StatusCode: 429, RequestID: xxx-xxx-xxxx-xxxx-xxxxxx, api error ThrottlingException: Rate exceeded

goroutine 68 [running]:
github.com/AndreZiviani/aws-health-exporter/exporter.Metrics.getAffectedAccountsForOrg({0xc000112680, 0x0, {0x0, 0x0}, {0x0, 0x0}, 0x13ed300, {0xc17aad47cc2b13c3, 0xfffcee3255623100, 0x13f8b40}, ...}, ...)
    /home/runner/work/aws-health-exporter/aws-health-exporter/exporter/org.go:67 +0x208
github.com/AndreZiviani/aws-health-exporter/exporter.(*Metrics).EnrichOrgEvents(0xc0002da200, {0xedaf58, 0x1427ce0}, {0xc0004a9c00, 0x0, {0xc00033afb0, 0x10}, {0xc000360618, 0x13}, 0xc0004a9c10, ...})
    /home/runner/work/aws-health-exporter/aws-health-exporter/exporter/org.go:50 +0x146
github.com/AndreZiviani/aws-health-exporter/exporter.(*Metrics).GetOrgEvents(0xc0002da200)
    /home/runner/work/aws-health-exporter/aws-health-exporter/exporter/org.go:36 +0x36e
github.com/AndreZiviani/aws-health-exporter/exporter.(*Metrics).GetHealthEvents(0xc0002da200)
    /home/runner/work/aws-health-exporter/aws-health-exporter/exporter/health.go:29 +0x33
github.com/AndreZiviani/aws-health-exporter/exporter.NewMetrics.func1({0xcfb660?, 0x1427ce0?}, {0xed9dc0, 0xc0004ae060})
    /home/runner/work/aws-health-exporter/aws-health-exporter/exporter/metrics.go:27 +0x48
go.opentelemetry.io/otel/sdk/metric.(*meter).RegisterCallback.func1({0xedaf58, 0x1427ce0})
    /home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk/metric@v1.24.0/meter.go:445 +0x55
go.opentelemetry.io/otel/sdk/metric.(*pipeline).produce(0xc0000fe510, {0xedaf58, 0x1427ce0?}, 0xc000352060)
    /home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk/metric@v1.24.0/pipeline.go:134 +0x314
go.opentelemetry.io/otel/sdk/metric.(*ManualReader).Collect(0xc0000a3860, {0xedaf58, 0x1427ce0}, 0xc000352060)
    /home/runner/go/pkg/mod/go.opentelemetry.io/otel/sdk/metric@v1.24.0/manual_reader.go:123 +0xe2
go.opentelemetry.io/otel/exporters/prometheus.(*collector).Collect(0xc0002ea000, 0xc000069f60?)
    /home/runner/go/pkg/mod/go.opentelemetry.io/otel/exporters/prometheus@v0.46.0/exporter.go:158 +0x72
github.com/prometheus/client_golang/prometheus.(*Registry).Gather.func1()
    /home/runner/go/pkg/mod/github.com/prometheus/client_golang@v1.19.0/prometheus/registry.go:457 +0xe7
created by github.com/prometheus/client_golang/prometheus.(*Registry).Gather in goroutine 15
    /home/runner/go/pkg/mod/github.com/prometheus/client_golang@v1.19.0/prometheus/registry.go:547 +0xbab

AndreZiviani commented 4 months ago

@sureshoss That's odd. It looks like you have a lot of accounts/events and the API is throttling you, but the SDK should handle retries and rate limiting. I will try to look into it.

AndreZiviani commented 4 months ago

Hey @sureshoss, I wasn't able to reproduce your issue, probably because I don't have enough events/resources, but I've changed the logic in the retryer. Please let me know if this fixes your issue. If it does not, I can be more explicit and increase some other parameters: https://github.com/AndreZiviani/aws-health-exporter/releases/tag/v0.1.2