Financial-Times / splunk-heroku

Support for Heroku log drains packaged as a Splunk app.
https://tech.in.ft.com/tech-topics/logging/splunk/logging-from-heroku
MIT License
2 stars 0 forks source link

`SEDCMD` rules are not applied to some events #57

Closed sjparkinson closed 1 year ago

sjparkinson commented 2 years ago

We are observing some events retain the heroku source type because the SEDCMD rules are not getting applied, and subsequently the source type transforms are not applied as the regex won't match the non-seded events.

https://github.com/Financial-Times/splunk-heroku/blob/0824fbb1ed02f75226a8efde77c4cb6bc1008ed6/app/default/props.conf#L10-L14

We've used regex101 to confirm that the regexes are correct and should be applied to these events.

My suspicion is that it's a event volume thing, something to do with the HEC endpoint perhaps.

It might also be something to do with where the app's configuration is installed on the cluster, it must be on the indexers (we don't control this from my understanding but Splunk will have a look at our instance).

sjparkinson commented 2 years ago

Example 1

Example raw event

Source type is heroku from the HEC token configuration.

1065 <190>1 2022-09-06T11:42:21.414765+00:00 host app web.6 - {"event":"TOPIC_TRACKER_ENDPOINT_ERROR","route":"/article/:articleId","url":"/article/b8998b8e-2bbb-4f7a-8645-8d8a2e51a29c?numberOfArticles=8&source=app","method":"GET","error":{"message":"network timeout at: https://api.getcrux.io/article_score?user_id=...&article_id=b8998b8e-2bbb-4f7a-8645-8d8a2e51a29c&topic_id=c91b1fad-1097-468b-be82-9a8ff717d54c&num_recs=8&is_rec=false&attempt=0&source=app","stack":"FetchError: network timeout at: https://api.getcrux.io/article_score?user_id=...&article_id=b8998b8e-2bbb-4f7a-8645-8d8a2e51a29c&topic_id=c91b1fad-1097-468b-be82-9a8ff717d54c&num_recs=8&is_rec=false&attempt=0&source=app\n    at Timeout.<anonymous> (/app/node_modules/@financial-times/n-express/node_modules/isomorphic-fetch/node_modules/node-fetch/lib/index.js:1484:13)\n    at listOnTimeout (node:internal/timers:559:17)\n    at processTimers (node:internal/timers:502:7)"},"level":"error","message":""}

Expected event after SEDCMD rules applied

Source type should be heroku:app after TRANSFORMS-source_type is applied (after the SEDCMD rules are applied).

{"event":"TOPIC_TRACKER_ENDPOINT_ERROR","route":"/article/:articleId","url":"/article/b8998b8e-2bbb-4f7a-8645-8d8a2e51a29c?numberOfArticles=8&source=app","method":"GET","error":{"message":"network timeout at: https://api.getcrux.io/article_score?user_id=...&article_id=b8998b8e-2bbb-4f7a-8645-8d8a2e51a29c&topic_id=c91b1fad-1097-468b-be82-9a8ff717d54c&num_recs=8&is_rec=false&attempt=0&source=app","stack":"FetchError: network timeout at: https://api.getcrux.io/article_score?user_id=...&article_id=b8998b8e-2bbb-4f7a-8645-8d8a2e51a29c&topic_id=c91b1fad-1097-468b-be82-9a8ff717d54c&num_recs=8&is_rec=false&attempt=0&source=app\n    at Timeout.<anonymous> (/app/node_modules/@financial-times/n-express/node_modules/isomorphic-fetch/node_modules/node-fetch/lib/index.js:1484:13)\n    at listOnTimeout (node:internal/timers:559:17)\n    at processTimers (node:internal/timers:502:7)"},"level":"error","message":""}

Example 2

Example raw event

Source type is heroku from the HEC token configuration.

278 <158>1 2022-09-06T11:42:21.397534+00:00 host heroku router - at=info method=HEAD path="/__gtg" host=specialist-preflight-us.herokuapp.com request_id=7f748199-e753-4b87-bd2f-a38f181d3575 fwd="10.0.0.0" dyno=web.1 connect=0ms service=10ms status=200 bytes=271 protocol=https

Expected event after SEDCMD rules applied

Source type should be heroku:router after TRANSFORMS-source_type is applied (after the SEDCMD rules are applied).

2022-09-06T11:42:21.397534+00:00 heroku router - at=info method=HEAD path="/__gtg" host=specialist-preflight-us.herokuapp.com request_id=7f748199-e753-4b87-bd2f-a38f181d3575 fwd="10.0.0.0" dyno=web.1 connect=0ms service=10ms status=200 bytes=271 protocol=https
sjparkinson commented 1 year ago

Dug into this further with Splunk on a call but didn't make much progress.

A good recent time to review is 15th September 2022 at 12–1pm, for the next-ads-api system.

We're going to continue to monitor this and review if there will be any logs in the _internal index to help debug why SEDCMD on the source type is not getting applied.

sjparkinson commented 1 year ago

The HEC monitoring console dashboard also suggests there are no issues with the data coming in from Heroku to the HTTP Event Collector endpoint.

Screenshot 2022-09-22 at 12 10 04
sjparkinson commented 1 year ago

Closing this as we've not seen issues since the 15th September 2022, and might have been resolved as part of our instance's routine maintenance.

sjparkinson commented 1 year ago

Curiously we saw a burst of unprocessed events on 29th September again, but none since.

That day after observing the issue I raised the following changes, which might have fixed it, but not clear which. I'd bet on it being something similar to https://github.com/Financial-Times/splunk-heroku/issues/56.

Marking as closed completed, but will continue to monitor.