BrunoBonacci / mulog

μ/log is a micro-logging library that logs events and data, not words!
https://cljdoc.org/d/com.brunobonacci/mulog/
Apache License 2.0

Publish errors with Cloudwatch Logs #114

Closed. cch1 closed this issue 1 year ago.

cch1 commented 1 year ago

I have recently enabled the cloudwatch logs publisher. From time to time I see a spurt of logged errors like this one:

mulog/publisher-error

clojure.lang.ExceptionInfo: μ/log cloudwatch publisher publish failure, group 'stt' stream '4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h' reason: {:rs {:cognitect.anomalies/category :cognitect.anomalies/not-found, :cognitect.anomalies/message "logs.us-east-1.amazonaws.com"}, :group-name "stt", :stream-name "4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h"}
    at com.brunobonacci.mulog.publishers.cloudwatch$publish_BANG_.invokeStatic(cloudwatch.clj:58)
    at com.brunobonacci.mulog.publishers.cloudwatch$publish_BANG_.invoke(cloudwatch.clj:43)
    at com.brunobonacci.mulog.publishers.cloudwatch$put_log_events.invokeStatic(cloudwatch.clj:76)
    at com.brunobonacci.mulog.publishers.cloudwatch$put_log_events.invoke(cloudwatch.clj:70)
    at com.brunobonacci.mulog.publishers.cloudwatch.CloudwatchPublisher.publish(cloudwatch.clj:103)
    at com.brunobonacci.mulog.core$start_publisher_BANG_$publish_attempt__25392.invoke(core.clj:194)
    at clojure.core$binding_conveyor_fn$fn__5823.invoke(core.clj:2050)
    at clojure.lang.AFn.applyToHelper(AFn.java:154)
    at clojure.lang.RestFn.applyTo(RestFn.java:132)
    at clojure.lang.Agent$Action.doRun(Agent.java:114)
    at clojure.lang.Agent$Action.run(Agent.java:163)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:829)
@ingestionTime
1692143237705
@log
219246831182:stt
@logStream
4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h
@message
{"publisher-type":"cloudwatch","mulog/namespace":"clojure.core","publisher-id":"4ribfKl7l_9QZVscjcIGBt3yYKi8Eigw","mulog/action":"publish","mulog/timestamp":1692143236553,"exception":"clojure.lang.ExceptionInfo: \u03bc/log cloudwatch publisher publish failure, group 'stt' stream '4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h' reason: {:rs {:cognitect.anomalies/category :cognitect.anomalies/not-found, :cognitect.anomalies/message \"logs.us-east-1.amazonaws.com\"}, :group-name \"stt\", :stream-name \"4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h\"}\n\tat com.brunobonacci.mulog.publishers.cloudwatch$publish_BANG_.invokeStatic(cloudwatch.clj:58)\n\tat com.brunobonacci.mulog.publishers.cloudwatch$publish_BANG_.invoke(cloudwatch.clj:43)\n\tat com.brunobonacci.mulog.publishers.cloudwatch$put_log_events.invokeStatic(cloudwatch.clj:76)\n\tat com.brunobonacci.mulog.publishers.cloudwatch$put_log_events.invoke(cloudwatch.clj:70)\n\tat com.brunobonacci.mulog.publishers.cloudwatch.CloudwatchPublisher.publish(cloudwatch.clj:103)\n\tat com.brunobonacci.mulog.core$start_publisher_BANG_$publish_attempt__25392.invoke(core.clj:194)\n\tat clojure.core$binding_conveyor_fn$fn__5823.invoke(core.clj:2050)\n\tat clojure.lang.AFn.applyToHelper(AFn.java:154)\n\tat clojure.lang.RestFn.applyTo(RestFn.java:132)\n\tat clojure.lang.Agent$Action.doRun(Agent.java:114)\n\tat clojure.lang.Agent$Action.run(Agent.java:163)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base/java.lang.Thread.run(Thread.java:829)\n","mulog/origin":"mulog/core","mulog/trace-id":"4rijzIExyYS007aE3aYml7tBAw1VJNXe","mulog/event-name":"mulog/publisher-error"}
@timestamp
1692143236553
exception
clojure.lang.ExceptionInfo: μ/log cloudwatch publisher publish failure, group 'stt' stream '4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h' reason: {:rs {:cognitect.anomalies/category :cognitect.anomalies/not-found, :cognitect.anomalies/message "logs.us-east-1.amazonaws.com"}, :group-name "stt", :stream-name "4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h"}
at com.brunobonacci.mulog.publishers.cloudwatch$publish_BANG_.invokeStatic(cloudwatch.clj:58)
at com.brunobonacci.mulog.publishers.cloudwatch$publish_BANG_.invoke(cloudwatch.clj:43)
at com.brunobonacci.mulog.publishers.cloudwatch$put_log_events.invokeStatic(cloudwatch.clj:76)
at com.brunobonacci.mulog.publishers.cloudwatch$put_log_events.invoke(cloudwatch.clj:70)
at com.brunobonacci.mulog.publishers.cloudwatch.CloudwatchPublisher.publish(cloudwatch.clj:103)
at com.brunobonacci.mulog.core$start_publisher_BANG_$publish_attempt__25392.invoke(core.clj:194)
at clojure.core$binding_conveyor_fn$fn__5823.invoke(core.clj:2050)
at clojure.lang.AFn.applyToHelper(AFn.java:154)
at clojure.lang.RestFn.applyTo(RestFn.java:132)
at clojure.lang.Agent$Action.doRun(Agent.java:114)
at clojure.lang.Agent$Action.run(Agent.java:163)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
BrunoBonacci commented 1 year ago

µ/log uses itself to log internal errors. This one is to notify you that the publisher failed to publish the payload.

This could happen for various reasons, most commonly:

The above error seems to indicate that either the group stt or the stream 4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h were not present in the us-east-1 region (is that your region?).

The thing that puzzles me is that you see this error during the publishing phase. When you initialize the publisher, it creates a log stream, so I would have expected this error during initialization (start-publisher!), not during the routine publishing of events.

To get to the bottom of this, please check:

If you are experiencing other application errors consider this as a possible source. I hope this helps.

cch1 commented 1 year ago

Interestingly, the error messages I observed contain an ex-info with a stream attribute of 4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h and a reason attribute indicating (in the Cognitect AWS anomaly) that this stream does not exist. Yet this very message is published in that stream!! Is it possible that there is a race condition where the stream does not quite exist yet for publishing (provoking the error) and then moments later it does exist and can receive the internal mulog error?

cch1 commented 1 year ago
[screenshot: CloudWatch log stream events]

This screenshot shows that the log stream that was reported as unavailable was available milliseconds before the reported error (it holds the st.system.params/SSM-parameters-read event on line 21 as the first event in the stream). But the only other events in that stream are the internal mulog errors; perhaps mulog created a new stream as a result of the error condition.

BrunoBonacci commented 1 year ago

That type of race condition is not possible in µ/log. Before a publisher can be used, it must be initialized, and the log stream is created during that initialization (before the publisher is added).

There is one more scenario that would explain everything you described and the screenshot you sent me. Most AWS APIs are eventually consistent, meaning that after the creation of a new resource, it can take a few seconds before the resource is ready to be used. I see this all the time during automated integration testing. If during a test I create a bucket, a Dynamo table, etc., and then try to use it immediately, I often get a 404 error (or other errors). In the test environment, I have to add routine checks to verify that the resource I want to use is available and ready to use.

The same could be happening here. Usually the publisher is initialized when the process starts up. During the publisher initialization the log stream is created, and if you immediately try to use it while AWS is having a bad day and is under load, it could take a few seconds until the new resource is fully available.

I would suggest keeping an eye on it; if the error is what I described, it should self-heal after a few seconds.
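The "routine checks" mentioned above can be sketched as a generic wait-until-ready helper with exponential backoff. This is a hypothetical helper, not part of μ/log; in a real test the `check` callable would wrap something like a DescribeLogStreams call that looks for the freshly created stream:

```python
import time

def wait_until_ready(check, attempts=5, base_delay=0.1):
    """Poll `check` until it returns True, backing off exponentially.

    Returns True once the resource is ready, False if it never became
    ready within the allotted attempts. `check` is any zero-argument
    callable, e.g. one that queries AWS for the newly created stream.
    """
    delay = base_delay
    for _ in range(attempts):
        if check():
            return True
        time.sleep(delay)
        delay *= 2  # exponential backoff between polls
    return False

# Example: a resource that becomes "ready" on the third poll.
state = {"calls": 0}

def fake_check():
    state["calls"] += 1
    return state["calls"] >= 3

print(wait_until_ready(fake_check, attempts=5, base_delay=0.0))  # True
```

The backoff keeps the polling loop from hammering an eventually consistent API while still converging quickly in the common case.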

cch1 commented 1 year ago

> I would suggest keeping an eye on it, if the error is what I described it should self-heal after a few seconds.

Well, it did heal! It apparently healed by immediately creating another log stream.

But that raises two questions: could mulog thrash by creating a stream, writing to it, failing to write (because of eventual consistency), creating a new log stream, failing to write, and so on? In other words, should mulog have a backoff timer to prevent such thrashing?

My other concern is the loss of logged events, but I'm not sure how mulog recovers from a failure to write a message in this case (I'm aware that mulog drops excessive log messages, but this is different). So maybe it's a problem, maybe not. But I worry about it.

BrunoBonacci commented 1 year ago

It doesn't create another log stream; it just waits for the one already created to become available. No DDoS, no extra log streams created, no events lost in the meantime.

This is what happens:

[diagram: publisher buffering and retry flow]

Since µ/log received a 200 on the creation of the log stream, it expects the log stream to eventually be there. During the whole time it keeps buffering events in memory, and at regular intervals it tries to publish the content of the buffer. If the publish attempt fails, it just keeps buffering and retries later. You can find out more about the internals here: https://cljdoc.org/d/com.brunobonacci/mulog/0.9.0/doc/%CE%BC-log-internals
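The buffer-and-retry behaviour described above can be modelled in a few lines. This is a simplified sketch, not μ/log's actual implementation (which is Clojure and uses a ring buffer with a capacity limit); the key property it illustrates is that events leave the buffer only after a successful publish:

```python
from collections import deque

class BufferingPublisher:
    """Toy model of a publisher that keeps events buffered and only
    drops them once a publish attempt succeeds."""

    def __init__(self, send, capacity=10000):
        self.send = send                      # function taking a list of events
        self.buffer = deque(maxlen=capacity)  # oldest events drop only at capacity

    def enqueue(self, event):
        self.buffer.append(event)

    def publish_attempt(self):
        """Called at a regular interval. On failure the buffer is left
        untouched, so the same events are retried on the next tick."""
        if not self.buffer:
            return
        batch = list(self.buffer)
        try:
            self.send(batch)
        except Exception:
            return  # keep buffering; retry on the next interval
        for _ in batch:
            self.buffer.popleft()  # drop only what was successfully sent

# Example: the first attempt fails (stream not yet ready), the second succeeds.
sent, attempts = [], {"n": 0}

def flaky_send(batch):
    attempts["n"] += 1
    if attempts["n"] == 1:
        raise RuntimeError("ResourceNotFoundException")
    sent.extend(batch)

p = BufferingPublisher(flaky_send)
p.enqueue({"event": "a"})
p.enqueue({"event": "b"})
p.publish_attempt()   # fails, events retained
p.publish_attempt()   # succeeds
print(sent)           # [{'event': 'a'}, {'event': 'b'}]
```

This is why a temporarily unavailable log stream causes transient publisher-error events rather than lost data: the buffered events simply ride along until a publish attempt succeeds.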

cch1 commented 1 year ago

That sounds very resilient. I assumed that the creation of a new stream was the reaction to a 404, but mulog is patient!

Out of curiosity (I don't know CW Logs that well, only the metrics side), under what conditions is a new log stream created?

BrunoBonacci commented 1 year ago

From CW best practices:

> Creates a log stream for the specified log group. A log stream is a sequence of log events that originate from a single source, such as an application instance or a resource that is being monitored.

You have one log stream per application instance. So if you run the same application in multiple containers and on multiple hosts, you will have one log stream for each instance of the process. µ/log follows this recommendation and creates a log stream when you initialize your application with µ/start-publisher!.
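One common way to implement the "one stream per instance" recommendation is to derive the stream name from the host plus a random suffix. This is a hypothetical naming scheme for illustration only; μ/log generates its own random stream identifiers, like the `4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h` stream seen above:

```python
import socket
import secrets

def instance_stream_name(app: str) -> str:
    """Build a per-process log stream name: one unique stream for each
    running instance of the application, as CloudWatch recommends."""
    host = socket.gethostname()
    # A random suffix distinguishes multiple instances on the same host
    # (e.g. several containers) and restarts of the same process.
    suffix = secrets.token_hex(4)
    return f"{app}/{host}/{suffix}"

print(instance_stream_name("stt"))  # e.g. "stt/my-host/1f3a9c02"
```

Because each process creates its own stream at startup, events from different instances never interleave within a stream, which is exactly the property the CloudWatch guidance is after.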