µ/log uses itself to log internal errors. This one is to notify you that the publisher failed to publish the payload.
This could happen for various reasons, most commonly because the :transform function threw an exception. The error payload you posted is:

    {:rs {:cognitect.anomalies/category :cognitect.anomalies/not-found,
          :cognitect.anomalies/message "logs.us-east-1.amazonaws.com"},
     :group-name "stt",
     :stream-name "4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h"}
The above error seems to indicate that either the group stt or the stream 4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h was not present in the us-east-1 region (is that your region?).
The thing that puzzles me is that you see this error during the publishing phase. When you initialize the publisher, it creates the log stream, so I would have expected this error during initialization (start-publisher!) rather than while events are being published.
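For reference, here is a minimal sketch of that split, assuming the standard :cloudwatch publisher from the mulog-cloudwatch module; the namespace, event name, and version are made up for illustration, and the group name is the one from your error:

```clojure
(ns my.app
  (:require [com.brunobonacci.mulog :as u]))

;; The CloudWatch Logs stream for this process is created here, during
;; start-publisher!, so a missing group/stream would normally surface
;; at this point rather than later.
(def stop-publisher
  (u/start-publisher! {:type :cloudwatch :group-name "stt"}))

;; Later calls only enqueue events in memory; a background thread
;; publishes them to the stream that was created above.
(u/log ::app-started :version "0.1.0")
```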
To get to the bottom of this, please check that:

- the region is us-east-1 (just to make sure it is not trying to connect to a different region for some reason)
- the stt group is created
- the stream 4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h does not get deleted while the process is trying to push more logs (see the sketch below)
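Here is a rough sketch of those checks using Cognitect's aws-api (assuming the com.cognitect.aws/logs library is on the classpath; the namespace name is made up, and the group/stream names are the ones from your error):

```clojure
(ns my.app.check
  (:require [cognitect.aws.client.api :as aws]))

;; Client pinned to us-east-1, so we know exactly which region we are hitting.
(def logs (aws/client {:api :logs :region "us-east-1"}))

;; Does the "stt" group exist in us-east-1?
(aws/invoke logs {:op :DescribeLogGroups
                  :request {:logGroupNamePrefix "stt"}})

;; Is the stream still there while the process keeps publishing?
(aws/invoke logs {:op :DescribeLogStreams
                  :request {:logGroupName        "stt"
                            :logStreamNamePrefix "4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h"}})
```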
The anomaly message is logs.us-east-1.amazonaws.com, which seems to suggest a DNS problem. Are you running in a container management system like Kubernetes? If so, what base image are you using? Maybe Alpine? If so, are you experiencing random network errors from time to time?
If so, there is a well-documented issue with Alpine, which uses the musl standard library instead of glibc. The issue has to do with how musl handles DNS requests. If you are experiencing other application errors, consider this as a possible source. I hope this helps.
The stt group was created long before this error appeared. Interestingly, the error message I observed contains an ex-info with a stream attribute of 4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h and a reason attribute that indicates (via the Cognitect AWS anomaly) that this stream does not exist. Yet this very message is published in that stream! Is it possible that there is a race condition where the stream does not quite yet exist for publishing (provoking the error) and then moments later it does exist and can receive the internal mulog error?
This screenshot shows that the log stream that was reported as unavailable was available milliseconds before the reported error (it holds the st.system.params/SSM-parameters-read event on line 21 as the first event in the stream). But the only other events in that stream are the internal mulog errors; perhaps mulog created a new stream as a result of the error condition.
That type of race condition is not possible in µ/log. In order to use a publisher, the publisher needs to be initialized, and the log stream is created during that initialization (before the publisher is added and starts receiving events).
There is one more scenario that would explain everything you described and the screenshot you sent me. Most of the AWS APIs are eventually consistent, meaning that after the creation of a new resource it could take a few seconds before it is ready to be used. I see this all the time during automated integration testing: if during a test I create a bucket, a DynamoDB table, etc. and then try to use it immediately, I often get a 404 (or other errors). In the test environment I have to add routine checks to verify that the resource I want to use is available and ready to use.
Here it could be the same. Usually the publisher is initialized when the process starts up. During the publisher initialization the log stream is created, and if you immediately try to use it while AWS happens to be having a bad day or is under load, it could take a few seconds until the new resource is fully available.
I would suggest keeping an eye on it; if the error is what I described, it should self-heal after a few seconds.
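To illustrate the kind of readiness check I mean (this is not part of µ/log; it is a sketch using Cognitect's aws-api, and the helper names are made up), something like this polls until a freshly created stream is actually visible:

```clojure
(ns my.app.ready
  (:require [cognitect.aws.client.api :as aws]))

(defn stream-visible?
  "True when `stream` shows up under `group` in CloudWatch Logs."
  [logs group stream]
  (->> (aws/invoke logs {:op :DescribeLogStreams
                         :request {:logGroupName group
                                   :logStreamNamePrefix stream}})
       :logStreams
       (some #(= stream (:logStreamName %)))
       boolean))

(defn wait-for-stream!
  "Polls every `wait-ms` until the stream is visible, giving up after `attempts`."
  [logs group stream {:keys [attempts wait-ms] :or {attempts 20 wait-ms 500}}]
  (loop [n attempts]
    (cond
      (stream-visible? logs group stream) :ready
      (zero? n)                           :timed-out
      :else (do (Thread/sleep wait-ms)
                (recur (dec n))))))

;; example use (names taken from the error above):
;; (wait-for-stream! (aws/client {:api :logs :region "us-east-1"})
;;                   "stt" "4ribfKUjxBzT06vYyNvmDrmnW4j4d_7h" {})
```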
> I would suggest keeping an eye on it; if the error is what I described, it should self-heal after a few seconds.

Well, it did heal! It apparently healed by immediately creating another log stream.
But that raises two questions. First, could mulog thrash by creating a stream, writing to it, failing to write (because of eventual consistency), creating a new log stream, failing to write, and so on? In other words, should mulog have a backoff timer to prevent such thrashing?
My other concern is the loss of logged events, but I'm not sure how mulog recovers from a failure to write a message in this case (I'm aware that mulog drops excessive log messages, but this is different). So maybe it's a problem, maybe not. But I worry about it.
It doesn't create another log stream; it just waits for the one it created to become available. No DDoS, no extra log streams created, no events lost in the meantime.
This is what happens:
Since µ/log received a 200 on the creation of the log stream, it expects the log stream to eventually be there. During the whole time it keeps buffering events in memory, and at regular intervals it tries to publish the content of the buffer. If the publish attempt fails, it just keeps buffering and retries later. You can find out more about the internals here: https://cljdoc.org/d/com.brunobonacci/mulog/0.9.0/doc/%CE%BC-log-internals
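As a rough illustration of that buffer-and-retry behaviour (this is not µ/log's actual implementation; the namespace, the helper names, and the publish call are placeholders):

```clojure
(ns my.app.buffer)

;; in-memory buffer of pending events
(def buffer (atom []))

(defn enqueue!
  "Called on every log event; only appends to the in-memory buffer."
  [event]
  (swap! buffer conj event))

(defn try-publish!
  "Placeholder for the real publish call (e.g. PutLogEvents).
   Returns true on success, false on failure."
  [events]
  (try
    (println "publishing" (count events) "events")
    true
    (catch Exception _ false)))

(defn start-flush-loop!
  "At every `interval-ms` tick, try to publish the buffered events.
   On failure nothing is dropped; the same events are retried on the next tick."
  [interval-ms]
  (future
    (loop []
      (Thread/sleep interval-ms)
      (let [events @buffer]
        (when (and (seq events) (try-publish! events))
          ;; remove only the events that were just published;
          ;; anything enqueued in the meantime stays in the buffer
          (swap! buffer #(vec (drop (count events) %)))))
      (recur))))

;; (start-flush-loop! 5000)
```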
That sounds very resilient. I assumed that the creation of a new stream was the reaction to a 404, but mulog is patient!
Out of curiosity (I don't know CW Logs that well, only the metrics side), under what conditions is a new log stream created?
From the CloudWatch Logs documentation (CreateLogStream):

> Creates a log stream for the specified log group. A log stream is a sequence of log events that originate from a single source, such as an application instance or a resource that is being monitored.

You have one log stream per application instance. So if you run the same application in multiple containers and on multiple hosts, you will have one log stream for each instance of the process.
µ/log follows this recommendation and creates a log stream when you initialize your application with µ/start-publisher!.
I have recently enabled the CloudWatch Logs publisher. From time to time I see a spurt of logged errors like this one: