Closed: handerpeder closed this issue 2 years ago
Hi @handerpeder,
The underlying problem is that you are sending Elasticsearch the same field name with two different data types. This is essentially an Elasticsearch limitation which I can't do anything about.
To limit the issue, the publisher performs field-name mangling so that in most common cases you don't get indexing errors. However, in this case, from Elasticsearch's point of view, you are sending arrays of different types. Elasticsearch will only accept records matching the index mapping, whether you defined it explicitly or Elasticsearch deduced it implicitly while indexing the first record of that type.
I would recommend using two different field names; you can do this by changing the entries in the code or by adding a custom transformation function to the Elasticsearch publisher.
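As a rough sketch of the second option, assuming mulog's publisher-level `:transform` option (a function over the sequence of events about to be published); the field names `:val`/`:val-str` and the URL are illustrative, not taken from the original report:

```clojure
(require '[com.brunobonacci.mulog :as u])

;; Sketch only: move string arrays into a separate field so that
;; :val keeps a single data type from Elasticsearch's point of view.
(u/start-publisher!
  {:type      :elasticsearch
   :url       "http://localhost:9200/"   ;; illustrative URL
   :transform (fn [events]
                (map (fn [{:keys [val] :as e}]
                       (if (and (sequential? val) (every? string? val))
                         (-> e (assoc :val-str val) (dissoc :val))
                         e))
                     events))})
```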
Expected: 1 publisher error for the bad record
Every time the publisher attempts to post the records and gets a failure, it generates a publisher-error event. Eventually the record will be "pushed out" of the circular buffer, as described in the mulog internals, and the publisher will recover normal operation. Unfortunately, during this time you will receive an error every few seconds (I will probably improve this in mulog-0.10.0).
Actual: a cascade of errors which eventually brings mulog down.
mulog publishers are self-healing and, in cases like this, they will eventually recover. It might take a long time; this depends on the size of the publisher's buffer and the rate of events. I'm not aware of any way in which mulog can be brought down; if you are experiencing crashes caused by mulog, please do let me know.
I'm not sure I can do anything here; please do let me know if there is anything else or if I misunderstood the problem you are facing.
mulog publishers are self-healing and, in cases like this, they will eventually recover.
I think this is not happening. After experiencing an issue with a bad record (I used an index field-type error as an example), mulog eventually seems to be brought down.
My ingest count looks like this:
And if we filter just by publisher errors:
So there are a lot of errors, then mulog stops publishing.
I've attached an example error. It seems mulog wrongly marks successful records as not sent and attempts to retransmit them. I think the steps I've marked out above allow you to reproduce this.
I'm also on the mulog clojurians channel if you prefer to chat there.
At your level of event rate (~500 events/h) it will take ~40h to drop the bad record and recover. With a smaller buffer you should see it recovering sooner.
Thanks for the error file, it is super useful.
The error returned by Elasticsearch is:

```clojure
{:create
 {:_index ".ds-logs-planner-2022.06.25-000002",
  :_id "4jy8FRdiVe_NFDZheSwG79BOKwnBzf2c",
  :status 409,
  :error {:type "version_conflict_engine_exception",
          :reason "[4jy8FRdiVe_NFDZheSwG79BOKwnBzf2c]: version conflict, document already exists (current version [1])",
          :index_uuid "n-QjNG-GSdqJAWR3u1ET2Q",
          :shard "0",
          :index ".ds-logs-planner-2022.06.25-000002"}}}
```
From this I can see that you are using Data Streams. The strange (and unexpected) thing is the first line, `:create`. This should correspond to the action sent by the client, and `:_id` is the `:mulog/trace-id`. We use this to ensure idempotency.
Now, in order to ensure that no duplicate events are produced in case of a failure during publishing, mulog sends the `:_id` along with every record, so that a re-transmission simply updates the same record.
The action sent by the publisher is `index`, so I'm surprised that Elasticsearch thinks we sent `create`, as `create` is not idempotent. I will investigate this further.
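For context, a sketch of the two bulk-action headers involved (an EDN rendering of what goes to the Elasticsearch `_bulk` API; the index name and id here are made up, not from the report). With `create`, re-sending a document whose `:_id` already exists is rejected with a 409 `version_conflict_engine_exception`, exactly as in the error above; with `index`, the same re-transmission overwrites the document, so retries are idempotent.

```clojure
;; create: a second attempt with the same :_id fails with a
;; 409 version_conflict_engine_exception
{:create {:_index "my-index" :_id "<:mulog/trace-id>"}}

;; index: a second attempt with the same :_id overwrites the
;; existing document, so re-transmissions are idempotent
{:index {:_index "my-index" :_id "<:mulog/trace-id>"}}
```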
Looking at the source code, the publisher sends `:create`. Looking at the code and the Elasticsearch documentation, it should be possible to just use the `index` action as well.
Could you please update your publisher configuration as follows:

```clojure
{:type          :elasticsearch
 :index-pattern "'my-stream'"   ;; SINGLE QUOTES ARE IMPORTANT
 :url           "https://my.elk"
 :http-opts     opts}
```
The idea is to trick the publisher into using the `index` action rather than `create`, which is not idempotent.
Emitted the following events:

```clojure
(doseq [_ (range 10)]
  (u/log :my/event :val [1 2 3]))

(doseq [_ (range 5)]
  (u/log :my/event :val ["some" "text"]))

(doseq [_ (range 50)]
  (u/log :my/event :val [1 2 3]))
```
Then sampled a document after 3 minutes.
Using `:data-stream`: stream_error.txt
Using `:index-pattern`: index_error.txt
First using `:data-stream`, then restarted the REPL and used `:index-pattern`.
I get the same number of error records using both: ~12 records/min containing errors. This seems consistent with your explanation of the intended behavior.
Also, as you predicted, there are no `create` errors when using `:index-pattern`.
That's fantastic, I'll make a fix and release it. Thanks for trying things out!
Thank you for your help.
The fix will be released shortly with `mulog-elasticsearch` 0.9.0.
Continuation of issue originally reported as #91
Repro:
Start a multi publisher
Post a bad record
Expected: 1 publisher error for the bad record.
Actual: a cascade of errors which eventually brings mulog down.
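The repro steps above can be sketched as follows (the publisher configuration values and event fields are illustrative, not from the original report):

```clojure
(require '[com.brunobonacci.mulog :as u])

;; 1. start a multi publisher
(def stop-publisher
  (u/start-publisher!
    {:type :multi
     :publishers [{:type :console}
                  {:type :elasticsearch
                   :url  "http://localhost:9200/"}]}))   ;; illustrative URL

;; 2. post records whose :val field conflicts with the mapping the
;;    first record implicitly created (numeric array vs string array)
(u/log :my/event :val [1 2 3])
(u/log :my/event :val ["some" "text"])   ;; the "bad" record
```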