theblazehen opened this issue 6 years ago
I assume the problem appears on playback, not on recording. Is that right? The first error message in the output you quote belongs to libcurl. So, there's probably some communication problem between libcurl and Elasticsearch. Could it be that Elasticsearch is closing the connection for some reason? Could you perhaps capture the playback traffic and see what's actually going on?
Ah, yes. At the end it looks like Elasticsearch duplicated the last message and then closed the connection: the last two results both have `"sort":[676]`, yet their `_id` fields differ. In the earlier responses Elasticsearch delivers the 10 results requested by the query (`size=10`), with the sort value increasing by 1 each time, but for the last query the sequence goes 671, 672, 673, 674, 674, 675, 675, 676, 676.
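That duplicated tail (674, 674, 675, 675, 676, 676) can be spotted mechanically rather than by eye. A minimal sketch (Python; the function name is just illustrative) that flags sort values appearing more than once in a result stream:

```python
def find_repeats(sort_values):
    """Return the sort values that appear more than once, in order of
    first repetition. A healthy response stream should yield []."""
    seen = set()
    repeats = []
    for v in sort_values:
        if v in seen and v not in repeats:
            repeats.append(v)
        seen.add(v)
    return repeats

# The tail of the last query from the trace:
print(find_repeats([671, 672, 673, 674, 674, 675, 675, 676, 676]))
# [674, 675, 676]
```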
Could this possibly be a bug in Elasticsearch? I'm running 2.4.6; what's the recommended version? I'm attaching the pcap and will test on some newer versions as well. dump.pcap.gz
Thank you for the trace! Yes, `_id` indeed differs. So maybe Elasticsearch indexes those messages twice. Could you perhaps take a look at the rsyslog<->Elasticsearch conversation on the network and see whether rsyslog retransmits the messages, and whether some Elasticsearch response is causing the retransmission?
So, this is interesting. I'll investigate this further. Running ES 5 now, but I still get the errors:

```
Failure when receiving data from the peer
Failed reading the source at message #120
```

The error occurs at a different message each time; I've had it at message 120, 200, 400, and 470.
I checked the pcap where I sent data from rsyslog -> Elasticsearch, and there appear to be no duplicates around those message IDs; however, I do have some duplicates at higher IDs:
```
# curl -s 'scriberyxpack:9201/tlog-rsyslog/_search?size=10000&q=rec:f0522d4237e345d0a313b7689d10779e-351f-cc5f8' | jq . | grep '"id"' | cut -d '"' -f 3 | sed 's/..\(.*\)./\1/' | sort -n | uniq -d
18886
18887
18888
18889
18890
18891
```
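The pipeline above extracts the id field from each hit, sorts numerically, and prints only the values that occur more than once. The same check can be run on the parsed JSON directly; a sketch (Python; the assumption that the id lives at `_source.id` is mine, based on the `grep '"id"'` step above):

```python
from collections import Counter

def duplicate_ids(search_response):
    """Given a parsed Elasticsearch _search response body, return the
    message ids that occur more than once (sorted)."""
    ids = [hit["_source"]["id"] for hit in search_response["hits"]["hits"]]
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)

# Tiny fabricated response, just to show the expected shape:
resp = {"hits": {"hits": [
    {"_source": {"id": 18886}},
    {"_source": {"id": 18886}},
    {"_source": {"id": 18887}},
]}}
print(duplicate_ids(resp))  # [18886]
```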
I wonder if the disconnection is simply something Elasticsearch does after a while to avoid keeping connections open too long; if so, perhaps we should handle it in tlog, e.g. by reconnecting. We need to look at the Elasticsearch documentation and logs (if any) for clues.
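If the server (or an intermediary) does drop idle connections, the reader could treat a transport error as retriable and reconnect instead of aborting playback. A rough sketch of that idea (Python; `connect` and `fetch` are hypothetical stand-ins for the tlog/libcurl calls, not real tlog API):

```python
import time

def fetch_with_reconnect(connect, fetch, retries=3, delay=1.0):
    """Call fetch(conn); on a connection error, open a fresh connection
    and retry a bounded number of times before giving up."""
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return fetch(conn)
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries; surface the error to the caller
            time.sleep(delay)
            conn = connect()  # the peer closed on us; reconnect and retry
```

The key design point is bounding the retries: a persistent server-side failure should still surface as an error rather than an infinite reconnect loop.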
The duplicate ID situation is indeed interesting. I would still strongly suspect rsyslog, but perhaps something is wrong with Elasticsearch too. I have no time to investigate this at the moment, but would welcome your research and patches. Thank you.
Recorded the `bb` demo from aalib, and approximately 1 minute in I get the error during playback. Doing a search for that id works and returns results.
I've tested with bulkmode on and off in the rsyslog config. The server is RHEL 7, configured with the ansible-tlog playbook. Could there be issues with the large amount of output being recorded, or perhaps with certain escape sequences?
I ran `bb` with `bb -driver slang`. Hopefully that's enough information to reproduce the issue? Or is there anything else I can provide to help?