theblazehen opened this issue 6 years ago
I assume the problem appears on playback, not on recording. Is that right? The first error message in the output you quote belongs to libcurl. So, there's probably some communication problem between libcurl and Elasticsearch. Could it be that Elasticsearch is closing the connection for some reason? Could you perhaps capture the playback traffic and see what's actually going on?
Ah, yes. At the end it looks like Elasticsearch duplicated the last message and then closed the connection: the last two results both have `"sort":[676]`, yet their `_id` fields differ. In the earlier responses Elasticsearch delivers the 10 results requested by the query (`size=10`), with the sort value increasing by 1 each time, but for the last query the sequence goes 671, 672, 673, 674, 674, 675, 675, 676, 676.
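That duplicated tail (674, 674, 675, 675, 676, 676) can be spotted mechanically rather than by eye. A minimal sketch (Python; the function name is just illustrative) that flags sort values appearing more than once in a result stream:

```python
def find_repeats(sort_values):
    """Return the sort values that appear more than once, in order of
    first repetition. A healthy response stream should yield []."""
    seen = set()
    repeats = []
    for v in sort_values:
        if v in seen and v not in repeats:
            repeats.append(v)
        seen.add(v)
    return repeats

# The tail of the last query from the trace:
print(find_repeats([671, 672, 673, 674, 674, 675, 675, 676, 676]))
# [674, 675, 676]
```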
Could this possibly be a bug in Elasticsearch? I'm running 2.4.6; what's the recommended version? I'm attaching the pcap and will test on some newer versions as well. dump.pcap.gz
Thank you for the trace! Yes, `_id` indeed differs. So maybe Elasticsearch indexes those messages twice. Could you perhaps take a look at the rsyslog<->Elasticsearch conversation on the network and see whether rsyslog retransmits the messages, and whether some Elasticsearch response is causing the retransmission?
So, this is interesting. I'll investigate this further. Running ES 5 now, but I still get the errors:

```
Failure when receiving data from the peer
Failed reading the source at message #120
```

The error occurs at a different message each time; I've had it at message 120, 200, 400, and 470.
I checked the pcap where I sent data from rsyslog -> Elasticsearch, and there appear to be no duplicates around those message IDs; however, I do have some duplicates at higher IDs:
```
# curl -s 'scriberyxpack:9201/tlog-rsyslog/_search?size=10000&q=rec:f0522d4237e345d0a313b7689d10779e-351f-cc5f8' | jq . | grep '"id"' | cut -d '"' -f 3 | sed 's/..\(.*\)./\1/' | sort -n | uniq -d
18886
18887
18888
18889
18890
18891
```
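The pipeline above extracts the id field from each hit, sorts numerically, and prints only the values that occur more than once. The same check can be run on the parsed JSON directly; a sketch (Python; the assumption that the id lives at `_source.id` is mine, based on the `grep '"id"'` step above):

```python
from collections import Counter

def duplicate_ids(search_response):
    """Given a parsed Elasticsearch _search response body, return the
    message ids that occur more than once (sorted)."""
    ids = [hit["_source"]["id"] for hit in search_response["hits"]["hits"]]
    counts = Counter(ids)
    return sorted(i for i, n in counts.items() if n > 1)

# Tiny fabricated response, just to show the expected shape:
resp = {"hits": {"hits": [
    {"_source": {"id": 18886}},
    {"_source": {"id": 18886}},
    {"_source": {"id": 18887}},
]}}
print(duplicate_ids(resp))  # [18886]
```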
I wonder if the disconnection is simply something Elasticsearch does after a while to avoid keeping connections open too long; if so, perhaps we should handle it in tlog, e.g. by reconnecting. We need to look at the Elasticsearch documentation and logs (if any) for clues.
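If the server (or an intermediary) does drop idle connections, the reader could treat a transport error as retriable and reconnect instead of aborting playback. A rough sketch of that idea (Python; `connect` and `fetch` are hypothetical stand-ins for the tlog/libcurl calls, not real tlog API):

```python
import time

def fetch_with_reconnect(connect, fetch, retries=3, delay=1.0):
    """Call fetch(conn); on a connection error, open a fresh connection
    and retry a bounded number of times before giving up."""
    conn = connect()
    for attempt in range(retries + 1):
        try:
            return fetch(conn)
        except ConnectionError:
            if attempt == retries:
                raise  # out of retries; surface the error to the caller
            time.sleep(delay)
            conn = connect()  # the peer closed on us; reconnect and retry
```

The key design point is bounding the retries: a persistent server-side failure should still surface as an error rather than an infinite reconnect loop.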
The duplicate ID situation is indeed interesting. I would still strongly suspect rsyslog, but perhaps something is wrong with Elasticsearch too. I have no time to investigate this at the moment, but would welcome your research and patches. Thank you.
Recorded the `bb` demo from aalib, and approximately 1 minute in I get the error during playback. Doing a search for that id works and returns results.
I've tested with bulkmode on and off in the rsyslog config. The server is RHEL 7, configured with the ansible-tlog playbook. Could there be issues with the large amount of output being recorded, or perhaps with certain escape sequences?
I ran `bb` with `bb -driver slang`. Hopefully that's enough information to reproduce the issue? Or is there anything else I can provide to help?