elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Persistent Queues hang when ES server throws a MapperParsingException #8497

Open · randude opened this issue 7 years ago

randude commented 7 years ago

@original-brownbear
Version: Logstash 6 RC1
Operating System: Ubuntu 16.04

I was able to reproduce this and work around it quite easily, but it took me a very long time to find. One pipeline receives JSON-formatted strings and sends them to ES.

We have a log that is sent only from one machine, which explains why this didn't happen on other machines.

In that log, we send a large number, larger than 2^63, which ES can't parse:

[2017-10-16T16:59:24,830][DEBUG][o.e.a.b.TransportShardBulkAction] [vaultive-2017.10.16][2] failed to execute bulk item
org.elasticsearch.index.mapper.MapperParsingException: failed to parse [sync_old]
    at org.elasticsearch.index.mapper.FieldMapper.parse(FieldMapper.java:291) ~[elasticsearch-6.0.0-beta2.jar:6.0

This causes the PQ to hang immediately. I think it also messed up other pipelines. When I added a filter that drops logs with such a large value before they reach the output, everything worked. If I send this field as a string, it also works.

It should be rather easy to reproduce. Please let me know if you need more information.
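
For anyone trying to reproduce this, a minimal pipeline along these lines should be close to the setup described above (the port and hosts below are placeholders, and the index pattern just mirrors the vaultive-2017.10.16 index seen in the log):

```
# Repro sketch, assuming queue.type: persisted is set in logstash.yml.
# Accepts newline-delimited JSON over TCP and ships each event to Elasticsearch.
input {
  tcp {
    port  => 5044            # placeholder port
    codec => json_lines
  }
}

output {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "vaultive-%{+YYYY.MM.dd}"
  }
}
```

Feeding it a line such as {"sync_old": 18446744073709551616} (2^64, which no longer fits a signed 64-bit long) should then produce the failed to parse [sync_old] error above if the field is mapped as a long.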

original-brownbear commented 7 years ago

@randude thanks, looks like we got this reproduced at last then :) Will try to get to this today or tomorrow :)

original-brownbear commented 7 years ago

@randude do you have the DLQ activated? This looks like exactly the kind of event that should just go to the DLQ imo; if it doesn't, it looks like the bug is there.
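
For context, the DLQ is off by default; a minimal logstash.yml sketch to turn it on (the path is only an example) would be:

```
# logstash.yml sketch: keep the persistent queue, but let events that
# Elasticsearch rejects with a mapping error go to the dead letter queue
# instead of blocking the pipeline. The path below is a placeholder.
queue.type: persisted
dead_letter_queue.enable: true
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue
```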

randude commented 7 years ago

@original-brownbear No, I have this setting in logstash.yml: dead_letter_queue.enable: false

Were you able to reproduce it?

randude commented 7 years ago

json.txt

@original-brownbear It seems to happen ONLY after it gets 5-7 log lines like the one in the file I uploaded. The ES exception was just a coincidence. I have no idea what is so special about this JSON input. I thought the sync_old field was at fault when it's bigger than 2^63, but this isn't the case.

Some facts:

  1. When writing a filter like if [sync_old] { drop {} }, everything works. When I remove this 'if', it immediately stops outputting data (see the filter sketch after this list).

  2. I don't have enough tools to debug this, but it is obviously something related to the PQ, because when it is disabled everything works just fine, even with the "suspicious" log.

  3. I tried outputting both to a file and to ES; the file output gets only the suspicious logs. After 5-7 lines, both outputs stopped writing data. Example: output { if [sync_old] { file {} } elasticsearch {} }

  4. When I defined it as an if/else, everything was working fine and both outputs kept writing:

output { if [sync_old] { file {} } else { elasticsearch {} } }
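
Combining fact 1 with the earlier observation that sending the field as a string works, a gentler workaround than dropping the whole event might be to coerce the field to a string in a filter. This is only a sketch, not the reporter's config, and it assumes the event survives the queue and reaches the filter stage (which the drop workaround in fact 1 suggests it does):

```
filter {
  # Workaround sketch: if sync_old is present, convert it to a string so
  # neither the queue serialization nor the Elasticsearch long mapping has
  # to handle a value larger than 2^63 - 1.
  if [sync_old] {
    mutate {
      convert => { "sync_old" => "string" }
    }
  }
}
```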

randude commented 7 years ago

@original-brownbear Attached is elasticsearch_template.txt; this is the template that refers to the sync_ fields.
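
The attachment isn't reproduced in this thread, but purely as an illustration of the shape involved, a template that maps sync_old as a long (which tops out at 2^63 - 1) would account for the failed to parse [sync_old] error, e.g. something like:

```
{
  "index_patterns": ["vaultive-*"],
  "mappings": {
    "doc": {
      "properties": {
        "sync_old": { "type": "long" }
      }
    }
  }
}
```

The pattern and structure above are guesses for illustration only; the real definitions are in elasticsearch_template.txt.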

original-brownbear commented 7 years ago

@randude sorry, I was a little busy the last few days and could only get back to this now.

So if sync_old is to blame here, that should be easy to track down :) This is what I find interesting:

we send a large number, larger than 2^63

This shouldn't be possible with the PQ in the version you're running; there is a bug in the PQ in that release that breaks the handling of large-number serialization on the LS end. Could it be that sync_old exceeding 2^63 breaks the PQ before we even get to the ES output? (In that case you should have some error about deserialization of byte[] or [B in your logs.)

Or is sync_old dynamically created by a filter in your pipeline?

randude commented 7 years ago

sync_old is not created by Logstash's filters; it's received from a different process via TCP. I didn't notice any error in logstash.log. Is it written only at debug level?

Will the bug you mentioned be solved in the next version?

original-brownbear commented 7 years ago

@randude yes, we identified the issue in the JrJackson/Jackson dependency and it'll be upgraded to a working version for all upcoming releases :)

randude commented 7 years ago

Awesome, I look forward to upgrading. I really hope it will close this issue once and for all.