elastic / logstash

Logstash converts integer value in scientific notation to float, causing mapping error in Elasticsearch #12398

Closed: GrahamHannington closed this issue 3 years ago

GrahamHannington commented 3 years ago

I originally posted this issue as a topic with the same title on the Elastic discussion forum.

Given the following input (these are snippets from the input JSON Lines):

"MEMLIMIT Size":0
...
"MEMLIMIT Size":0.8590E+10

Logstash (I'm using 7.9.3) outputs:

"MEMLIMIT Size" => 0

"MEMLIMIT Size" => 8590000000.0

I'm fine with the 0 output value.

However, I'm not fine with the trailing .0 on 8590000000.0, because it leads to the mapping error in Elasticsearch mentioned in the title, and because:

I do not want to configure Logstash (or Elasticsearch, for that matter) to perform any special processing on the field "MEMLIMIT Size", because this is just one example of such a field. Other fields might also have integer values represented in scientific notation. I want to continue to use Elasticsearch's dynamic mapping based on the serialized input values.
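For reference, here is a minimal sketch of reproducing the conversion outside a full pipeline, assuming (as noted in a later comment below) that Logstash parses JSON through its LogStash::Json wrapper around Jackson. The field name is just my example, and this is run in a JRuby session with logstash-core on the load path:

# Assumption: LogStash::Json is the logstash-core helper that delegates to Jackson.
require "logstash/json"

LogStash::Json.load('{"MEMLIMIT Size": 0}')
# => {"MEMLIMIT Size"=>0}

LogStash::Json.load('{"MEMLIMIT Size": 0.8590E+10}')
# => {"MEMLIMIT Size"=>8590000000.0}   # the integer value comes back as a Float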

Based on the following comment by Badger in the discussion forum:

logstash uses a third-party library to parse JSON. Jackson, I believe. So the bug, if it is a bug, would be in the third-party library.

I created an issue for the jackson-databind project: "Large integer in scientific notation converted to float (with trailing .0)".

A developer responded and—understandably, from their perspective—asked for a Jackson-only test case. At which point, I decided to open this issue, because:

(a) With apologies to Badger, I'm not 100% certain that this issue is caused by Jackson (because, to my discredit, I have not done that research myself).

(b) While this issue is important to me, I'm not sure my time is best spent separately installing Jackson and developing that test case myself.

(c) I thought that you—Logstash developers; in a wider sense, Elastic Stack developers—would be interested in this issue, because it's the combination of parsing of values by Logstash (Jackson) and dynamic mapping by Elasticsearch that is resulting in an error.

Finally, to my surprise, I can't find any existing Logstash issues that match this specific situation. Have I overlooked them? Am I the first user to run into problems forwarding large integers in scientific notation in JSON, with dynamic mapping in Elasticsearch? This doesn't seem like such an edge case to me, but perhaps other users eschew scientific notation and just pump out the zeros?

Your thoughts?

In case it occurs to you: I'm not directly responsible for the formatting of that 0.8590E+10 input value. I've also tried specifying the input value 859E+7, but I get the same output value as before, with the unwanted trailing .0. (I understand that, while 0.8590E+10 is in scientific notation, 859E+7 isn't, because 859 is greater than 9; however, both values are allowed in JSON.)

GrahamHannington commented 3 years ago

Why I created this issue here, in this project

Before Badger pointed out to me Logstash's dependency on Jackson, I was considering creating an issue for the Logstash JSON plugin, because, as a Logstash user, that's where the problem I observed (trailing .0 in the output values) seemed to be happening.

However (and with apologies if I'm wrong about any of this), when I looked at the logstash-filter-json source code, I noticed that it refers to (requires) the logstash/json module, which is defined in this "root" Logstash project, and which refers to Jackson.

Hence, I created this issue here, in the project that directly refers to Jackson, rather than in that subordinate plugin project. If I've done the wrong thing, please feel free to correct me, and tell me to re-create this issue in that other project, or elsewhere.

GrahamHannington commented 3 years ago

While I acknowledge that there are far more significant issues in the world today, I'm slightly surprised that there's been no response to this one.

To recap:

In my experience, a positive exponent is used to serialize large integers in a compact form, without a long trail of zeros. That is, in my experience—I acknowledge this is by no means always the case—numbers with a positive exponent are integers.

What's the solution?

I don't have a comprehensive answer to this question.

I have my use cases; other users will have theirs. Some users might like the current behavior for the same reason I don't (mapping in Elasticsearch).

I can say, though, that I would strongly prefer any solution to be based on values rather than keys (field names). I don't want to have to identify specific keys as always having integer values. This is because, in my use case, I am dealing with thousands of fields. I deliberately rely on Elasticsearch dynamic mapping to assign data types based on field values. I do not want to have to explicitly identify the data type of every field. That would be onerous, and introduce undesirable change control issues; new fields get added fairly frequently.

A possibility

One possible solution that occurs to me: only output a number with a decimal point if it is necessary to preserve the precision of the input value.

For example, there is no need for a decimal point when expanding the following numbers with exponents (expanded values shown in parentheses):

3E+10 (30000000000)
3.141E+3 (3141)
3.141E+4 (31410)

However, the following numbers do need a decimal point:

3.14159E+3 (3141.59)
3.141590E+4 (31415.90; note the trailing 0 to preserve the precision indicated by the original; I'm open to discussing this)

I will admit to being ignorant of the processing overhead involved in determining whether or not a decimal point is necessary vs always (as now) serializing such values with a decimal point.
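To make the idea concrete, here is a rough Ruby sketch of the check I have in mind. render_number is a hypothetical helper: it only illustrates the rule (and sidesteps the trailing-zero question above), and is not a claim about how Jackson or Logstash would implement it.

require "bigdecimal"

# Drop the decimal point when the parsed value has no fractional part;
# otherwise keep conventional decimal notation.
def render_number(text)
  value = BigDecimal(text)
  value.frac.zero? ? value.to_i.to_s : value.to_s("F")
end

render_number("3E+10")       # => "30000000000"
render_number("3.141E+3")    # => "3141"
render_number("0.8590E+10")  # => "8590000000"
render_number("3.14159E+3")  # => "3141.59"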

Thoughts, feedback, counterarguments welcome.

kares commented 3 years ago

Most libraries treat scientific notation as floating point values; otherwise (as explained on Jackson's tracker) you would need to check potentially every value manually. Treating the E format the same way regardless of the value is just common sense, and I am not sure LS starting to behave differently is a good approach.

Especially since there's an easy fix: if you want integers just send in integers:

>> JSON.parse '{ "foo": 0.8590E+10 }'
=> {"foo"=>8590000000.0}
>> JSON.parse '{ "foo": 8590000000 }'
=> {"foo"=>8590000000}

Or, if the input is out of your control, you could always "integerize", e.g. using a LS Ruby filter with code => ...: event.to_map.each { |key, val| event.set(key, val.to_i) if val.is_a?(Float) && val.to_i == val }
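Spelled out as a pipeline snippet, that might look something like the following sketch. It walks every top-level field of the event (nested fields are not handled) and uses event.to_hash in place of event.to_map:

filter {
  ruby {
    # "Integerize": any Float carrying no fractional part is written back
    # to the event as an Integer before output.
    code => '
      event.to_hash.each do |key, val|
        event.set(key, val.to_i) if val.is_a?(Float) && val.to_i == val
      end
    '
  }
}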

GrahamHannington commented 3 years ago

@kares wrote:

most libraries treat scientific notation as floating point values

Yes. I get that checking each value would be relatively expensive. And I understand that changing this behavior might break things; perhaps, for example, for the same reason that I'm complaining about it, but from a different angle: some Logstash users might be relying on such numbers being expanded to a float. I'd still prefer the option, though, because indiscriminately appending that trailing .0 amounts, as I wrote in that jackson-databind issue, to confecting information. That irks me wherever I see it, no matter how pragmatic I try to be.

I am not sure LS starting to behave differently is a good approach.

Yes. I can see your point of view. To clarify (I should have done this earlier): I'm not promoting a change to the current default behavior.

As you point out, there are alternatives that I can pursue without any change to Logstash.

if you want integers just send in integers

Yes. I've already opened an issue for this on the (intranet-based) tracking system for the application that creates this feed.

you could always "integerize" e.g. using a LS Ruby filter

Yes. That's the solution that Badger provided me in the original discussion forum. I'll be using that pending the resolution of the tracking issue for that app.

Thanks for your input, I appreciate it: it's sane, measured, pragmatic. I'm neither surprised nor much disappointed. I thought it was worth at least raising this issue for fellow users considering representing large integers as numbers with exponents in JSON.

anonimnor commented 2 years ago

This leaves Logstash, Beats, Elasticsearch, and Kibana with a broader problem: depending on the input, any value (i.e. any log value) might arrive as '100500' in one event and as a string such as 'myCode: 100500' in another, and indexing then fails because the value no longer fits the mapping.

It is left to the developer to decide which data fields to map and how, even though those fields can vary: a field that does not exist today may be added tomorrow. By the ELK logic, every field needs its own strictly matching mapping, which makes the stack awkward to work with.