FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0
316 stars 136 forks source link

CBOR parser fails for `nextTextValue()` case for some reason (`float` decoding) #372

Closed cowtowncoder closed 1 year ago

cowtowncoder commented 1 year ago

After seemingly unrelated change to jackson-core (see https://github.com/FasterXML/jackson-core/pull/983 ), one CBOR test case started failing in ParserNextXxxTest. It looks like END_ARRAY token comes from CBORParser.nextTextValue() as if maybe some initialization is missing, or ... something. But I think this is related to #347; I think it's not a new bug but just being exposed by some change in sequence of actions.

@here-abarany I was wondering if you could help here?

cowtowncoder commented 1 year ago

One oddity is that of where START_ARRAY comes from. But looking at _handleTaggedArray() I wonder if the issue is actually that we are now encoding BigDecimal from JSON (to retain accuracy) and it's a problem with BigDecimal handling ("bigfloat").

here-abarany commented 1 year ago

@cowtowncoder When stepping through the test case, it's invoking the BigDecimal overload and corresponding encoding as two integers. It seems like the underlying problem here is it doesn't understand that the value "0.5" can be represented exactly as a floating point number when parsing the input and forwarding the events.

cowtowncoder commented 1 year ago

@here-abarany That behavior is actually expected due to recent change I linked to: trying to figure out that 0.5 is exact representation (when reading from JSON) is not free and so copy will now use conservative logic to ensure values are not truncated.

To me it seems more like problem of not exposing BigDecimal ("bigfloat") value as floating-point number: it should not be exposed as an array.

Is there not a test for simple case of using CBORGenerator.writeNumber(bigDecimal) to write doc, then reading it back? That definitely is expected to work.

EDIT: in ParserNumbersTest there are basic tests for writing BigDecimal, testBigDecimalType() and testBigDecimalType2(). So at least for normal access it works.

cowtowncoder commented 1 year ago

@here-abarany Also: if I comment out nextTextValue() from CBORParser -- which will use naive implementation from base class -- test passes. So I am fairly confident there is just something in implementation of nextTextValue() that is missing or different. I think you fixed a few similar issues when implementing this feature.

If worst comes to worst we could of course just remove optimized nextTextValue(): I am not 100% sure how much performance benefit it gives (esp. since its usage is not common AFAIK). But before considering that maybe there's something simple that I didn't notice.

cowtowncoder commented 1 year ago

cc @pjfanning ^^^ this has discussion on issue uncovered by the change. I think it's not caused by jackson-core changes but exposes something missing wrt CBORParser.nextTextValue() handling of "bigfloat" value (BigDecimal encoding)

here-abarany commented 1 year ago

@cowtowncoder I believe that this may indeed be an issue with the parser not properly seeing the tag on parse. I can verify that it is reading the tag, but it doesn't appear to be considering it when reading the array.

There is the separate question of whether it's desirable to use the BigDecimal representation in the output file: it does preserve full precision, but if it's consumed by any other parser it requires it implements this extension. If you're explicitly writing a BigDecimal yourself you can maybe accept that responsibility of "if I write it I make sure my consumers can read it", but if it's automatic it could cause hard to track down problems when distributing the file.

cowtowncoder commented 1 year ago

@here-abarany I think whether BigDecimal should be used in specific case is use-case specific, but if it is, it really should work. In this case it is a side-effect of test set up and not necessarily intentional.

Still, I don't see a reason why CBOR backend could and should not handle this translation: it works for simple value, it works for POJO value. And I suspect it should work for "BigDecimal in Array" too.

And like I said, it DOES work if nextTextValue() call is either not used (test explicitly focuses on that), or its implementation defaults to using nextToken(). I don't think there's anything fundamentally problematic.

here-abarany commented 1 year ago

@cowtowncoder The problem is that nextTextValue() calls createChildArrayContext() rather than _handleTaggedArray(). It looks like other tagged types, such as byte strings, would be affected as well. From what I can tell this was an issue previous to my StringRef changes. nextTokenValue() is overall largely a duplicate and prone to these kinds of errors: I'd recommend removing it if that's an option.

here-abarany commented 1 year ago

I think whether BigDecimal should be used in specific case is use-case specific, but if it is, it really should work

I agree, this is a topic independent of the test failure which looks to be a bug. However, if the current behavior of BigDecimal would mean a typical use case of "pass a JsonParser to a CBORGenerator to stream-convert to CBOR" would convert floating point values into BigDecimal, that would be concerning for the portability of the resulting file.

cowtowncoder commented 1 year ago

Yeah... the idea for nextTextValue() is to allow optimized decoding for cases where someone wants hyper-optimization, knowing that -- for example -- certain property always has textual value matching name (or is for String array etc). It may be removed.

cowtowncoder commented 1 year ago

I think whether BigDecimal should be used in specific case is use-case specific, but if it is, it really should work

I agree, this is a topic independent of the test failure which looks to be a bug. However, if the current behavior of BigDecimal would mean a typical use case of "pass a JsonParser to a CBORGenerator to stream-convert to CBOR" would convert floating point values into BigDecimal, that would be concerning for the portability of the resulting file.

Right.

Damned if you, damned if you don't. Correctness, performance etc.

But note that there is no real conversion -- JSON has textual numbers, not double or BigDecimal. Question is that of which representation to use. And for most users, I think, correctness (not losing accuracy) is of big importance. This is the reason why it is often necessary to default to use of BigDecimal over lossy (from 10-based textual representation) binary floating-point representations.

Still, it may be necessary to offer some configurability.

here-abarany commented 1 year ago

I can at least give some insight as for our use case.

While the vast majority of our company as a whole is a Java-based ecosystem, most of the performance intensive code by my team at least is implemented with C++ native code. My initial CBOR work and testing was actually with a C++ implementation built on top of RapidJSON. My interest in getting these features in place for Jackson were so this could be adopted more broadly in our greater software ecosystem.

One of the big benefits of CBOR is being able to stream-convert between text JSON and CBOR easily. All of our usage of double fields (whether it's Java or C++) is as an actual double, so representing as a BigDecimal isn't of particular use to us either way and would invoke a performance penalty. However, even more concerning is if the default implementation of converting a JSON file to CBOR would write a BigDecimal encoded as an array, we would be unable to parse it in our C++ tools without extra modifications to support it.

I agree others may prefer the BigDecimal and avoid precision loss, but this is at least one real-world example to the contrary. It would certainly make our jobs a lot harder if there's no way to change the behavior. Whether it's better to handle this at the parser side (deciding whether to prefer creating a BigDecimal or a double) or on the generator side (converting a BigDecimal to a double if the extension isn't enabled), or what the default is, I can't say. I will say though that having the new behavior enabled by default may be more likely to cause regressions for at least CBOR, and potentially for other binary formats as well if the encoding changes.

here-abarany commented 1 year ago

It's also worth noting that many of our files contain a huge number of double values. This could result in a significant difference in file size if we were to do a stream convert JSON -> CBOR vs. serializing from an in-memory object representation to CBOR, where we'd be explicitly writing doubles, even though we would want the end result to be roughly the same. (this was a big motivation for adding the WRITE_MINIMAL_DOUBLES feature)

cowtowncoder commented 1 year ago

@here-abarany Ok. We'll have to think about this bit more... timing is both good and bad, wrt 2.15.0 release candidates. But I think I'll have to first fix the decoding problem here.

Also it'd be great if you could try out latest version (2.15.0-rc2 first; maybe snapshot then). I am not sure you'd be using code path changed in this particular case, for example: all explicit calls to write will use intended type (so double is not coerced into BigDecimal unless specific feature is enabled).

here-abarany commented 1 year ago

@cowtowncoder especially since this is so close to a release, maybe a more conservative approach is warranted to avoid minimal change in behavior by default? Based on the changes that were submitted, that would mean an opt-in feature for JsonGenerator to use the new code path for numbers. If you feel that preserving precision would be the more common desired use case, this could potentially be changed to be an opt-out feature for the 3.0 release, so those who would want to keep double precision numbers would have to explicitly disable it.

Currently I don't believe we have any real-world test cases with Jackson and CBOR at the moment, we have some initial work underway to prepare for integration but nothing concrete right now. I'm definitely keeping my eye on it and trying to push through adoption on our end.

cowtowncoder commented 1 year ago

This is a completely unexpected complication to me, unfortunately.

One thing that'd be great, first, is to see if your use is actually affected: it depends on whether you are using JsonParser.copyCurrentEvent() / -Structure() or not.

EDIT: sorry, should have read your comment first. No actual usage so cannot quite check.

cowtowncoder commented 1 year ago

@pjfanning Some food for thought. It's... not simple to figure this one out. I'd hate to be unable to fix a clear bug with buffering (where accuracy is lost). But since code is in JSON handling side, it cannot really be easily overridden on CBOR (well, I guess it can).... actually maybe that's the way, to refactor parts of copyCurrentEvent() to call separate copyCurrentFP() which is overridable.

here-abarany commented 1 year ago

We definitely won't be immediately impacted by this given that it isn't integrated yet, and our immediate concern will be configuring which JsonParser and JsonGenerator types will handle reading and writing of data with the internal object model. I've envisioned use cases such as storing data as CBOR at rest while dynamically converting to/from JSON if text is explicitly requested. These are mostly theoretical at this point, but does at least demonstrate a potential pitfall.

My biggest motivation for suggesting the more conservative approach is with such a widely used library, any change in behavior has the potential to cause regressions for clients. This is probably not an issue for text formats, though for the binary formats it could be more dangerous as it changes the fundamental representation of data. While I could see situations such as preferring a more compact representation for Smile being useful, if the primary concern is portability for CBOR files then having a copyCurrentFP() overridable function being a viable workaround, whether it simply restores the previous behavior for the immediate situation or has an option for more control. (though I suppose adding an option is the more dangerous option as it limits your flexibility for changing the interface later)

Another option is to provide an overload to JsonGenerator.copyCurrentEvent() that allows you to pass it options to control the behavior, allowing you to make that decision at the callsite without having to modify the options of the generator itself.

cowtowncoder commented 1 year ago

@here-abarany I really appreciate your feedback here. And one thing I may have forgotten to emphasize is that the goal of changes was NOT to change copying from parser to generator but to actually change copying to TokenBuffer (in jackson-databind) used for buffering content -- but it happens to use JsonGenerator as interface. So it is possible we may figure out bit different route. I hadn't really considered format conversion use case, although was aware of it being an obvious thing (although whether users use copyCurrentXxx() methods or roll their own I don't know -- probably both methods are used).

And like I said, had I realized scope of changes here, causing problems you point out (which are completely valid), I wouldn't have proceeded. And may revert changes from 2.15 even: but I first want to think things through, considering various aspects.

cowtowncoder commented 1 year ago

@here-abarany So, I went ahead and essentially reverted the change via:

https://github.com/FasterXML/jackson-core/issues/984

but also introduced a way to "opt-in" to exact copying sematnics with new copyCurrentEventExact() method. (there's still question of whether to also add copyCurrentStructureExact() but that's optional). This is partly as I realized that TokenBuffer fixes were not relying on changes to copyCurrentEvent().

Thank you once again for pointing out consequences of initial changes.

here-abarany commented 1 year ago

@cowtowncoder thanks! This approach looks good on my end.

cowtowncoder commented 1 year ago

I think this has been resolved, partly due to simplifying CBORParser.nextTextValue().