FasterXML / jackson-dataformats-binary

Uber-project for standard Jackson binary format backends: avro, cbor, ion, protobuf, smile
Apache License 2.0
316 stars 136 forks source link

ION - MappingIterator skip the first line #493

Closed loicmathieu closed 3 months ago

loicmathieu commented 6 months ago

When reading an ION file using a MappingIterator, the first line is omitted.

        // This file has 5 lines
        File testFile = resourceToFile("test.ion");

        ObjectMapper objectMapper = JacksonUtils.createIonObjectMapper();
        MappingIterator<Object> mappingIterator = objectMapper.readerFor(DEFAULT_TYPE_REFERENCE).readValues(new FileInputStream(testFile));

        // the first line is missing
        LongAdder count = new LongAdder();
        mappingIterator.forEachRemaining(obj -> {
            count.increment();
        });
        Assertions.assertEquals(5, count.sum()); // fails as count.sum() returns 4 and not 5 as the first line is missing

When reading in a more traditional approach the object mapper correctly read all 5 lines

Reproducer project, there is a test in it:
mapping-iterator.zip

yvrng commented 4 months ago

This only happens when rows are arrays, everything seems fine with objects:

cowtowncoder commented 4 months ago

Makes sense, state auto-detection wrt arrays is problematic as the first START_ARRAY can mean one of 2 things:

  1. Content is an Array of entries ("wrapper" array around all separate items to iterate over)
  2. Content is a sequence of Array items (no wrapper)

and so iterator has to choose one or the other. I think default is (1), meaning that for (2) behavior is incorrect.

In case (2), caller needs to create and initialize JsonParser to point to the START_ARRAY of the first item (skipping the "wrapper" START_ARRAY).

This is not Ion specific, I think, and is something that cannot be reliably resolved due to inherent ambiguity (and since JsonParser does not have look-ahead capability to check for nested START_ARRAYs, for example)

loicmathieu commented 3 months ago

@cowtowncoder do you mean that the MappingIterator implementation has an issue (for ION or more broadly) or that I use it the wrong way?

As your comment is about JsonParser which I don't use directly.

cowtowncoder commented 3 months ago

I mean that there is a fundamental ambiguity wrt token stream abstraction and data-binding that MappingIterator cannot solve, related to Java values for which serialization is Array. So whereas code you use works perfectly well for something represented as Objects (JSON or Ion), it takes a guess on seeing JsonToken.START_ARRAY, regarding how to consume that token. Basically seeing it, assumption is that you have an Array of Things (of whatever shape) -- but what you have is a Sequence/Stream of Things (of Array shape). So instead of seeing the first value (with first token JsonToken.START_ARRAY) it sees wrapper Array. ... at least this is my hypothesis.

If so, to make things work, you will need to construct a JsonParser -- in your case, IonParser which is subtype (naming of JsonParser is due to legacy reasons but it is shared base type for all formats) -- and advance it to that first JsonToken.START_ARRAY and now construct MappingIterator. This way iterator will assume that parser instance is pointing to the first token of the first value to be iterated over, NOT wrapped within an Array.

I hope above helps: I realize it's not the most readable explanation but hopefully helps unlock the problem.

loicmathieu commented 3 months ago

@cowtowncoder thanks for the explaination.

It works with the following changes:

        // This file has 5 lines
        File testFile = resourceToFile("test.ion");

        ObjectMapper objectMapper = JacksonUtils.createIonObjectMapper();
        JsonParser parser = objectMapper.createParser(new FileInputStream(testFile));
        parser.nextToken();
        MappingIterator<Object> mappingIterator = objectMapper.readerFor(DEFAULT_TYPE_REFERENCE).readValues(parser);

        // the first line is missing
        LongAdder count = new LongAdder();
        mappingIterator.forEachRemaining(obj -> {
            count.increment();
        });
        Assertions.assertEquals(5, count.sum());

As I understand it, there is nothing that can be done to make this works OOTB? So should I close this issue?

cowtowncoder commented 3 months ago

Correct: as far as I understand things, there is natural ambiguity that makes one of 2 alternative cases not work. Cases are:

  1. Array-wrapped content (like [ { "id":1}, {"id":2} ])
  2. Streaming sequence of Array-expressed values (like [1,2] [2,3] [4,5])

and upon seeing the initial [ (or its equivalent in Avro etc) it is not possible to know (without look-ahead; but more fundamentally, without being able to figure out representation of the value type) which case we have. So the tie-breaker is to assume it is always (1).

But this decision is only needed when JsonParser is "unitialized", i.e. does not point to a token (so parser.nextToken() is called). If there is current token, it is assume it MUST be the first token of the first value.

So I agree this should be closed since it does not appear that Ion backend specifically has issues of its own.

loicmathieu commented 3 months ago

Closing it, the detailed answer and reproducer / workaround example are enough to help someone else that have the same issue understand it and work around it.

Thanks a lot @cowtowncoder for the detailed explanation and your help.

cowtowncoder commented 3 months ago

You are wlecome @loicmathieu !