aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

[BUG]: Type mismatch silently leads to incorrect de-serialization result #573

Closed dkotov closed 3 days ago

dkotov commented 4 days ago

Library Version

5.0.2

OS

Ubuntu Linux 22.04

OS Architecture

64 bit

How to reproduce?

  1. Create a Parquet file containing non-zero Int64 values.
  2. Define a class for de-serialization in which the matching property is declared as Int32.
  3. De-serialize the values from the file into objects of that class.

Result: de-serialization completes without errors or warnings, but the values in the objects don't match the values in the file, e.g. 3 instead of 1.

Failing test

No response

aloneguid commented 3 days ago

Not reproducible; here is a test proving it:

class EdgeCaseInt32 {
    public int Id { get; set; }
}

[Fact]
public async Task EdgeCase_rawint64_to_classInt32() {
    var schema = new ParquetSchema(new DataField<long>("Id"));
    using var ms = new MemoryStream();
    using(ParquetWriter writer = await ParquetWriter.CreateAsync(schema, ms)) {
        using(ParquetRowGroupWriter rg = writer.CreateRowGroup()) {
            await rg.WriteColumnAsync(new DataColumn(schema.DataFields[0], new long[] { 1, 2, 3 }));
        }
    }
    ms.Position = 0;

    IList<EdgeCaseInt32> data = await ParquetSerializer.DeserializeAsync<EdgeCaseInt32>(ms);

    Assert.Equal(1, data[0].Id);
    Assert.Equal(2, data[1].Id);
    Assert.Equal(3, data[2].Id);
}

Feel free to reopen with a reproducible test if I misunderstood you.

dkotov commented 3 days ago

You are right, I did miss one detail: to reproduce this issue, one needs a file with the following schema (without a logical type/hint):

message spark_schema {
  optional int64 Id;
}

If I understand correctly, the provided test actually produces the following schema (with a logical type/hint):

message root {
  required int64 Id (INTEGER(64,true));
}

That's why it doesn't reproduce the issue. Unfortunately, I'm not sure how to produce the first schema with Parquet.Net code, so I need your help here.

I did attach a sample file, though, with the following content:

parquet-tools cat ./no-logical-type.parquet
# output
[{"Id":1},{"Id":2},{"Id":3}]

And here is an integration test for it:

class EdgeCaseInt32 {
    public int? Id { get; set; }
}

[Fact]
public async Task EdgeCase_fileInt64_to_classInt32() {
    IList<EdgeCaseInt32> data = await ParquetSerializer.DeserializeAsync<EdgeCaseInt32>("no-logical-type.parquet");

    Assert.Equal(1, data[0].Id); // Actual: 1 - SUCCESS
    Assert.Equal(2, data[1].Id); // Actual: 0 - FAILURE
    Assert.Equal(3, data[2].Id); // Actual: 2 - FAILURE
}
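
One possible reading of the values above, not confirmed anywhere in this thread: 1, 0, 2 is exactly what you get if the little-endian bytes of the Int64 values 1, 2, 3 are viewed as consecutive Int32 values. The sketch below only illustrates that arithmetic. It does not use Parquet.Net and says nothing about how the library actually reads the column; `Int64AsInt32Sketch` and `ReinterpretPrefix` are hypothetical names made up for this illustration.

```csharp
using System;
using System.Runtime.InteropServices;

static class Int64AsInt32Sketch
{
    // Hypothetical helper: views the raw memory of an Int64 array as Int32 values
    // (no numeric conversion) and returns the first `count` of them.
    // Assumes a little-endian machine, where each long is laid out low half first.
    public static int[] ReinterpretPrefix(long[] source, int count)
    {
        ReadOnlySpan<int> view = MemoryMarshal.Cast<long, int>(source.AsSpan());
        return view.Slice(0, count).ToArray();
    }
}
```

On a little-endian machine, `Int64AsInt32Sketch.ReinterpretPrefix(new long[] { 1, 2, 3 }, 3)` returns 1, 0, 2: the full Int32 view is 1, 0, 2, 0, 3, 0, and only the first three entries are taken. That matches the "Actual" values in the test, but it is only a guess at the cause.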

dkotov commented 3 days ago

@aloneguid, please reopen the issue on my behalf, as I don't have the required permissions to do it. Thanks!