aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
[BUG]: don't know how to skip type Uuid #445

Closed Pyroluk closed 6 months ago

Pyroluk commented 6 months ago

Library Version

4.18.0

OS

Windows and Linux

OS Architecture

64 bit

How to reproduce?

Create a Parquet file with the following schema and 1.5 million rows:

var schemaScaler = new ParquetSchema(
    new DataField<long>("a"),
    new DataField<long>("b"),
    new DataField<double>("c"),
    new DataField<double>("d"),
    new DataField<double>("e"),
    new DataField<double>("f"),
    new DataField<double>("g"),
    new DataField<double>("h"),
    new DataField<double>("i"),
    new DataField<double>("j"),
    new DataField<double>("k"),
    new DataField<double>("l"),
    new DataField<double>("m"),
    new DataField<double>("n"),
    new DataField<double>("o"),
    new DataField<double>("p"),
    new DataField<double>("q"),
    new DataField<double>("r"),
    new DataField<double>("s"),
    new DataField<double>("t"),
    new DataField<double>("u"),
    new DataField<double>("v"),
    new DataField<double>("w"),
    new DataField<double>("x"),
    new DataField<double>("y")
);

using (Stream fs = File.Open(parquetScalerFilePath, FileMode.Create))
using (ParquetWriter writer = await ParquetWriter.CreateAsync(schemaScaler, fs))
{
    writer.CustomMetadata = new Dictionary<string, string>
    {
        ["MainFilePath"] = @"A:\B\C\D\x.parquet",
    };

    writer.CompressionMethod = CompressionMethod.Zstd;

    foreach (var chunk in data.Chunk(5000))
        using (ParquetRowGroupWriter groupWriter = writer.CreateRowGroup())
        {
            //...
            writeParquetColumn(chunk.Select(x => x.a).ToArray(), (DataField)schemaScaler[0], groupWriter);
            writeParquetColumn(chunk.Select(x => x.b).ToArray(), (DataField)schemaScaler[1], groupWriter);
            //...
        }
}
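
The `writeParquetColumn` helper used above is not shown in the report. As a rough sketch only, a helper of this kind might wrap the values in a `DataColumn` and write it to the current row group. The class name, method name, and signature below are hypothetical and are not the reporter's code. The report calls its helper without `await`, so the original is presumably synchronous; this version returns a `Task` because `WriteColumnAsync` is asynchronous.

```csharp
using System;
using System.Threading.Tasks;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

static class ParquetColumnHelper
{
    // Hypothetical helper (an assumption, not taken from the issue):
    // wraps an array of values in a DataColumn for the given field
    // and writes it to the supplied row group.
    public static Task WriteParquetColumnAsync(
        Array values,
        DataField field,
        ParquetRowGroupWriter groupWriter)
    {
        // The array's element type must match the field's declared type,
        // e.g. long[] for DataField<long>, double[] for DataField<double>.
        var column = new DataColumn(field, values);
        return groupWriter.WriteColumnAsync(column);
    }
}
```

With a helper like this, each call inside the row group would be awaited, for example `await ParquetColumnHelper.WriteParquetColumnAsync(chunk.Select(x => x.a).ToArray(), (DataField)schemaScaler[0], groupWriter);`.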

The generated file cannot be read back:

var table = await ParquetReader.ReadTableFromFileAsync("x.parquet");

This call throws: System.InvalidOperationException: "don't know how to skip type Uuid"

Reading the file with PyArrow fails with: {OSError}OSError("Couldn't deserialize thrift: don't know what type: \x0e\nDeserializing page header failed.\n")

However, ParquetViewer 2.8.0.3 can read the file without any problems.

This problem was introduced in Parquet.Net 4.18.0. Running the identical Parquet generation code with 4.17.0 produces files that can be read back using both Parquet.Net and PyArrow.

As a side note: Fastparquet fails with NotImplementedError: Encoding 5 on files from both Parquet.Net versions. Guess they have some work to do...

In case the problem is not reproducible on your side, I can upload my file and send you a PM with the link attached.

Failing test

No response

aloneguid commented 6 months ago

This is a regression, please upgrade to 4.18.1.

Pyroluk commented 6 months ago

Ok, thank you.