aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
[BUG]: Getting different length for keyColumn and valueColumn of a partition column #489

Open shamimashik opened 3 months ago

shamimashik commented 3 months ago

Library Version

4.23.4

OS

Windows

OS Architecture

64 bit

How to reproduce?

I'm seeing a difference in DataColumn.Data.Length for the keyColumn and the valueColumn of the partitionValues column.

Here are the paths I'm using:

keyPath: "add/partitionValues/key_value/key"
valuePath: "add/partitionValues/key_value/value"

For keyPath, DataColumn.Data.Length is 64865, whereas for valuePath it is 64867.

Note that this issue was not present in version 3.10.0.

Failing test

Code that I used to verify the issue: 

private async Task<DataColumn[]> ReadParquetMyFileAsync(bool treatByteArrayAsString) {
    List<DataColumn> dataColumns = new List<DataColumn>();
    string name = "<filename>.checkpoint.parquet";
    string keyPath = "add/partitionValues/key_value/key";
    string valuePath = "add/partitionValues/key_value/value";

    using(Stream s = OpenTestFile(name)) {
        using(ParquetReader pr = await ParquetReader.CreateAsync(
            s, new ParquetOptions { TreatByteArrayAsString = treatByteArrayAsString })) {
            DataField[] dataFields = pr.Schema.GetDataFields();
            Dictionary<string, DataField> dataFieldMapping = this.RetrieveDataFieldMapping(dataFields);
            for(int i = 0; i < pr.RowGroupCount; ++i) {
                using ParquetRowGroupReader groupReader = pr.OpenRowGroupReader(i);
                if(dataFieldMapping.TryGetValue(keyPath, out DataField keyField) &&
                    dataFieldMapping.TryGetValue(valuePath, out DataField valueField)) {
                    DataColumn keyColumn = await groupReader.ReadColumnAsync(keyField);
                    DataColumn valueColumn = await groupReader.ReadColumnAsync(valueField);
                    Array keyColumnData = keyColumn.Data;
                    Array valueColumnData = valueColumn.Data;
                    dataColumns.Add(keyColumn);
                    dataColumns.Add(valueColumn);

                    string result = string.Empty;
                    // The loop is bounded by the key array, so any extra trailing
                    // entries in the value array are not printed.
                    for(int dataIndex = 0; dataIndex < keyColumnData.Length; ++dataIndex) {
                        string key = keyColumnData.GetValue(dataIndex)?.ToString() ?? "null";
                        string val = valueColumnData.GetValue(dataIndex)?.ToString() ?? "null";
                        result += "[" + dataIndex + "] " + key + ": " + val + "\n";
                    }
                    Console.WriteLine(result);
                }
            }

            return dataColumns.ToArray();
        }
    }
}

mukunku commented 2 months ago

I noticed this a while ago too. I thought that it was intentional to save space if all the values after a certain index are null.

I wrote my code the following way to accommodate the key and value array lengths not being the same.

https://github.com/mukunku/ParquetViewer/blob/77c70c9d2a95c96de28c5701717c08c362d8eb13/src/ParquetViewer.Engine/ParquetEngine.Processor.cs#L252-L273
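
For readers who cannot open the link, here is a minimal sketch of one way to tolerate the mismatch. It is not the ParquetViewer code and not a Parquet.Net API: `ColumnPairing` and `PairUp` are hypothetical names, and the sketch assumes (as guessed above, not confirmed by the library) that entries missing from the shorter array can be treated as null.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Parquet.Net.
public static class ColumnPairing {
    // Pairs keys[i] with values[i] up to the longer of the two arrays.
    // Assumption: an index past the end of the shorter array means "null".
    public static List<KeyValuePair<object, object>> PairUp(Array keys, Array values) {
        int length = Math.Max(keys.Length, values.Length);
        var pairs = new List<KeyValuePair<object, object>>(length);

        for(int i = 0; i < length; ++i) {
            // Guard each lookup so neither array is indexed out of range.
            object key = i < keys.Length ? keys.GetValue(i) : null;
            object value = i < values.Length ? values.GetValue(i) : null;
            pairs.Add(new KeyValuePair<object, object>(key, value));
        }

        return pairs;
    }
}
```

With the columns from the report, this would be called as `ColumnPairing.PairUp(keyColumn.Data, valueColumn.Data)`. Whether index-by-index pairing is actually correct for this file depends on why the lengths differ, which the thread has not established.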