G-Research / ParquetSharp

ParquetSharp is a .NET library for reading and writing Apache Parquet files.
Apache License 2.0

Null value for column read while non-null expected #439

Closed: rmadani closed this 6 months ago

rmadani commented 6 months ago

Issue Description

I have a large parquet file with over 7M rows and over 200 columns in a single row group. All columns are of type string. While looping through the read process, the ReadBatch method starts returning null values for column 89 and onward. Is there any setting I am missing to read the data properly? Any help would be greatly appreciated.

Environment Information

Steps To Reproduce

int numColumns = fileReader.FileMetaData.NumColumns;
for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup)
{
    using var rowGroupReader = fileReader.RowGroup(rowGroup);
    var groupNumRows = checked((int)rowGroupReader.MetaData.NumRows);

    string[] readBuffer = new string[2048];

    LogicalColumnReader<string> colReader = rowGroupReader.Column(0).LogicalReader<string>();
    while (colReader.HasNext)
    {
        colReader.ReadBatch(readBuffer);
        //save buffer in database

        for (int i = 1; i < numColumns; ++i)
        {
            //data viewed in debugger past column 89 is null
            rowGroupReader.Column(i).LogicalReader<string>().ReadBatch(readBuffer);
            //save buffer in database
        }
    }
}

Expected Behavior

Return the data available past column 89.

Additional Context (Optional)

No response

adamreeve commented 6 months ago

Hi @rmadani. Your code looks incorrect: there's no guarantee that a ReadBatch call will read the same number of rows from every other column as it did from column 0, and you never check the number of rows actually read after any of your ReadBatch calls. It may be less than 2048 even if you haven't yet reached the end of the row group. Something like this should work instead:

int numColumns = fileReader.FileMetaData.NumColumns;
string[] readBuffer = new string[2048];

for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup)
{
    using var rowGroupReader = fileReader.RowGroup(rowGroup);
    var groupNumRows = checked((int)rowGroupReader.MetaData.NumRows);

    for (int i = 0; i < numColumns; ++i)
    {
        using var colReader = rowGroupReader.Column(i).LogicalReader<string>();
        while (colReader.HasNext)
        {
            var numRowsRead = colReader.ReadBatch(readBuffer);
            //save numRowsRead values from readBuffer
        }
    }
}
rmadani commented 6 months ago

Thanks for your reply. I assumed that parquet files are similar to SQL tables, i.e. if column 0 returns, for instance, 1000 data values, the same would hold for all the other columns' read batches.

Your sample code reads each column to the end (over 7M data values in my case) before going on to the next column. Is there any way to read 2048 rows for all columns at a time, process them, and then move on to the next 2048 rows?

thank you in advance.

adamreeve commented 6 months ago

Yes, you just need to keep calling ReadBatch until you fill the read buffer to make sure you read 2048 rows for each column, unless you're at the end of the row group. Something like this should work:

int numColumns = fileReader.FileMetaData.NumColumns;
string[] readBuffer = new string[2048];

for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup)
{
    using var rowGroupReader = fileReader.RowGroup(rowGroup);
    var groupNumRows = checked((int)rowGroupReader.MetaData.NumRows);

    var columnReaders = new LogicalColumnReader<string>[numColumns];
    for (int i = 0; i < numColumns; ++i)
    {
        columnReaders[i] = rowGroupReader.Column(i).LogicalReader<string>();
    }

    var groupRowsRead = 0;
    while (groupRowsRead < groupNumRows)
    {
        var rowsToRead = Math.Min(groupNumRows - groupRowsRead, readBuffer.Length);
        for (int i = 0; i < numColumns; ++i)
        {
            var columnRowsRead = 0;
            while (columnReaders[i].HasNext && columnRowsRead < rowsToRead)
            {
                columnRowsRead += columnReaders[i].ReadBatch(readBuffer.AsSpan(columnRowsRead));
            }

            if (columnRowsRead != rowsToRead)
            {
                throw new Exception(
                    $"Expected to read {rowsToRead} rows for column {i} but read {columnRowsRead}");
            }

            //save rowsToRead values from readBuffer
        }

        groupRowsRead += rowsToRead;
    }

    // Dispose the per-column readers before moving on to the next row group
    foreach (var columnReader in columnReaders)
    {
        columnReader.Dispose();
    }
}
adamreeve commented 6 months ago

Looking some more into this, I think my initial explanation was incorrect, because we do always try to fill the read buffer when LogicalColumnReader.ReadBatch is called, so each column should be reading the full 2048 values: https://github.com/G-Research/ParquetSharp/blob/f1499744ac87231a7b8c80778f778873ef7332d2/csharp/LogicalBatchReader/ScalarReader.cs#L24-L27
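
In other words, a single ReadBatch call on a scalar column should already loop internally until the buffer is full or the row group is exhausted. As a caller-side sketch of that expectation (assuming fileReader is an open ParquetFileReader over a file like the one described above):

using var rowGroupReader = fileReader.RowGroup(0);
using var colReader = rowGroupReader.Column(0).LogicalReader<string>();

var buffer = new string[2048];
var rowsRead = colReader.ReadBatch(buffer);
// For a scalar string column, rowsRead should equal buffer.Length (2048)
// unless fewer than 2048 values remain in the row group.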

Maybe the problem was caused by creating new logical readers for column indices >= 1 on every pass through the while loop, rather than reusing readers for the entire row group, as making a new ColumnReader always starts reading from the beginning of the row group. I'm not sure why that would lead to seeing null values though.

The code above creates an array of LogicalColumnReader<string>, so it only creates one reader per column per row group and doesn't have that problem.
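
To make the restart behaviour concrete, here's a minimal contrast (a standalone sketch, again assuming an open ParquetFileReader named fileReader; the intermediate readers in the first two calls are not disposed, for brevity):

using var rowGroupReader = fileReader.RowGroup(0);
var buffer = new string[2048];

// A fresh LogicalColumnReader starts at the beginning of the row group,
// so these two calls both read the same first batch of values:
rowGroupReader.Column(0).LogicalReader<string>().ReadBatch(buffer);
rowGroupReader.Column(0).LogicalReader<string>().ReadBatch(buffer);

// Reusing a single reader advances through the row group instead:
using var reusedReader = rowGroupReader.Column(0).LogicalReader<string>();
reusedReader.ReadBatch(buffer); // reads the first batch
reusedReader.ReadBatch(buffer); // reads the next batch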

rmadani commented 6 months ago

Thank you, that's exactly what I did and it seems to work now. The HasNext state appears to be retained for each column reader.

adamreeve commented 6 months ago

OK great, I will close this as answered then.