Hi @rmadani. Your code looks incorrect: there's no guarantee that ReadBatch will read the same number of rows from every other column as it does from column 0, and you're not checking the number of rows actually read after each ReadBatch call. It may be less than 2048 even if you haven't yet reached the end of the row group. Something like this should work instead:
int numColumns = fileReader.FileMetaData.NumColumns;
string[] readBuffer = new string[2048];
for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup)
{
    using var rowGroupReader = fileReader.RowGroup(rowGroup);
    for (int i = 0; i < numColumns; ++i)
    {
        using var colReader = rowGroupReader.Column(i).LogicalReader<string>();
        while (colReader.HasNext)
        {
            var numRowsRead = colReader.ReadBatch(readBuffer);
            // Save numRowsRead values from readBuffer
        }
    }
}
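For completeness, both this snippet and the one further below assume a fileReader that has already been opened, along these lines (the file name here is just a placeholder):

using ParquetSharp;

using var fileReader = new ParquetFileReader("data.parquet");
// ... loop over the row groups as shown above ...
fileReader.Close();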
Thanks for your reply. I assumed that parquet files are similar to SQL tables, i.e. if column 0 returns, for instance, 1000 data values, the same would hold for all the other columns' read batches.
Your sample code reads each column to the end (over 7M data values in my case) before going on to the next column. Is there any way to read 2048 rows for all columns at a time, process them, and then move on to the next 2048 rows?
Thank you in advance.
Yes, you just need to keep calling ReadBatch until you fill the read buffer, to make sure you read 2048 rows for each column unless you're at the end of the row group. Something like this should work:
int numColumns = fileReader.FileMetaData.NumColumns;
string[] readBuffer = new string[2048];
for (int rowGroup = 0; rowGroup < fileReader.FileMetaData.NumRowGroups; ++rowGroup)
{
    using var rowGroupReader = fileReader.RowGroup(rowGroup);
    var groupNumRows = checked((int)rowGroupReader.MetaData.NumRows);
    // Create one logical reader per column up front and reuse it for the
    // whole row group, so each reader keeps its position between batches.
    var columnReaders = new LogicalColumnReader<string>[numColumns];
    for (int i = 0; i < numColumns; ++i)
    {
        columnReaders[i] = rowGroupReader.Column(i).LogicalReader<string>();
    }
    var groupRowsRead = 0;
    while (groupRowsRead < groupNumRows)
    {
        var rowsToRead = Math.Min(groupNumRows - groupRowsRead, readBuffer.Length);
        for (int i = 0; i < numColumns; ++i)
        {
            var columnRowsRead = 0;
            while (columnReaders[i].HasNext && columnRowsRead < rowsToRead)
            {
                columnRowsRead += columnReaders[i].ReadBatch(
                    readBuffer.AsSpan(columnRowsRead, rowsToRead - columnRowsRead));
            }
            if (columnRowsRead != rowsToRead)
            {
                throw new Exception(
                    $"Expected to read {rowsToRead} rows for column {i} but read {columnRowsRead}");
            }
            // Save rowsToRead values from readBuffer
        }
        groupRowsRead += rowsToRead;
    }
    // The logical readers are disposable, so clean them up before
    // moving on to the next row group.
    foreach (var columnReader in columnReaders)
    {
        columnReader.Dispose();
    }
}
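With this structure, each pass of the outer while loop yields one aligned block of up to 2048 rows across all columns, so the data can be processed row by row before moving on to the next block.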
Looking some more into this, I think my initial explanation was incorrect, because we do always try to fill the read buffer when LogicalColumnReader.ReadBatch is called, so each column should be reading the full 2048 values: https://github.com/G-Research/ParquetSharp/blob/f1499744ac87231a7b8c80778f778873ef7332d2/csharp/LogicalBatchReader/ScalarReader.cs#L24-L27
Maybe the problem was caused by creating new logical readers for column indices > 1, rather than reusing readers for the entire row group, as making a new ColumnReader will always start reading from the beginning of the row group (see the sketch below). I'm not sure why that would lead to seeing null values though.
The code above creates an array of LogicalColumnReader<string>, so it only creates one LogicalColumnReader per column in a row group and doesn't have that problem.
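For illustration, here is a hypothetical reconstruction of that pattern (the batch count and variable names are assumptions, not your actual code): because a fresh logical reader is created for every batch, each ReadBatch call starts over at the beginning of the row group.

using var rowGroupReader = fileReader.RowGroup(0);
for (int batch = 0; batch < 10; ++batch)
{
    for (int i = 0; i < numColumns; ++i)
    {
        // A new LogicalColumnReader is created on every iteration, so the
        // read position from the previous batch is lost and this call
        // returns rows starting at row 0 of the row group each time.
        using var colReader = rowGroupReader.Column(i).LogicalReader<string>();
        colReader.ReadBatch(readBuffer);
    }
}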
Thank you, that's exactly what I did and it seems to work now. The HasNext state appears to be retained for each column reader.
OK great, I will close this as answered then.
Issue Description
I have a large parquet file with over 7M rows and over 200 columns in a single row group. All columns are of type string. While looping through the read process, the ReadBatch method starts returning null values from column 89 onward. Is there any setting I'm missing to read properly? Any help would be greatly appreciated.
Expected Behavior
Return the data available past column 89.