aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

string values are written as nulls in parquet file when data is set using Array SetValue Method #464

Closed rajbojja closed 5 months ago

rajbojja commented 5 months ago

Library Version

4.23.2

OS

Windows

OS Architecture

64 bit

How to reproduce?

Hello,

I recently updated to the latest version of the Parquet.Net library, and code that worked fine with older versions no longer behaves the same way. The sample code below generates a Parquet file, but the string values are now written as null.

Could you please review this and offer a solution to address the issue?

----CODE----

using System.IO;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// create file schema
var schema = new ParquetSchema(new DataField<int?>("id"), new DataField<string>("city"));

// create data columns with schema metadata and the data you need
var idColumn = new DataColumn(schema.DataFields[0], new int?[] { null, 2 });
var cityColumn = new DataColumn(schema.DataFields[1], new string[2]);

// set the string values after the column has been constructed
cityColumn.Data.SetValue("test", 0);
cityColumn.Data.SetValue("test1", 1);

// verbatim string (@) so the backslashes in the path are not treated as escapes
await using (Stream fileStream = System.IO.File.OpenWrite(@"C:\Users\Downloads\test.parquet"))
{
    using (ParquetWriter parquetWriter = await ParquetWriter.CreateAsync(schema, fileStream))
    {
        // create a new row group in the file
        using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup())
        {
            await groupWriter.WriteColumnAsync(idColumn);
            await groupWriter.WriteColumnAsync(cityColumn);
        }
    }
}

Failing test

No response

aloneguid commented 5 months ago

When you create a DataColumn, the data passed is an array of null strings:

var cityColumn = new DataColumn(schema.DataFields[1], new string[2]);

DataColumn calculates null flags in the constructor, so every value is written as null. To fix this, pass the actual data to the DataColumn constructor.
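A minimal sketch of that suggestion, reusing only the calls that already appear in the report: the string array is filled first and the DataColumn is constructed afterwards, so the constructor sees the real values. The `using` namespaces and the relative path "test.parquet" are assumptions made for this sketch, not something confirmed in the thread.

```csharp
using System.IO;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

var schema = new ParquetSchema(new DataField<int?>("id"), new DataField<string>("city"));

// Fill a plain array first; no DataColumn exists yet.
var cities = new string[2];
cities.SetValue("test", 0);
cities.SetValue("test1", 1);

// Construct the columns only once the data is final, so that anything the
// constructor computes from the array is computed from the real values.
var idColumn = new DataColumn(schema.DataFields[0], new int?[] { null, 2 });
var cityColumn = new DataColumn(schema.DataFields[1], cities);

// "test.parquet" is a placeholder path chosen for this sketch.
await using (Stream fileStream = File.OpenWrite("test.parquet"))
using (ParquetWriter parquetWriter = await ParquetWriter.CreateAsync(schema, fileStream))
using (ParquetRowGroupWriter groupWriter = parquetWriter.CreateRowGroup())
{
    await groupWriter.WriteColumnAsync(idColumn);
    await groupWriter.WriteColumnAsync(cityColumn);
}
```

The only change from the reported code is the order of operations: the `SetValue` calls run against the array before it is handed to the constructor rather than against `cityColumn.Data` afterwards.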

rajbojja commented 5 months ago

We built generic methods to read and write Parquet files for any kind of model object. To accomplish that, we set the data after initialization using the SetValue method on the column's Data array, as shown below.

cityColumn.Data.SetValue("test1", 1);

It used to work fine with previous versions, and we use these generic methods extensively across the application. Passing the actual data up front may not fit our scenario and would be a huge change across the application. Other data types such as int and double work fine when the data is set with the above method, even with null values, so the issue appears to affect only string.

I got it working by making the DataField non-nullable during initialization, like this: new DataField<string>("id", false). However, that doesn't work if I pass null values.

Are there any alternatives to get this working, or do you have a sample generic implementation of a Parquet writer for any kind of model object?
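One possible shape for such a generic writer, offered only as a sketch: project each property of the model objects into an array first, and construct the DataColumn from the finished array. `CityRow` and `ColumnBuilder` are hypothetical names introduced for illustration and are not part of Parquet.Net; the sketch assumes the same `DataColumn(field, array)` constructor used in the sample above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using Parquet.Data;
using Parquet.Schema;

var schema = new ParquetSchema(new DataField<int?>("id"), new DataField<string>("city"));

var rows = new List<CityRow>
{
    new CityRow(null, "test"),
    new CityRow(2, "test1"),
};

// Each column is created from fully populated data, never mutated afterwards.
DataColumn idColumn = ColumnBuilder.Build(schema.DataFields[0], rows, r => r.Id);
DataColumn cityColumn = ColumnBuilder.Build(schema.DataFields[1], rows, r => r.City);

// Hypothetical model type used only in this sketch.
public record CityRow(int? Id, string? City);

// Hypothetical helper, not a Parquet.Net API.
public static class ColumnBuilder
{
    public static DataColumn Build<TModel, TValue>(
        DataField field,
        IEnumerable<TModel> rows,
        Func<TModel, TValue> selector)
    {
        // Materialize the values before the DataColumn constructor runs.
        TValue[] values = rows.Select(selector).ToArray();
        return new DataColumn(field, values);
    }
}
```

The resulting columns can then be passed to `WriteColumnAsync` exactly as in the original sample. The selector's return type has to match the field's element type (`int?` for "id", `string` for "city").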

aloneguid commented 5 months ago

I appreciate your feedback, but I don't think I will modify the code, except for adding some comments, because this project is just something I do for fun in my spare time. If you really want this feature, you can try to submit a pull request for it.