Create a parquet file with the following schema and 1.5 million rows:
var schemaScaler = new ParquetSchema(
    new DataField<long>("a"),
    new DataField<long>("b"),
    new DataField<double>("c"),
    new DataField<double>("d"),
    new DataField<double>("e"),
    new DataField<double>("f"),
    new DataField<double>("g"),
    new DataField<double>("h"),
    new DataField<double>("i"),
    new DataField<double>("j"),
    new DataField<double>("k"),
    new DataField<double>("l"),
    new DataField<double>("m"),
    new DataField<double>("n"),
    new DataField<double>("o"),
    new DataField<double>("p"),
    new DataField<double>("q"),
    new DataField<double>("r"),
    new DataField<double>("s"),
    new DataField<double>("t"),
    new DataField<double>("u"),
    new DataField<double>("v"),
    new DataField<double>("w"),
    new DataField<double>("x"),
    new DataField<double>("y")
);
using (Stream fs = File.Open(parquetScalerFilePath, FileMode.Create))
using (ParquetWriter writer = await ParquetWriter.CreateAsync(schemaScaler, fs))
{
    writer.CustomMetadata = new Dictionary<string, string>
    {
        ["MainFilePath"] = @"A:\B\C\D\x.parquet",
    };
    writer.CompressionMethod = CompressionMethod.Zstd;
    foreach (var chunk in data.Chunk(5000))
    using (ParquetRowGroupWriter groupWriter = writer.CreateRowGroup())
    {
        //...
        writeParquetColumn(chunk.Select(x => x.a).ToArray(), (DataField)schemaScaler[0], groupWriter);
        writeParquetColumn(chunk.Select(x => x.b).ToArray(), (DataField)schemaScaler[1], groupWriter);
        //...
    }
}
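The writeParquetColumn helper is not shown above; for completeness, here is a minimal sketch of what it might look like (an assumption on my part — the name and signature come from the calls above, the body is guessed from Parquet.Net's DataColumn/WriteColumnAsync API):

```csharp
// Hypothetical helper - wraps the values in a DataColumn for the given field
// and writes it into the current row group. The original snippet calls this
// without await, so the real helper may block on the task internally.
static async Task writeParquetColumn(Array values, DataField field, ParquetRowGroupWriter groupWriter)
{
    await groupWriter.WriteColumnAsync(new DataColumn(field, values));
}
```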
The generated file cannot be read back:
var table = await ParquetReader.ReadTableFromFileAsync("x.parquet");
This call throws: System.InvalidOperationException: "don't know how to skip type Uuid"
Reading the file with PyArrow fails with:
{OSError}OSError("Couldn't deserialize thrift: don't know what type: \x0e\nDeserializing page header failed.\n")
However, ParquetViewer V. 2.8.0.3 can read the file without any problems.
This problem was introduced with Parquet.Net V. 4.18.0. Running the identical parquet generation code with V. 4.17.0 produces files that can be read back with both Parquet.Net and PyArrow.
As a side note: fastparquet fails with NotImplementedError: Encoding 5 for files from both Parquet.Net versions. Guess they have some work to do...
In case the problem is not reproducible on your side, I can upload my file and send you the link via PM.
Library Version
4.18.0
OS
Windows and Linux
OS Architecture
64 bit
How to reproduce?
See the description above.
Failing test
No response