aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
542 stars 140 forks

[BUG]: If the data transferred to WriteColumnAsync is too large, an error occurs #494

Open ClaymorePlay opened 3 months ago

ClaymorePlay commented 3 months ago

Library version

latest

Operating systems

Linux, Windows

OS architecture

64 bit

How to reproduce?

  1. Read roughly 1.5 million rows from a database for conversion.
  2. Open a Parquet writer on the output stream.
  3. Create a single row group (the requirement is that everything goes into one row group).
  4. Write each column with WriteColumnAsync.
  5. The write fails with `OverflowException: The array size has exceeded the supported range.`, thrown from Microsoft.IO.RecyclableMemoryStream.ToArray() inside Parquet.File.DataColumnWriter.CompressAndWriteAsync. The full stack trace is under "Failed test" below.
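The steps above can be sketched roughly as follows. This is a hypothetical reconstruction of the write pattern, not the reporter's actual code: the two-column schema, the generated stand-in data, and the output file name `logs.parquet` are all assumptions, and the real job reads its rows from a database.

```csharp
// Hypothetical reconstruction of the write pattern described in the steps.
// Requires the Parquet.Net NuGet package.
using System.IO;
using System.Linq;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// Assumed schema; the real table layout is not given in the report.
var schema = new ParquetSchema(
    new DataField<long>("id"),
    new DataField<string>("payload"));

// Stand-in for the ~1.5 million database rows (assumed, much smaller payloads).
const int rowCount = 1_500_000;
long[] ids = Enumerable.Range(0, rowCount).Select(i => (long)i).ToArray();
string[] payloads = ids.Select(i => $"log entry {i}").ToArray();

using Stream output = File.Create("logs.parquet");
using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, output);

// Everything goes into a single row group, as in step 3.
using (ParquetRowGroupWriter rowGroup = writer.CreateRowGroup())
{
    // One WriteColumnAsync call per column, as in step 4.
    await rowGroup.WriteColumnAsync(new DataColumn(schema.DataFields[0], ids));
    await rowGroup.WriteColumnAsync(new DataColumn(schema.DataFields[1], payloads));
}
```

With payloads this small the sketch is not expected to fail. The stack trace suggests the exception appears only once a single column's buffered data no longer fits in one .NET byte array (roughly 2 GB), so a real reproduction would presumably need much larger values per row.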

While I'm at it: can I keep adding data to the first row group after it has been closed, or after its columns have been written?

Failed test

Error:
OverflowException: The array size has exceeded the supported range.
   at Microsoft.IO.RecyclableMemoryStream.ToArray() in /_/src/RecyclableMemoryStream.cs:line 820
   at Parquet.File.DataColumnWriter.CompressAndWriteAsync(PageHeader ph, MemoryStream data, ColumnSizes cs, CancellationToken cancelToken)
   at Parquet.File.DataColumnWriter.WriteColumnAsync(ColumnChunk chunk, DataColumn column, SchemaElement tse, CancellationToken cancelToken)
   at Parquet.File.DataColumnWriter.WriteAsync(FieldPath fullPath, DataColumn column, CancellationToken cancelToken)
   at Parquet.ParquetRowGroupWriter.WriteColumnAsync(DataColumn column, Dictionary`2 customMetadata, CancellationToken cancelToken)
   at Workers.ArchivingWorker.Jobs.ArchivingWebsocketLogsFull.WriteData(ParquetWriter writer, List`1 response) in /src/Workers.ArchivingWorker/Jobs/ArchivingWebsocketLogsFull.cs:line 285
   at Workers.ArchivingWorker.Jobs.ArchivingWebsocketLogsFull.ExecuteAsync(IServiceProvider serviceProvider) in /src/Workers.ArchivingWorker/Jobs/ArchivingWebsocketLogsFull.cs:line 250

aloneguid commented 4 weeks ago

Sorry, I don't understand the issue. Maybe providing a failing test will help, i.e. "show me the code" ;)