aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

[BUG]: ArgumentOutOfRangeException thrown when trying to write column (interop type narrowing) #393

Open itayfisz opened 1 year ago

itayfisz commented 1 year ago

Library Version

4.16.2

OS

Linux

OS Architecture

64 bit

How to reproduce?

It's hard to reproduce - it happens rarely, when I try to write a column with many rows containing very long string values. After calling WriteColumnAsync, I get an ArgumentOutOfRangeException - see the full error below. It seems that in IronCompress\Iron.cs, the output variable "len" returned by the native call is negative, and it is then used to rent an array, which causes the ArgumentOutOfRangeException.

    bool ok = Native.compress(
        compressOrDecompress,
        (int)codec, inputPtr, input.Length, null, &len, level);
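
For illustration, here is a minimal standalone sketch (not the library's code) of the downstream failure: renting a pooled buffer with a negative size throws the same ArgumentOutOfRangeException with parameter name 'minimumLength' that appears in the stack trace.

    using System.Buffers;

    // Stand-in for the negative length that the native call writes into &len.
    int len = -1;

    // ArrayPool<T>.Rent validates its argument and throws
    // ArgumentOutOfRangeException (Parameter 'minimumLength') for negative sizes.
    byte[] buffer = ArrayPool<byte>.Shared.Rent(len);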

This is probably because snappy::RawCompress, called in api.cpp, has an integer overflow. Here's a bug report on Snappy about it, which has already been fixed.
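
To make the suspected narrowing concrete, here is a small sketch (the input size is an assumption) showing how a worst-case compressed length above Int32.MaxValue wraps to a negative value when narrowed to a 32-bit int:

    using System;

    // Snappy's documented worst case is 32 + n + n/6 bytes for an n-byte input.
    long n = 1_900_000_000;                      // assumed: ~1.9 GB of column data
    long maxCompressed = 32 + n + n / 6;         // ~2.22 GB, larger than int.MaxValue
    int narrowed = unchecked((int)maxCompressed);
    Console.WriteLine(narrowed);                 // prints a negative number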

I assume the solution would be to upgrade the Snappy version, but I'm not sure whether that would fix it or just return a more informative error.
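
Either way, a defensive check on the returned length (a sketch under assumptions, not the library's current code) would at least surface a clearer error than the one coming out of the array pool:

    // Hypothetical guard after the native call; "len" and "codec" are the
    // locals from the Iron.cs snippet above.
    if(len < 0)
        throw new InvalidOperationException(
            $"native {codec} compressor reported an invalid output length ({len})");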

Exception Details
===================================
Exception Type: System.ArgumentOutOfRangeException
Message: Specified argument was out of the range of valid values. (Parameter 'minimumLength')
Actual Value: 
Param Name: minimumLength
Target Site: T[] Rent(Int32)
Help Link: 
Source: System.Private.CoreLib
HResult: -2146233086

Stack Trace Details 
-----------------------------------
   at System.Buffers.TlsOverPerCoreLockedStacksArrayPool`1.Rent(Int32 minimumLength)
   at IronCompress.Iron.NativeCompressOrDecompress(Boolean compressOrDecompress, Codec codec, ReadOnlySpan`1 input, CompressionLevel compressionLevel, Nullable`1 outputLength)
   at IronCompress.Iron.Compress(Codec codec, ReadOnlySpan`1 input, Nullable`1 outputLength, CompressionLevel compressionLevel)
   at Parquet.File.DataColumnWriter.CompressAndWriteAsync(PageHeader ph, MemoryStream data, ColumnSizes cs, CancellationToken cancellationToken)
   at Parquet.File.DataColumnWriter.WriteColumnAsync(ColumnChunk chunk, DataColumn column, SchemaElement tse, CancellationToken cancellationToken)
   at Parquet.File.DataColumnWriter.WriteAsync(FieldPath fullPath, DataColumn column, CancellationToken cancellationToken)
   at Parquet.ParquetRowGroupWriter.WriteColumnAsync(DataColumn column, Dictionary`2 customMetadata, CancellationToken cancellationToken)

Failing test

No response

aloneguid commented 1 year ago

IronCompress is already on the latest Snappy version, but there is a bug in the data type narrowing, as you mentioned. Thanks for reporting this - I'll try to reproduce it and get some fixes in.

In the meantime, you can try writing in batches (row groups), as it looks like the columns are massive anyway and readers will have issues decompressing them if RAM is a concern. A sketch of that approach is below.
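
For reference, a minimal sketch of batched writing with multiple row groups; the field name, batch size, file path and the LoadValues helper are illustrative assumptions, not code from this issue:

    using System.IO;
    using System.Linq;
    using Parquet;
    using Parquet.Data;
    using Parquet.Schema;

    // Assumed schema: a single string column holding the long values.
    var field = new DataField<string>("value");
    var schema = new ParquetSchema(field);

    string[] allValues = LoadValues();   // hypothetical source of the row values
    const int batchSize = 100_000;       // assumed batch size; tune to your data

    using Stream fs = File.Create("data.parquet");
    using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, fs);

    for(int offset = 0; offset < allValues.Length; offset += batchSize) {
        string[] batch = allValues.Skip(offset).Take(batchSize).ToArray();

        // Each batch becomes its own row group, so no single compressed page
        // has to hold the entire column at once.
        using ParquetRowGroupWriter rowGroup = writer.CreateRowGroup();
        await rowGroup.WriteColumnAsync(new DataColumn(field, batch));
    }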