aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

[BUG]: ParquetSerializer.DeserializeAsync - "Not a parquet" file when using buffer with fixed size #437

Closed edguer closed 3 days ago

edguer commented 10 months ago

Library Version

4.17.0

OS

Windows

OS Architecture

64 bit

How to reproduce?

Write this code or similar:

byte[] bytes = /* some byte array */;

// Using a 4096 buffer from the array pool
using var memoryStream = new MemoryStream(ArrayPool<byte>.Shared.Rent(4096));
await memoryStream.WriteAsync(bytes, 0, bytes.Length);
return await ParquetSerializer.DeserializeAsync<T>(memoryStream, ParquetOptions.ParquetOptions);

And you get:

  System.IO.IOException: not a parquet file, head: 50415231, tail: 00000000
Stack Trace: 
  ParquetActor.ValidateFileAsync()
  ParquetReader.InitialiseAsync(CancellationToken cancellationToken)
  ParquetReader.CreateAsync(Stream input, ParquetOptions parquetOptions, Boolean leaveStreamOpen, CancellationToken cancellationToken)
  ParquetSerializer.DeserializeAsync[T](Stream source, ParquetOptions options, CancellationToken cancellationToken)

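The reason is that ValidateFileAsync reads the tail by seeking relative to the end of the stream. A MemoryStream created over an existing array reports the full array length (here at least 4096) as its Length, no matter how many bytes were actually written. When the buffer is not completely filled, the last 4 bytes are therefore unwritten buffer content (zeros in the trace above) instead of the Parquet magic bytes: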

https://github.com/aloneguid/parquet-dotnet/blob/9dcd02ec3e2120d31dd85ed31afc9255c2fc8f5c/src/Parquet/ParquetActor.cs#L42C50-L42C50

        protected async Task ValidateFileAsync() {
            _fileStream.Seek(0, SeekOrigin.Begin);
            byte[] head = await _fileStream.ReadBytesExactlyAsync(4);

            _fileStream.Seek(-4, SeekOrigin.End);
            byte[] tail = await _fileStream.ReadBytesExactlyAsync(4);

            if(!MagicBytes.SequenceEqual(head) || !MagicBytes.SequenceEqual(tail))
                throw new IOException($"not a parquet file, head: {head.ToHexString()}, tail: {tail.ToHexString()}");
        }
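To make the mismatch concrete, here is a minimal sketch that does not use parquet-dotnet at all. It assumes a 64-byte stand-in payload that only carries the `PAR1` magic at both ends (not a real Parquet file), and the helper names `MakeFakeParquet` and `ReadTailHex` are hypothetical, introduced only for this illustration:

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Text;

// Stand-in payload (assumption): 64 bytes with "PAR1" at both ends.
byte[] payload = MakeFakeParquet(64);

byte[] rented = ArrayPool<byte>.Shared.Rent(4096);
try
{
    using var stream = new MemoryStream(rented);
    stream.Write(payload, 0, payload.Length);
    long written = stream.Position; // end of the data that was actually written

    // Length is the size of the rented array, not the number of bytes written.
    Console.WriteLine($"Position = {written}, Length = {stream.Length}");

    // Same idea as Seek(-4, SeekOrigin.End): reads whatever sits at the end of the buffer.
    Console.WriteLine($"tail relative to Length:   {ReadTailHex(stream, stream.Length)}");

    // Reading relative to the written length finds the magic bytes.
    Console.WriteLine($"tail relative to Position: {ReadTailHex(stream, written)}");
}
finally
{
    ArrayPool<byte>.Shared.Return(rented);
}

// Hypothetical helper: builds a buffer that only looks like a Parquet file at its edges.
static byte[] MakeFakeParquet(int size)
{
    byte[] magic = Encoding.ASCII.GetBytes("PAR1");
    byte[] data = new byte[size];
    magic.CopyTo(data, 0);
    magic.CopyTo(data, size - 4);
    return data;
}

// Hypothetical helper: returns the 4 bytes that end at offset `end`, as hex.
static string ReadTailHex(Stream s, long end)
{
    s.Seek(end - 4, SeekOrigin.Begin);
    byte[] tail = new byte[4];
    int read = s.Read(tail, 0, tail.Length); // a MemoryStream returns all 4 bytes here
    return Convert.ToHexString(tail);
}
```

Note that a rented array is not guaranteed to be zeroed, so the tail read relative to `Length` may also contain stale data from an earlier use of the buffer rather than `00000000`.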

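This would probably fix it, assuming the stream position still marks the end of the written data when the tail is read (in the method as shown the position is 4 after the head has been read, so the end position would have to be captured before the first seek):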

            _fileStream.Seek((_fileStream.Position - 4), SeekOrigin.Begin);
            byte[] tail = await _fileStream.ReadBytesExactlyAsync(4);
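Independently of any change in the library, the caller can avoid the problem by bounding the stream to the bytes that were actually written. The following is a sketch under stated assumptions: `WrapExact` is a hypothetical helper name, the 12-byte payload is a stand-in rather than a real Parquet file, and the deserializer call from the report is deliberately left out. It relies only on the `MemoryStream(byte[], int, int, bool)` constructor, which limits `Length` to the given count:

```csharp
using System;
using System.Buffers;
using System.IO;
using System.Text;

// Stand-in payload (assumption): just the magic bytes around some filler.
byte[] bytes = Encoding.ASCII.GetBytes("PAR1dataPAR1");

byte[] rented = ArrayPool<byte>.Shared.Rent(4096);
try
{
    using MemoryStream stream = WrapExact(bytes, rented);
    Console.WriteLine($"Length = {stream.Length}"); // equals bytes.Length, not rented.Length

    // The same seek that ValidateFileAsync performs now lands on the real tail.
    stream.Seek(-4, SeekOrigin.End);
    byte[] tail = new byte[4];
    int read = stream.Read(tail, 0, tail.Length);
    Console.WriteLine($"tail = {Encoding.ASCII.GetString(tail)}");

    // A stream built this way could then be passed to the deserializer from the report.
}
finally
{
    ArrayPool<byte>.Shared.Return(rented);
}

// Hypothetical helper: copies `source` into `buffer` and exposes only that prefix.
static MemoryStream WrapExact(byte[] source, byte[] buffer)
{
    if (buffer.Length < source.Length)
        throw new ArgumentException("Buffer is smaller than the payload.", nameof(buffer));

    Buffer.BlockCopy(source, 0, buffer, 0, source.Length);
    return new MemoryStream(buffer, 0, source.Length, writable: false);
}
```

Calling `SetLength(bytes.Length)` on the original stream after writing should have the same effect on `Length`; the bounded constructor is shown because it also leaves the position at 0.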

Failing test

No response

aloneguid commented 10 months ago

Sounds good. Please raise a PR if that's ok.