aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

[BUG]: Performance decrease since > 4.11 or higher when writing into Azure blob stream #407

Closed andreaslennartz closed 6 months ago

andreaslennartz commented 9 months ago

Library Version

4.11, 4.12, 4.13, 4.14, 4.15, 4.16

OS

Windows 11, Azure, .NET 6

OS Architecture

64 bit

How to reproduce?

First of all, thanks for creating this great library! We encountered an issue when using Parquet.NET to directly write data using a stream on an Azure blob.

Bug details: Parquet.NET supports writing into an existing stream (e.g. a file stream). We write data into an Azure blob using the NuGet package Azure.Storage.Blobs and a BlockBlobClient. Here is an example of how the writer is initialized:

var blockBlobClient = new BlockBlobClient(connectionString, containerName, name);
using var stream = blockBlobClient.OpenWrite(true, null);
using StreamWriter streamWriter = new StreamWriter(stream);
using var parquetWriter = await ParquetWriter.CreateAsync(schema, streamWriter.BaseStream, null, append: false);

Since we upgraded from Parquet.NET 4.10.1 to a later version (e.g. the current 4.16.4, but every version from 4.11 onward is also affected), writing into the blob stream has become significantly slower (about 10x to 30x).

Please see the attached example project, which shows how to reproduce the issue. An Azure storage account is needed to run the test; please reach out to me if you don't have access to one, and I can share a temporary account for testing.

How to reproduce the issue:

  1. Start the example project
  2. Set the connection string to Azure Blob Storage (contact me if you don't have access to an Azure storage account)
  3. Create a container with the same name as the variable 'containerName'
  4. Run the code with Parquet.NET 4.11.X or higher, measure execution time
  5. Downgrade package reference to Parquet.NET 4.10.1
  6. Run the code again, compare execution time with previous run

Result in my environment:

Upload time using Parquet.NET 4.10.1: < 10 sec
Upload time using Parquet.NET 4.11.X or higher: > 150 sec

Example project: AzureBlobStorageFlushIssue.zip

Proposed solution (Please note that these are only ideas which I want to share, please feel free to comment on them!)

The class Parquet.File.DataColumnWriter has a method CompressAndWriteAsync. Since Parquet.NET 4.11, this method has been adjusted to no longer use the .NET MemoryStream, and a _stream.Flush(); call has been added. As a result, the underlying stream is flushed every time a column is written. This works fine when writing to files, but when writing to Azure Blob Storage the constant flushing significantly reduces write performance.

Option 1: Remove the line _stream.Flush(); completely. I ran some tests in my environment, and with this line commented out (as in the screenshot), the performance is back to "normal".

Option 2: Extend ParquetOptions with a property/flag that allows turning off the constant flushing of the stream.

Option 3: Use a workaround. I'm not sure if or how this can be achieved, but perhaps the blob stream that is passed into the ParquetWriter can be wrapped so that calls to Flush() are ignored? Please advise if you see a good workaround for this.
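
A minimal sketch of the wrapper idea from Option 3, assuming the underlying blob stream still flushes and commits its data when it is disposed. NoFlushStream is a hypothetical helper name; it is not part of Parquet.NET or Azure.Storage.Blobs, and it has not been measured against the example project.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of Parquet.NET or Azure.Storage.Blobs).
// Forwards everything to the inner stream, except that explicit
// Flush()/FlushAsync() calls are ignored.
public sealed class NoFlushStream : Stream
{
    private readonly Stream _inner;

    public NoFlushStream(Stream inner) =>
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));

    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => _inner.CanSeek;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => _inner.Length;

    public override long Position
    {
        get => _inner.Position;
        set => _inner.Position = value;
    }

    // Swallow flush requests; the inner stream is assumed to flush on dispose.
    public override void Flush() { }

    public override Task FlushAsync(CancellationToken cancellationToken) =>
        Task.CompletedTask;

    public override int Read(byte[] buffer, int offset, int count) =>
        _inner.Read(buffer, offset, count);

    public override long Seek(long offset, SeekOrigin origin) =>
        _inner.Seek(offset, origin);

    public override void SetLength(long value) => _inner.SetLength(value);

    public override void Write(byte[] buffer, int offset, int count) =>
        _inner.Write(buffer, offset, count);

    public override Task WriteAsync(
        byte[] buffer, int offset, int count, CancellationToken cancellationToken) =>
        _inner.WriteAsync(buffer, offset, count, cancellationToken);

    public override ValueTask WriteAsync(
        ReadOnlyMemory<byte> buffer, CancellationToken cancellationToken = default) =>
        _inner.WriteAsync(buffer, cancellationToken);

    // Dispose is intentionally not overridden: disposing the wrapper leaves the
    // inner stream open, so the caller remains responsible for disposing it.
}
```

With the initialization code shown earlier, the wrapper would sit between the blob stream and the writer, i.e. new NoFlushStream(stream) would be passed to ParquetWriter.CreateAsync instead of streamWriter.BaseStream. The blob stream itself must still be disposed afterwards, since that is the only point at which buffered data would be flushed.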

Thanks for looking into this!

Failing test

No response

slateAGF commented 9 months ago

Ohh that flush...it was removed in 4.9.1, which caused sporadic problems for us. https://github.com/aloneguid/parquet-dotnet/commit/abdd0c45a5634fd494e31438f550af7b5e6ceb52

andreaslennartz commented 9 months ago

Hi @slateAGF, thanks for getting back. I tried to open your link, but GitHub says "No results matched your search."

Sorry, the screenshot was perhaps misleading - the Flush() is still there: https://github.com/aloneguid/parquet-dotnet/blob/fe4f82812318ba0679553fb2da9c30b401961f1d/src/Parquet/File/DataColumnWriter.cs#L87

Could you remove this line again for the next version(s)? It causes issues when writing into Azure Blob storage.

aloneguid commented 6 months ago

released in 4.18, thanks a lot

andreaslennartz commented 6 months ago

Perfect. Thank you for maintaining this great library!