Closed andreaslennartz closed 6 months ago
Ohh that flush...it was removed in 4.9.1, which caused sporadic problems for us. https://github.com/aloneguid/parquet-dotnet/commit/abdd0c45a5634fd494e31438f550af7b5e6ceb52
Hi @slateAGF , thanks for getting back. Tried to open your link, but github says "No results matched your search."
Sorry, the screenshot was perhaps misleading - the Flush() call is still there:
https://github.com/aloneguid/parquet-dotnet/blob/fe4f82812318ba0679553fb2da9c30b401961f1d/src/Parquet/File/DataColumnWriter.cs#L87
Could you remove this line again for the next version(s)? It causes issues when writing into Azure Blob storage.
released in 4.18, thanks a lot
Perfect. Thank you for maintaining this great library!
Library Version
4.11, 4.12, 4.13, 4.14, 4.15, 4.16
OS
Windows 11, Azure, .NET 6
OS Architecture
64 bit
How to reproduce?
First of all, thanks for creating this great library! We encountered an issue when using Parquet.NET to directly write data using a stream on an Azure blob.
Bug details: Parquet.NET supports writing into an existing stream (e.g. a file stream). We write data into an Azure blob using the NuGet package Azure.Storage.Blobs and a BlockBlobClient. Here is an example of how the writer is initialized:
Since we upgraded from Parquet.NET version 4.10.1 to a higher version (e.g. the current 4.16.4, but all other versions from 4.11 onward are also affected), writing into the blob stream has become significantly slower (about 10x to 30x).
Please see the attached example project, which shows how to reproduce the issue. An Azure storage account is needed to run the test; please reach out to me if you don't have access to one, and I can share a temporary account for testing.
How to reproduce the issue:
Result in my environment:
Upload time using Parquet.NET 4.10.1: < 10 sec
Upload time using Parquet.NET 4.11.X or higher: > 150 sec
Example project: AzureBlobStorageFlushIssue.zip
Proposed solution (Please note that these are only ideas which I want to share, please feel free to comment on them!)
The file Parquet.File.DataColumnWriter has a method CompressAndWriteAsync. Since Parquet.NET 4.11, this method has been adjusted to no longer use the .NET MemoryStream. Also, a _stream.Flush(); call has been added. Therefore, the underlying stream is flushed each time a column has been written. This works fine when writing into files, but when writing into Azure Blob Storage this constant flushing significantly reduces the write performance.
Option 1: Remove the line _stream.Flush(); completely. I did some tests in my environment, and when commenting out this line (as in the screenshot), the performance is back to "normal".
Option 2: Extend the ParquetOptions with a property/flag that allows turning off the constant flushing of a stream.
Option 3: Use a workaround. I am not quite sure if or how this can be achieved, but perhaps the blob stream that is passed into the ParquetWriter can be wrapped so that the Flush() method is ignored? Please advise if you see a good workaround for this.
Thanks for looking into this!
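For reference, the wrapper idea from Option 3 could look roughly like the sketch below. It uses only the standard System.IO.Stream API; the class name FlushSuppressingStream is hypothetical and is not part of Parquet.NET or Azure.Storage.Blobs. It assumes that the caller keeps ownership of the inner stream and flushes or disposes it once writing is finished, so the wrapper deliberately does not override Dispose.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper (not part of any library): forwards all calls to the
// wrapped stream except Flush/FlushAsync, which are turned into no-ops.
public sealed class FlushSuppressingStream : Stream
{
    private readonly Stream _inner;

    public FlushSuppressingStream(Stream inner)
    {
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));
    }

    public override bool CanRead => _inner.CanRead;
    public override bool CanSeek => _inner.CanSeek;
    public override bool CanWrite => _inner.CanWrite;
    public override long Length => _inner.Length;

    public override long Position
    {
        get => _inner.Position;
        set => _inner.Position = value;
    }

    // Intentionally empty: per-column flush requests are swallowed here.
    public override void Flush() { }

    public override Task FlushAsync(CancellationToken cancellationToken)
        => Task.CompletedTask;

    public override int Read(byte[] buffer, int offset, int count)
        => _inner.Read(buffer, offset, count);

    public override long Seek(long offset, SeekOrigin origin)
        => _inner.Seek(offset, origin);

    public override void SetLength(long value) => _inner.SetLength(value);

    public override void Write(byte[] buffer, int offset, int count)
        => _inner.Write(buffer, offset, count);

    public override Task WriteAsync(byte[] buffer, int offset, int count,
        CancellationToken cancellationToken)
        => _inner.WriteAsync(buffer, offset, count, cancellationToken);

    // Dispose is not overridden on purpose: the caller still owns the inner
    // stream and must flush/dispose it so that buffered data is committed.
}
```

An instance of this wrapper would be passed to the ParquetWriter in place of the blob stream itself, and the original blob stream would be disposed by the caller after the writer is done. I have not verified this against the blob stream, so whether it fully restores the 4.10.1 timings is an open question.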
Failing test
No response