aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

Possible memory leak in DataColumnWriter #436

Open TapaniAalto opened 7 months ago

TapaniAalto commented 7 months ago

While investigating a potential memory leak in my Azure Function app, I noticed that the Managed Memory Tool in Visual Studio reports many byte[] objects on the .NET managed heap. When I trace those objects back, they appear to have Microsoft.IO.RecyclableMemoryStreamManager as their root.

Parquet.Net uses RecyclableMemoryStreamManager in DataColumnWriter.

Its documentation states:

Important!: If you do not set MaximumFreeLargePoolBytes and MaximumFreeSmallPoolBytes there is the possibility for unbounded memory growth!

And also that:

the RecyclableMemoryStreamManager will use the properties MaximumFreeSmallPoolBytes and MaximumFreeLargePoolBytes to determine whether to put those buffers back in the pool, or let them go (and thus be garbage collected). It is through these properties that you determine how large your pool can grow. If you set these to 0, you can have unbounded pool growth, which is essentially indistinguishable from a memory leak.

It seems that the DataColumnWriter does not set these properties, so I guess that might be the reason for my app's high memory usage.

Should the MaximumFreeSmallPoolBytes and MaximumFreeLargePoolBytes properties be user-configurable, perhaps via ParquetOptions?
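
For illustration, here is a minimal sketch of what capping the pools could look like. It assumes Microsoft.IO.RecyclableMemoryStream 2.x, where MaximumFreeSmallPoolBytes and MaximumFreeLargePoolBytes are settable properties on the manager (newer major versions may configure these differently). The BoundedStreamManager helper and its parameter names are hypothetical, and this is not how Parquet.Net currently creates its manager.

```csharp
using System;
using Microsoft.IO;

// Hypothetical helper, not part of Parquet.Net or RecyclableMemoryStream.
// Assumes the 2.x API, where both caps are settable properties.
public static class BoundedStreamManager
{
    public static RecyclableMemoryStreamManager Create(
        long maxFreeSmallPoolBytes,
        long maxFreeLargePoolBytes)
    {
        if (maxFreeSmallPoolBytes <= 0 || maxFreeLargePoolBytes <= 0)
        {
            // Per the documentation quoted above, 0 means the pool can grow
            // without bound, so this sketch rejects it.
            throw new ArgumentOutOfRangeException(
                nameof(maxFreeSmallPoolBytes),
                "Pool caps must be positive to bound pool growth.");
        }

        return new RecyclableMemoryStreamManager
        {
            // Free buffers beyond these caps are dropped instead of being
            // returned to the pool, so the garbage collector can reclaim them.
            MaximumFreeSmallPoolBytes = maxFreeSmallPoolBytes,
            MaximumFreeLargePoolBytes = maxFreeLargePoolBytes
        };
    }
}
```

A caller would create one manager, for example BoundedStreamManager.Create(16 * 1024 * 1024, 64 * 1024 * 1024), and take streams from it with manager.GetStream(). The sizes here are placeholders; sensible defaults would depend on the workload.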

aloneguid commented 7 months ago

Thanks for looking into this. Seems like a good idea to add an option and maybe also set some default limit.