aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License
542 stars 141 forks source link

[BUG]: ParquetSerializer can sometimes fail when populating _typeToAssembler cache in parallel #411

Closed Mitkee closed 7 months ago

Mitkee commented 9 months ago

Library Version

4.16.3

OS

Windows

OS Architecture

64 bit

How to reproduce?

Unfortunately, no clear steps to reproduce, but seen this happening occasionally in production:


System.InvalidOperationException:
   at System.ThrowHelper.ThrowInvalidOperationException_ConcurrentOperationsNotSupported (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Collections.Generic.Dictionary`2.FindValue (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at System.Collections.Generic.Dictionary`2.TryGetValue (System.Private.CoreLib, Version=6.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e)
   at Parquet.Serialization.ParquetSerializer.GetAssembler (Parquet, Version=4.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926)
   at Parquet.Serialization.ParquetSerializer+<DeserializeAsync>d__6`1.MoveNext (Parquet, Version=4.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926)

Since the _typeToAssembler cache is not concurrent, triggering multiple operations populating this cache in parallel is the cause.

Failing test

Despite effort, I've not been able to reproduce this in a testcase :(
Mitkee commented 9 months ago

As a workaround:

public static class ParquetCachePopulator
{
    public static async Task PopulateParquetTypeCache<T>() where T: new()
    {
        var row = new List<T>() { new T() };
        using var stream = new MemoryStream();
        await ParquetSerializer.SerializeAsync(row, stream);
        stream.Position = 0;
        await ParquetSerializer.DeserializeAsync<T>(stream);
    }
}

..and then running

//Pre-load parquet types:
await ParquetCachePopulator.PopulateParquetTypeCache<StoredContract>();
await ParquetCachePopulator.PopulateParquetTypeCache<StoredPublicTrade>();
await ParquetCachePopulator.PopulateParquetTypeCache<FullOrderEventParquetModel>();
await ParquetCachePopulator.PopulateParquetTypeCache<OrderEventParquetModelForFtpExport>();
await ParquetCachePopulator.PopulateParquetTypeCache<PrivateTradeEventToOrderMap>();
await ParquetCachePopulator.PopulateParquetTypeCache<TradeEventParquetModel>();

pre-startup might very well do the trick. I'll give that a spin and will report back in a few days to see if it stays stable. Now I'm not against pre-loading known types, could even consider making that mandatory to avoid the issue? 🤔 Alternatively, fix the concurrency problem in the type cache, which will most likely come with some minor performance hit.