aloneguid / parquet-dotnet

Fully managed Apache Parquet implementation
https://aloneguid.github.io/parquet-dotnet/
MIT License

[WIP] Modular Encryption Support When Reading Parquet Files #480

Open mukunku opened 4 months ago

mukunku commented 4 months ago

Summary

I made significant progress on getting Footer Decryption working with parquet files (#191).

I'm opening this work-in-progress pull request with hopes that some other folks can help get this across the finish line.

AES_GCM_V1

Thanks to a test file @pzatschl shared with me, I was able to implement the AES_GCM_V1 encryption algorithm.

link to code
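
For readers unfamiliar with the format: the Parquet modular encryption specification lays out each AES_GCM_V1 module as a 4-byte little-endian length, a 12-byte nonce, the ciphertext, and a 16-byte tag. Below is a minimal sketch of decrypting one such module with .NET's `AesGcm`. It is not the code in this PR; `GcmModuleSketch` and `DecryptModule` are hypothetical names, and the AAD is taken as a ready-made input rather than assembled from the file ID and module ordinals.

```csharp
using System;
using System.Buffers.Binary;
using System.Security.Cryptography;

public static class GcmModuleSketch {
    private const int LengthSize = 4;   // little-endian length prefix
    private const int NonceSize = 12;   // GCM nonce
    private const int TagSize = 16;     // GCM authentication tag

    // Decrypts one module laid out as: length | nonce | ciphertext | tag.
    // Hypothetical helper for illustration only.
    public static byte[] DecryptModule(ReadOnlySpan<byte> module, byte[] key, ReadOnlySpan<byte> aad) {
        if (module.Length < LengthSize)
            throw new InvalidOperationException("Module is too short to hold a length prefix.");

        // The prefix counts everything after itself: nonce + ciphertext + tag.
        int total = BinaryPrimitives.ReadInt32LittleEndian(module);
        if (total < NonceSize + TagSize || total > module.Length - LengthSize)
            throw new InvalidOperationException("Malformed encrypted module.");

        ReadOnlySpan<byte> body = module.Slice(LengthSize, total);
        ReadOnlySpan<byte> nonce = body.Slice(0, NonceSize);
        ReadOnlySpan<byte> cipher = body.Slice(NonceSize, total - NonceSize - TagSize);
        ReadOnlySpan<byte> tag = body.Slice(total - TagSize, TagSize);

        byte[] plain = new byte[cipher.Length];
        // On .NET 8+ this constructor is marked obsolete in favor of the
        // overload that also takes the tag size; it is used here for portability.
        using var aes = new AesGcm(key);
        // Throws a CryptographicException if the tag does not authenticate.
        aes.Decrypt(nonce, cipher, tag, plain, aad);
        return plain;
    }
}
```

Because GCM authenticates the data, a wrong key or wrong AAD would be expected to surface as an exception from `Decrypt` rather than as garbage output.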

AES_GCM_CTR_V1

I also implemented the AES_GCM_CTR_V1 encryption algorithm. However, I don't have any test files to confirm it's working 🙃

link to code

How to test

Check out the unit test I added, which exercises the sample file I mentioned above: link to code

However, even though I can decrypt the test file successfully, the data itself doesn't seem to be valid, so I had to add this try-catch as a temporary workaround: link to code. We should remove it once we have a proper test file. (Unfortunately, I don't have any other test files.)

mukunku commented 4 months ago

I was able to tidy up the PR. However, a bug occurs when running dotnet test, and it is breaking the PR checks. I tracked it down to the following error, although I have no clue why it's happening:

The active test run was aborted. Reason: Test host process crashed : Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
   at System.MemoryExtensions.AsSpan[[System.Int32, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]](Int32[], Int32)
   at Parquet.File.PackedColumn.AllocateOrGetDictionaryIndexes(Int32)
   at Parquet.File.DataColumnReader.ReadColumn(System.Span`1<Byte>, Parquet.Meta.Encoding, Int64, Int32, Parquet.File.PackedColumn)
   at Parquet.File.DataColumnReader+<ReadDataPageV1Async>d__15.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Parquet.File.DataColumnReader+<ReadDataPageV1Async>d__15, Parquet, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]](<ReadDataPageV1Async>d__15 ByRef)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder.Start[[Parquet.File.DataColumnReader+<ReadDataPageV1Async>d__15, Parquet, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]](<ReadDataPageV1Async>d__15 ByRef)
   at Parquet.File.DataColumnReader.ReadDataPageV1Async(Parquet.Meta.PageHeader, Parquet.File.PackedColumn)
   at Parquet.File.DataColumnReader+<ReadAsync>d__10.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[Parquet.File.DataColumnReader+<ReadAsync>d__10, Parquet, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]](<ReadAsync>d__10 ByRef)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1[[System.__Canon, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].Start[[Parquet.File.DataColumnReader+<ReadAsync>d__10, Parquet, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]](<ReadAsync>d__10 ByRef)
   at Parquet.File.DataColumnReader.ReadAsync(System.Threading.CancellationToken)
   at Parquet.ParquetRowGroupReader.ReadColumnAsync(Parquet.Schema.DataField, System.Threading.CancellationToken)
   at Parquet.Test.ParquetReaderOnTestFilesTest+<DecryptFile_UTF8_AesGcmV1_192bit>d__2.MoveNext()
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Parquet.Test.ParquetReaderOnTestFilesTest+<DecryptFile_UTF8_AesGcmV1_192bit>d__2, Parquet.Test, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]].ExecutionContextCallback(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Parquet.Test.ParquetReaderOnTestFilesTest+<DecryptFile_UTF8_AesGcmV1_192bit>d__2, Parquet.Test, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]].MoveNext(System.Threading.Thread)
   at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1+AsyncStateMachineBox`1[[System.Threading.Tasks.VoidTaskResult, System.Private.CoreLib, Version=8.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e],[Parquet.Test.ParquetReaderOnTestFilesTest+<DecryptFile_UTF8_AesGcmV1_192bit>d__2, Parquet.Test, Version=1.0.0.0, Culture=neutral, PublicKeyToken=d380b3dee6d01926]].MoveNext()
   at Xunit.Sdk.AsyncTestSyncContext+<>c__DisplayClass7_0.<Post>b__1(System.Object)
   at Xunit.Sdk.MaxConcurrencySyncContext.RunOnSyncContext(System.Threading.SendOrPostCallback, System.Object)
   at Xunit.Sdk.MaxConcurrencySyncContext+<>c__DisplayClass11_0.<WorkerThreadProc>b__0(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at Xunit.Sdk.ExecutionContextHelper.Run(System.Object, System.Action`1<System.Object>)
   at Xunit.Sdk.MaxConcurrencySyncContext.WorkerThreadProc()
   at Xunit.Sdk.XunitWorkerThread+<>c.<QueueUserWorkItem>b__5_0(System.Object)
   at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object)
   at System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef, System.Threading.Thread)

The active Test Run was aborted because the host process exited unexpectedly. Please inspect the call stack above, if available, to get more information about where the exception originated from.
The test running when the crash occurred:
Parquet.Test.ParquetReaderOnTestFilesTest.DecryptFile_UTF8_AesGcmV1_192bit

This test may, or may not be the source of the crash.

mukunku commented 4 months ago

Okay, some findings.

If any test runs after my new file decryption test in the same xUnit collection, it crashes the CLR. I moved my test to its own test collection and disabled parallelization, which essentially means xUnit will run my test in isolation. See: https://github.com/aloneguid/parquet-dotnet/pull/480/commits/9e0bbbe06c2db11158627644162f20a49ddec1df
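
For anyone who hasn't used this xUnit feature, the isolation described above can be sketched roughly as follows. The collection and class names here are hypothetical and are not taken from the linked commit; only the `CollectionDefinition` and `Collection` attributes and the `DisableParallelization` property are real xUnit API.

```csharp
using Xunit;

// Hypothetical collection name. DisableParallelization = true means tests in
// this collection do not run in parallel with tests from any other collection.
[CollectionDefinition("EncryptedFileTests", DisableParallelization = true)]
public class EncryptedFileTestsCollection { }

// Placing the test class in that collection opts it out of parallel runs.
[Collection("EncryptedFileTests")]
public class EncryptedFileDecryptionTest {
    [Fact]
    public void DecryptsSampleFile() {
        // Placeholder body; the real test in this PR is
        // DecryptFile_UTF8_AesGcmV1_192bit.
        Assert.True(true);
    }
}
```

Note that this only controls scheduling. If the decryption path corrupts memory, isolating the test hides the effect on other tests but does not remove the underlying bug.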

This way my test sometimes works, but it still fails at random with similar memory mismanagement issues, so it's flaky at the moment. This is just a band-aid to get the PR green. I'm sure I'm doing something stupid somewhere that's causing this issue, but I haven't been able to find it so far.