dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.23k stars 4.45k forks source link

[NOREVIEW][NOMERGE] HTTP/3 stress data corruption hunt #101624

Closed rzikm closed 1 week ago

rzikm commented 2 weeks ago

This PR is just for running http stress pipeline with custom code which will hopefully allow us to diagnose https://github.com/dotnet/runtime/issues/76183.

dotnet-policy-service[bot] commented 2 weeks ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

Got a hit? https://dev.azure.com/dnceng-public/public/_build/results?buildId=658225&view=logs&j=22fea640-1099-5f32-ec5d-316ad83f359a&t=04294904-0aab-5400-4e47-3fa8cb0f125b

client_1  | Error from iteration 112193 ( 8: POST Duplex Slow) in task 1 with 412172 successes / 1 fails:
client_1  | System.IndexOutOfRangeException: Index was outside the bounds of the array.
client_1  |    at HttpStress.ClientOperations.ValidateByteContent(Byte[] content, Int32 length, String details) in /app/ClientOperations.cs:line 512
client_1  |    at HttpStress.ClientOperations.<>c.<<get_Operations>b__1_8>d.MoveNext()
client_1  | --- End of stack trace from previous location ---
client_1  |    at HttpStress.StressClient.<>c__DisplayClass17_0.<<StartCore>g__RunWorker|0>d.MoveNext() in /app/StressClient.cs:line 204
client_1  | 

Weird that it got through all the validation, I fixed the validation to account for shorter payloads. Let's see if we get lucky again.

rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

Another hit 🥳

https://dev.azure.com/dnceng-public/public/_build/results?buildId=658739&view=logs&j=22fea640-1099-5f32-ec5d-316ad83f359a&t=04294904-0aab-5400-4e47-3fa8cb0f125b

server_1  | Process terminated. Assertion failed.
server_1  | Diverging at offset 15, expected 0x0f, got 0x15
server_1  |    at HttpStress.StressServer.<>c.<<MapRoutes>b__7_8>d.MoveNext() in /app/StressServer.cs:line 306
server_1  |    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 179
server_1  |    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs:line 368
server_1  |    at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpRequestStream.ReadAsyncInternal(Memory`1 destination, CancellationToken cancellationToken)
server_1  |    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 179
server_1  |    at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http3.Http3MessageBody.ReadAsync(CancellationToken cancellationToken)
server_1  |    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 179
server_1  |    at System.Threading.ThreadPoolWorkQueue.Dispatch() in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs:line 913
server_1  |    at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() in /_/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs:line 128
server_1  | [createdump] Gathering state for process 8 dotnet
server_1  | [createdump] Crashing thread 0020 signal 6 (0006)
server_1  | [createdump] Writing minidump with heap to file /dumps-share/coredump.8
server_1  | [createdump] Written 498233344 bytes (121639 pages) to core file
server_1  | [createdump] Target process is alive
server_1  | [createdump] Dump successfully written in 668ms
server_1  | HttpStress failed
server_1  | Aborted (core dumped)

Let's see if the dump contains something actionable, but hitting assert at this point seems to suggest there may be a bug in Kestrel O.o (as we have validated incoming http frames on QuicStream recevie path.)

rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

More hits

client_1  | Error from iteration 93893 ( 8: POST Duplex Slow) in task 1 with 350986 successes / 1 fails:
client_1  | System.Exception: Expected response content length 179, got 67. server checksum does not match client value.
client_1  |    at HttpStress.ClientOperations.ValidateByteContent(Byte[] content, Int32 length, String details) in /app/ClientOperations.cs:line 522
client_1  |    at HttpStress.ClientOperations.<>c.<<get_Operations>b__1_8>d.MoveNext()
client_1  | --- End of stack trace from previous location ---
client_1  |    at HttpStress.StressClient.<>c__DisplayClass17_0.<<StartCore>g__RunWorker|0>d.MoveNext() in /app/StressClient.cs:line 204
client_1  | 
server_1  | Process terminated. Assertion failed.
server_1  | Diverging at offset 160, expected 0xa0, got 0xa1
server_1  |    at HttpStress.StressServer.<>c.<<MapRoutes>b__7_8>d.MoveNext() in /app/StressServer.cs:line 306
server_1  |    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 179
server_1  |    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread) in /_/src/libraries/System.Private.CoreLib/src/System/Runtime/CompilerServices/AsyncTaskMethodBuilderT.cs:line 368
server_1  |    at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http.HttpRequestStream.ReadAsyncInternal(Memory`1 destination, CancellationToken cancellationToken)
server_1  |    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 179
server_1  |    at Microsoft.AspNetCore.Server.Kestrel.Core.Internal.Http3.Http3MessageBody.ReadAsync(CancellationToken cancellationToken)
server_1  |    at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state) in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ExecutionContext.cs:line 179
server_1  |    at System.Threading.ThreadPoolWorkQueue.Dispatch() in /_/src/libraries/System.Private.CoreLib/src/System/Threading/ThreadPoolWorkQueue.cs:line 913
server_1  |    at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() in /_/src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs:line 128
server_1  | [createdump] Gathering state for process 7 dotnet
server_1  | [createdump] Crashing thread 0020 signal 6 (0006)
server_1  | [createdump] Writing minidump with heap to file /dumps-share/coredump.7
server_1  | [createdump] Written 515735552 bytes (125912 pages) to core file
server_1  | [createdump] Target process is alive
server_1  | [createdump] Dump successfully written in 641ms
server_1  | Aborted (core dumped)
server_1  | HttpStress failed
rzikm commented 2 weeks ago

The last hit above gives some clues. Starting from the locals at the callsite and going HttpRequestStream->HttpRequestPipeReader->Http3MessageBody->Http3Stream->RequestBodyPipe, I see

0:018> !DumpObj /d 00007ef4eda6ea38
Name:        System.IO.Pipelines.Pipe
MethodTable: 00007f3500cbd9f8
EEClass:     00007f3500c9e040
Tracked Type: false
Size:        288(0x120) bytes
File:        /live-runtime-artifacts/testhost/net9.0-linux-Release-x64/shared/Microsoft.AspNetCore.App/9.0.0-preview.4.24223.1/System.IO.Pipelines.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007f3500cbd278  4000016       70 ...ufferSegmentStack  1 instance 00007ef4eda6eaa8 _bufferSegmentPool
00007f3500cbe228  4000017        8 ...DefaultPipeReader  0 instance 00007ef4eda6eba8 _reader
00007f3500cbf5d8  4000018       10 ...DefaultPipeWriter  0 instance 00007ef4eda6ebc8 _writer
00007f3500cbdc80  4000019       18 ...lines.PipeOptions  0 instance 00007ef4eda6e9f0 _options
00007f34fe855ef8  400001a       20        System.Object  0 instance 00007ef4eda6eb58 _sync
00007f34fe8aa6e0  400001b       40         System.Int64  1 instance              356 _unconsumedBytes
00007f34fe8aa6e0  400001c       48         System.Int64  1 instance                0 _unflushedBytes
00007f3500cbd550  400001d       80 ...nes.PipeAwaitable  1 instance 00007ef4eda6eab8 _readerAwaitable
00007f3500cbd550  400001e       b0 ...nes.PipeAwaitable  1 instance 00007ef4eda6eae8 _writerAwaitable
00007f3500cbd6f8  400001f       e0 ...es.PipeCompletion  1 instance 00007ef4eda6eb18 _writerCompletion
00007f3500cbd6f8  4000020       f0 ...es.PipeCompletion  1 instance 00007ef4eda6eb28 _readerCompletion
00007f34fe8aa6e0  4000021       50         System.Int64  1 instance               75 _lastExaminedIndex
00007f3500cce7b0  4000022       28 ...nes.BufferSegment  0 instance 00007ef4eda83a38 _readHead
00007f34fe881188  4000023       58         System.Int32  1 instance               75 _readHeadIndex
00007f34fe85d220  4000024       68       System.Boolean  1 instance                0 _disposed
00007f3500cce7b0  4000025       30 ...nes.BufferSegment  0 instance 00007ef4eda83a38 _readTail
00007f34fe881188  4000026       5c         System.Int32  1 instance              431 _readTailIndex
00007f34fe881188  4000027       60         System.Int32  1 instance                0 _minimumReadBytes
00007f3500cce7b0  4000028       38 ...nes.BufferSegment  0 instance 00007ef4eda83a38 _writingHead
00007f35007cbf20  4000029      100 ...Private.CoreLib]]  1 instance 00007ef4eda6eb38 _writingHeadMemory
00007f34fe881188  400002a       64         System.Int32  1 instance                0 _writingHeadBytesBuffered
00007f3500cbd900  400002b      110 ...ipeOperationState  1 instance 00007ef4eda6eb48 _operationState
00007f34ff817f38  400000f       10 ...Private.CoreLib]]  0   static 00007ef4ea00b5a8 s_signalReaderAwaitable
00007f34ff817f38  4000010       18 ...Private.CoreLib]]  0   static 00007ef4ea00b5e8 s_signalWriterAwaitable
00007f34ff817f38  4000011       20 ...Private.CoreLib]]  0   static 00007ef4ea00b628 s_invokeCompletionCallbacks
00007f34fef13130  4000012       28 ...g.ContextCallback  0   static 00007ef4ea00b668 s_executionContextRawCallback
00007f34ffe2ea18  4000013       30 ...endOrPostCallback  0   static 00007ef4ea00b6a8 s_syncContextExecutionContextCallback
00007f34ffe2ea18  4000014       38 ...endOrPostCallback  0   static 00007ef4ea00b6e8 s_syncContextExecuteWithoutExecutionContextCallback
00007f34ff817f38  4000015       40 ...Private.CoreLib]]  0   static 00007ef4ea00b728 s_scheduleWithExecutionContextCallback

The _readHead points to buffer where incoming data are stored, _readHeadIndex is the offset where the next byte is. looking into the underlying memory gives us

0:018> !DumpObj /d 00007ef4ea413bc0
Name:        System.Byte[]
MethodTable: 00007f34febd8860
EEClass:     00007f34febd8810
Tracked Type: false
Size:        4120(0x1018) bytes
Array:       Rank 1, Number of elements 4096, Type Byte (Print Array)
Content:     VWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~.......................................................................................
Fields:
None
0:018> db 7ef4ea413bd0 L1000
00007ef4`ea413bd0  56 57 58 59 5a 5b 5c 5d-5e 5f 60 61 62 63 64 65  VWXYZ[\]^_`abcde
00007ef4`ea413be0  66 67 68 69 6a 6b 6c 6d-6e 6f 70 71 72 73 74 75  fghijklmnopqrstu
00007ef4`ea413bf0  76 77 78 79 7a 7b 7c 7d-7e 7f 80 81 82 83 84 85  vwxyz{|}~.......
00007ef4`ea413c00  86 87 88 89 8a 8b 8c 8d-8e 8f 90 91 92 93 94 95  ................
00007ef4`ea413c10  96 97 98 99 9a 9b 9c 9d-9e 9f a1 a2 a3 a4 a5 a6  ................
...

The byte at offset 74 (The last read byte) is the a1 in the last row, where a0 should be, 1 byte of the data stream is simply missing on the server side. The validatio asserts that the data are received from MsQuic correctly, but I will add one additional validation layer to confirm the data is really read correctly from QuicStream to eliminate possibility of error on our buffering layer.

rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 2 weeks ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 2 weeks ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 1 week ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 1 week ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 1 week ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 1 week ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 1 week ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 1 week ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 1 week ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 1 week ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 1 week ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 1 week ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 1 week ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 1 week ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 1 week ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 1 week ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 1 week ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 1 week ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 1 week ago

/azp run runtime-libraries stress-http

azure-pipelines[bot] commented 1 week ago
Azure Pipelines successfully started running 1 pipeline(s).
rzikm commented 1 week ago

/azp run runtime-libraries stress-http