dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.18k stars 4.72k forks source link

corerun with DOTNET_EnableEventLog: "Heap contamination detected! "HeapFree was called on a heap other than the one that memory was allocated from." #56888

Closed am11 closed 2 years ago

am11 commented 3 years ago

Based on current main (5bef921), build product and tests in Debug configuration on Ubuntu 18.04 x64:

./build.sh
./src/tests/build.sh -p:LibrariesConfiguration=Debug

Running a simple hello world app works:

cd artifacts/bin/testhost/net6.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0
corerun /tmp/helloworld.dll

running the same app with event log enabled fails an assertion:

DOTNET_EnableEventLog=1 corerun /tmp/helloworld.dll

Assert failure(PID 543 [0x0000021f], Thread: 543 [0x021f]): !"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\n" "Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it."
    File: /runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp Line: 296
    Image: /runtime/artifacts/bin/testhost/net6.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0/corerun

corehost, OTOH, has no issue with DOTNET_EnableEventLog=1 (or COMPlus_EnableEventLog=1).

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

am11 commented 3 years ago

cc @josalem, @janvorli I found this when writing LTTng tests (wip branch: https://github.com/dotnet/runtime/compare/main...am11:feature/tracing/tests/lttng-events) and reduced it to a simple hello world. Setting DOTNET_EnableEventLog=1 on anything chokes corerun.

josalem commented 3 years ago

I'm not sure what would be different for LTTng between corehost and corerun. That seems odd. Is there any more to the stack when the assert is thrown? clrhost_nodependencies.cpp:296 is just ClrFree. I'll see if I can get an LTTng environment set up. If you can easily reproduce this though, it might be faster if you can run it under LLDB and get a bt at the assert.

(lldb corerun -- helloworld.dll -> (lldb) settings set target.env-vars DOTNET_EnableEventLog=1 -> etc.)

am11 commented 3 years ago

This is only happening in debug build, I tested with release build and the test passed.

Here is the backtrace of helloworld.dll from lldb, when launching with corerun:

root@968b4a55cce2:/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root# lldb -f ./corerun /tmp/helloworld/bin/Release/net6.0/helloworld.dll
(lldb) target create "./corerun"
Current executable set to './corerun' (x86_64).
(lldb) settings set -- target.run-args  "/tmp/helloworld/bin/Release/net6.0/helloworld.dll"
(lldb) settings set target.env-vars DOTNET_EnableEventLog=1
(lldb) r
Process 1756 launched: './corerun' (x86_64)

Assert failure(PID 1756 [0x000006dc], Thread: 1756 [0x06dc]): !"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\n" "Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it."
    File: /runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp Line: 296
    Image: /runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun

Process 1756 stopped
* thread #1, name = 'corerun', stop reason = signal SIGTRAP
    frame #0: 0x00007ffff6b52559 libcoreclr.so`DBG_DebugBreak at debugbreak.S:9
   6    
   7    LEAF_ENTRY DBG_DebugBreak, _TEXT
   8            int3
-> 9            ret
   10   LEAF_END_MARKED DBG_DebugBreak, _TEXT
   11   
(lldb) bt all
* thread #1, name = 'corerun', stop reason = signal SIGTRAP
  * frame #0: 0x00007ffff6b52559 libcoreclr.so`DBG_DebugBreak at debugbreak.S:9
    frame #1: 0x00007ffff6aa43cb libcoreclr.so`::DebugBreak() at debug.cpp:406
    frame #2: 0x00007ffff680e828 libcoreclr.so`::DbgAssertDialog(szFile="/runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp", iLine=296, szExpr="!\"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\\n\" \"Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it.\"") at debug.cpp:697
    frame #3: 0x00007ffff67c13c2 libcoreclr.so`ClrFree(p=0x0000555555661c30) at clrhost_nodependencies.cpp:295
    frame #4: 0x00007ffff67c1175 libcoreclr.so`operator delete[](p=0x0000555555661c40) at clrhost_nodependencies.cpp:410
    frame #5: 0x00007ffff6bab80d libcoreclr.so`::FireEtXplatRuntimeInformationStart(ClrInstanceID=0, Sku=2, BclMajorVersion=6, BclMinorVersion=0, BclBuildNumber=0, BclQfeNumber=0, VMMajorVersion=42, VMMinorVersion=42, VMBuildNumber=42, VMQfeNumber=42424, StartupFlags=8388611, StartupMode='\0', CommandLine=u"", ComObjectGuid=0x00007ffff5fa9600, RuntimeDllPath=u"/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so") at eventprovdotnetruntime.cpp:4905
    frame #6: 0x00007ffff665ee35 libcoreclr.so`FireEtwRuntimeInformationStart(ClrInstanceID=0, Sku=2, BclMajorVersion=6, BclMinorVersion=0, BclBuildNumber=0, BclQfeNumber=0, VMMajorVersion=42, VMMinorVersion=42, VMBuildNumber=42, VMQfeNumber=42424, StartupFlags=8388611, StartupMode='\0', CommandLine=u"", ComObjectGuid=0x00007ffff5fa9600, RuntimeDllPath=u"/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so", ActivityId=0x0000000000000000, RelatedActivityId=0x0000000000000000) at clretwallmain.h:2811
    frame #7: 0x00007ffff664b58e libcoreclr.so`ETW::InfoLog::RuntimeInformation(type=0) at eventtrace.cpp:5111
    frame #8: 0x00007ffff6a73c1f libcoreclr.so`EEStartupHelper() at ceemain.cpp:767
    frame #9: 0x00007ffff6a75684 libcoreclr.so`EEStartup(this=0x00007fffffffda88, p=0x0000000000000000)::$_1::operator()(void*) const at ceemain.cpp:1153
    frame #10: 0x00007ffff6a72a5b libcoreclr.so`EEStartup() at ceemain.cpp:1155
    frame #11: 0x00007ffff6a72862 libcoreclr.so`EnsureEEStarted() at ceemain.cpp:321
    frame #12: 0x00007ffff6219bed libcoreclr.so`CorHost2::Start(this=0x0000555555634690) at corhost.cpp:101
    frame #13: 0x00007ffff6153006 libcoreclr.so`::coreclr_initialize(exePath="/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun", appDomainFriendlyName="corerun", propertyCount=4, propertyKeys=0x000055555558a0c0, propertyValues=0x000055555558a140, hostHandle=0x0000555555566068, domainId=0x0000555555566070) at unixinterface.cpp:251
    frame #14: 0x000055555555f127 corerun`run(config=0x00007fffffffe460) at corerun.cpp:352
    frame #15: 0x000055555555c616 corerun`main(argc=2, argv=0x00007fffffffe5d8) at corerun.cpp:554
    frame #16: 0x00007ffff6ca1bf7 libc.so.6`__libc_start_main(main=(corerun`main at corerun.cpp:546), argc=2, argv=0x00007fffffffe5d8, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe5c8) at libc-start.c:310
    frame #17: 0x000055555555b02a corerun`_start + 42
  thread #2, name = 'corerun-ust'
    frame #0: 0x00007ffff79c2657 libpthread.so.0`__libc_recvmsg + 71
    frame #1: 0x00007ffff5985c23 liblttng-ust.so.0`ustcomm_recv_unix_sock + 179
    frame #2: 0x00007ffff598b241 liblttng-ust.so.0`___lldb_unnamed_symbol62$$liblttng-ust.so.0 + 2241
    frame #3: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #4: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #3, name = 'corerun-ust'
    frame #0: 0x00007ffff6d9b639 libc.so.6`syscall at syscall.S:38
    frame #1: 0x00007ffff598abd3 liblttng-ust.so.0`___lldb_unnamed_symbol62$$liblttng-ust.so.0 + 595
    frame #2: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #3: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #4, name = 'corerun'
    frame #0: 0x00007ffff6d94cb9 libc.so.6`__GI___poll(fds=0x00007fffed55ca20, nfds=1, timeout=-1) at poll.c:29
    frame #1: 0x00007ffff6b30459 libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(this=0x0000555555638900, iTimeout=-1, pRecvBuf="", iBytes=1) at synchmanager.cpp:2233
    frame #2: 0x00007ffff6b2f618 libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadCmdFromProcessPipe(this=0x0000555555638900, iPollTimeout=-1, pswcWorkerCmd=0x00007fffed55cde4, pshridMarshaledData=0x00007fffed55cdd8, pdwData=0x00007fffed55cdd4) at synchmanager.cpp:2011
    frame #3: 0x00007ffff6b2e1c6 libcoreclr.so`CorUnix::CPalSynchronizationManager::WorkerThread(pArg=0x0000555555638900) at synchmanager.cpp:1714
    frame #4: 0x00007ffff6b4d315 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x0000555555639620) at thread.cpp:1862
    frame #5: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #6: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #5, name = 'corerun'
    frame #0: 0x00007ffff6d94cb9 libc.so.6`__GI___poll(fds=0x00007fff70008ee0, nfds=1, timeout=-1) at poll.c:29
    frame #1: 0x00007ffff689db13 libcoreclr.so`ipc_poll_fds(fds=0x00007fff70008ee0, nfds=1, timeout=4294967295) at ds-ipc-pal-socket.c:485
    frame #2: 0x00007ffff689d904 libcoreclr.so`ds_ipc_poll(poll_handles_data=0x00007fffecd5bb60, poll_handles_data_len=1, timeout_ms=4294967295, callback=(libcoreclr.so`server_warning_callback(char const*, unsigned int) at ds-server.c:110))(char const*, unsigned int)) at ds-ipc-pal-socket.c:1082
    frame #3: 0x00007ffff6a5177b libcoreclr.so`ds_ipc_stream_factory_get_next_available_stream(callback=(libcoreclr.so`server_warning_callback(char const*, unsigned int) at ds-server.c:110))(char const*, unsigned int)) at ds-ipc.c:395
    frame #4: 0x00007ffff6a4eba7 libcoreclr.so`server_thread(data=0x0000000000000000) at ds-server.c:127
    frame #5: 0x00007ffff6b4d315 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x0000555555657640) at thread.cpp:1862
    frame #6: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #7: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95

and without settings set target.env-vars DOTNET_EnableEventLog=1, it works fine:

(lldb) r
Process 1576 launched: './corerun' (x86_64)
Hello World!
Process 1576 exited with status = 0 (0x00000000) 

helloworld.dll was generated with:

$ /runtime/.dotnet/dotnet  new console -o /tmp/helloworld
$ cd /tmp/helloworld
$ /runtime/.dotnet/dotnet build -c Release
janvorli commented 3 years ago

I wonder if it is yet another case of a mismatch between standard new operator and our overriden delete[], like we've seen in mscordaccore. Isn't this code by any chance using the STL? That would be bad.

janvorli commented 3 years ago

Ok, seems no STL is there. But I can see the problem. The ResizeBuffer is in PAL and as such uses standard new operator. The code that asserts is built on top of PAL though (like runtime) and so it cannot use the standard delete operator and uses our override instead. The reason why it works in release builds is that we add extra information into the memory around the allocation only in non-release builds. So for release builds, our ClrFree becomes just free and it works.

janvorli commented 3 years ago

It seems the right fix would be to implement a function to delete the buffer in the eventproviderhelpers.cpp like the ResizeBuffer and call it instead of the delete. That way both new and delete would be guaranteed to be compatible.

ghost commented 3 years ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Issue Details
Based on current main (5bef921), build product and tests in Debug configuration on Ubuntu 18.04 x64: ```sh ./build.sh ./src/tests/build.sh -p:LibrariesConfiguration=Debug ``` Running a simple hello world app works: ```sh cd artifacts/bin/testhost/net6.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0 corerun /tmp/helloworld.dll ``` running the same app with event log enabled fails an assertion: ```sh DOTNET_EnableEventLog=1 corerun /tmp/helloworld.dll Assert failure(PID 543 [0x0000021f], Thread: 543 [0x021f]): !"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\n" "Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it." File: /runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp Line: 296 Image: /runtime/artifacts/bin/testhost/net6.0-Linux-Debug-x64/shared/Microsoft.NETCore.App/6.0.0/corerun ``` corehost, OTOH, has no issue with `DOTNET_EnableEventLog=1` (or `COMPlus_EnableEventLog=1`).
Author: am11
Assignees: -
Labels: `area-Diagnostics-coreclr`, `untriaged`
Milestone: -
josalem commented 3 years ago

To make sure we're on the same page, this is the ResizeBuffer you're talking about?

https://github.com/dotnet/runtime/blob/1fb151f63dca644347a5c608d7ab17f7cb8e1ccb/src/coreclr/pal/src/eventprovider/lttngprovider/eventproviderhelpers.cpp#L14-L39

Looks like EventPipe got around this by having ResizeBuffer re-defined on top of the PAL:

https://github.com/dotnet/runtime/blob/1fb151f63dca644347a5c608d7ab17f7cb8e1ccb/src/coreclr/scripts/genEventPipe.py#L419-L453

janvorli commented 3 years ago

Yes, I was talking about the one in the eventproviderhelpers.cpp. If you added a DeleteBuffer function there and called it instead of using the delete in the generated eventprovdotnetruntime.cpp, it would fix the issue. The way eventpipe did that would work as well, but it seems that the way I've suggested is kind of cleaner.

josalem commented 3 years ago

Thanks for the triage @janvorli. @am11 if you aren't planning to include that change in your PR, I'll go ahead and make it.

am11 commented 3 years ago

I am not sure how would the prospective DeleteBuffer look like, since we can neither use ClrFree (with corehost) nor delete (which is a placement operator in corerun). 😅

On unix, based on git grep 'new.*std.*nothrow' ':(top,icase,exclude)*win*', we have only used std::nothrow in two other places one of which uses 'std::unique_ptr' and the other one ArrayHolder (both are based on RAII; automatically destroy object when control goes out of scope): https://github.com/dotnet/runtime/blob/fa21e0bc7b579d74fdd7b48c600e7e4b5256ca7d/src/coreclr/debug/dbgutil/machoreader.cpp#L377 https://github.com/dotnet/runtime/blob/fa21e0bc7b579d74fdd7b48c600e7e4b5256ca7d/src/coreclr/gc/unix/events.cpp#L282

janvorli commented 3 years ago

@am11, I am not sure I understand your question. The new function code would be in src/coreclr/pal/src/eventprovider/lttngprovider/eventproviderhelpers.cpp, so it would use the same delete [] operator as the ResizeBuffer uses. The generated eventprovdotnetruntime.cpp would then call the DeleteBuffer instead of calling delete [] operator. Right now, one of the functions in the generated code looks like this:

extern "C" ULONG  FireEtXplatGCHeapStats(
    const unsigned __int64 GenerationSize0,
    const unsigned __int64 TotalPromotedSize0,
    const unsigned __int64 GenerationSize1,
    const unsigned __int64 TotalPromotedSize1,
    const unsigned __int64 GenerationSize2,
    const unsigned __int64 TotalPromotedSize2,
    const unsigned __int64 GenerationSize3,
    const unsigned __int64 TotalPromotedSize3,
    const unsigned __int64 FinalizationPromotedSize,
    const unsigned __int64 FinalizationPromotedCount,
    const unsigned int PinnedObjectCount,
    const unsigned int SinkBlockCount,
    const unsigned int GCHandleCount)
{
    if (!EventXplatEnabledGCHeapStats())
        return ERROR_SUCCESS;

    char stackBuffer[92];
    char *buffer = stackBuffer;
    size_t offset = 0;
    size_t size = 92;
    bool fixedBuffer = true;

    bool success = true;
    success &= WriteToBuffer(GenerationSize0, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(TotalPromotedSize0, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(GenerationSize1, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(TotalPromotedSize1, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(GenerationSize2, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(TotalPromotedSize2, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(GenerationSize3, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(TotalPromotedSize3, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(FinalizationPromotedSize, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(FinalizationPromotedCount, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(PinnedObjectCount, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(SinkBlockCount, buffer, offset, size, fixedBuffer);
    success &= WriteToBuffer(GCHandleCount, buffer, offset, size, fixedBuffer);

    if (!success)
    {
        if (!fixedBuffer)
            delete[] buffer;
        return ERROR_WRITE_FAULT;
    }

    do_tracepoint(DotNETRuntime, GCHeapStats, offset, buffer);

    if (!fixedBuffer)
        delete[] buffer;

    return ERROR_SUCCESS;
}

So instead of

    if (!fixedBuffer)
        delete[] buffer;

it would use

    if (!fixedBuffer)
        DeleteBuffer(buffer);
am11 commented 3 years ago

Something like this: https://github.com/am11/runtime/commit/f07b9474b6c10c934bbfdc53009309469d5e9230? we get an extra frame for DeleteBuffer, but it still ends up calling ClrFree with corerun:

$ DOTNET_EnableEventLog=1 lldb /runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun /tmp/helloworld/bin/release/net6.0/helloworld.dll 
(lldb) target create "/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun"
Current executable set to '/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun' (x86_64).
(lldb) settings set -- target.run-args  "/tmp/helloworld/bin/release/net6.0/helloworld.dll"
(lldb) r
Process 630 launched: '/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun' (x86_64)

Assert failure(PID 630 [0x00000276], Thread: 630 [0x0276]): !"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\n" "Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it."
    File: /runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp Line: 296
    Image: /runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun

Process 630 stopped
* thread #1, name = 'corerun', stop reason = signal SIGTRAP
    frame #0: 0x00007ffff6b52ac9 libcoreclr.so`DBG_DebugBreak at debugbreak.S:9
   6    
   7    LEAF_ENTRY DBG_DebugBreak, _TEXT
   8            int3
-> 9            ret
   10   LEAF_END_MARKED DBG_DebugBreak, _TEXT
   11   
(lldb) bt all
* thread #1, name = 'corerun', stop reason = signal SIGTRAP
  * frame #0: 0x00007ffff6b52ac9 libcoreclr.so`DBG_DebugBreak at debugbreak.S:9
    frame #1: 0x00007ffff6aa493b libcoreclr.so`::DebugBreak() at debug.cpp:406
    frame #2: 0x00007ffff6814cd8 libcoreclr.so`::DbgAssertDialog(szFile="/runtime/src/coreclr/utilcode/clrhost_nodependencies.cpp", iLine=296, szExpr="!\"Heap contamination detected! HeapFree was called on a heap other than the one that memory was allocated from.\\n\" \"Possible cause: you used new (executable) to allocate the memory, but didn't use DeleteExecutable() to free it.\"") at debug.cpp:697
    frame #3: 0x00007ffff67c7872 libcoreclr.so`ClrFree(p=0x0000555555628930) at clrhost_nodependencies.cpp:295
    frame #4: 0x00007ffff67c7625 libcoreclr.so`operator delete[](p=0x0000555555628940) at clrhost_nodependencies.cpp:410
    frame #5: 0x00007ffff6bddc67 libcoreclr.so`::DeleteBuffer(buffer=0x0000555555628940) at eventproviderhelpers.cpp:18
    frame #6: 0x00007ffff6bab6b8 libcoreclr.so`::FireEtXplatRuntimeInformationStart(ClrInstanceID=0, Sku=2, BclMajorVersion=6, BclMinorVersion=0, BclBuildNumber=0, BclQfeNumber=0, VMMajorVersion=42, VMMinorVersion=42, VMBuildNumber=42, VMQfeNumber=42424, StartupFlags=8388611, StartupMode='\0', CommandLine=u"", ComObjectGuid=0x00007ffff5faf640, RuntimeDllPath=u"/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so") at eventprovdotnetruntime.cpp:4905
    frame #7: 0x00007ffff6664e65 libcoreclr.so`FireEtwRuntimeInformationStart(ClrInstanceID=0, Sku=2, BclMajorVersion=6, BclMinorVersion=0, BclBuildNumber=0, BclQfeNumber=0, VMMajorVersion=42, VMMinorVersion=42, VMBuildNumber=42, VMQfeNumber=42424, StartupFlags=8388611, StartupMode='\0', CommandLine=u"", ComObjectGuid=0x00007ffff5faf640, RuntimeDllPath=u"/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/libcoreclr.so", ActivityId=0x0000000000000000, RelatedActivityId=0x0000000000000000) at clretwallmain.h:2811
    frame #8: 0x00007ffff66515be libcoreclr.so`ETW::InfoLog::RuntimeInformation(type=0) at eventtrace.cpp:5111
    frame #9: 0x00007ffff6a7418f libcoreclr.so`EEStartupHelper() at ceemain.cpp:767
    frame #10: 0x00007ffff6a75bf4 libcoreclr.so`EEStartup(this=0x00007fffffffda18, p=0x0000000000000000)::$_1::operator()(void*) const at ceemain.cpp:1153
    frame #11: 0x00007ffff6a72fcb libcoreclr.so`EEStartup() at ceemain.cpp:1155
    frame #12: 0x00007ffff6a72dd2 libcoreclr.so`EnsureEEStarted() at ceemain.cpp:321
    frame #13: 0x00007ffff621fc1d libcoreclr.so`CorHost2::Start(this=0x0000555555600e10) at corhost.cpp:101
    frame #14: 0x00007ffff6159006 libcoreclr.so`::coreclr_initialize(exePath="/runtime/artifacts/tests/coreclr/Linux.x64.Debug/Tests/Core_Root/corerun", appDomainFriendlyName="corerun", propertyCount=4, propertyKeys=0x000055555558a070, propertyValues=0x000055555558a0f0, hostHandle=0x0000555555566068, domainId=0x0000555555566070) at unixinterface.cpp:251
    frame #15: 0x000055555555f127 corerun`run(config=0x00007fffffffe3f0) at corerun.cpp:352
    frame #16: 0x000055555555c616 corerun`main(argc=2, argv=0x00007fffffffe568) at corerun.cpp:554
    frame #17: 0x00007ffff6ca1bf7 libc.so.6`__libc_start_main(main=(corerun`main at corerun.cpp:546), argc=2, argv=0x00007fffffffe568, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe558) at libc-start.c:310
    frame #18: 0x000055555555b02a corerun`_start + 42
  thread #2, name = 'corerun-ust'
    frame #0: 0x00007ffff79c2657 libpthread.so.0`__libc_recvmsg + 71
    frame #1: 0x00007ffff598bc23 liblttng-ust.so.0`ustcomm_recv_unix_sock + 179
    frame #2: 0x00007ffff5991241 liblttng-ust.so.0`___lldb_unnamed_symbol62$$liblttng-ust.so.0 + 2241
    frame #3: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #4: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #3, name = 'corerun-ust'
    frame #0: 0x00007ffff6d9b639 libc.so.6`syscall at syscall.S:38
    frame #1: 0x00007ffff5990bd3 liblttng-ust.so.0`___lldb_unnamed_symbol62$$liblttng-ust.so.0 + 595
    frame #2: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #3: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #4, name = 'corerun'
    frame #0: 0x00007ffff6d94cb9 libc.so.6`__GI___poll(fds=0x00007fffef374a20, nfds=1, timeout=-1) at poll.c:29
    frame #1: 0x00007ffff6b309c9 libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadBytesFromProcessPipe(this=0x00005555555ff630, iTimeout=-1, pRecvBuf="", iBytes=1) at synchmanager.cpp:2233
    frame #2: 0x00007ffff6b2fb88 libcoreclr.so`CorUnix::CPalSynchronizationManager::ReadCmdFromProcessPipe(this=0x00005555555ff630, iPollTimeout=-1, pswcWorkerCmd=0x00007fffef374de4, pshridMarshaledData=0x00007fffef374dd8, pdwData=0x00007fffef374dd4) at synchmanager.cpp:2011
    frame #3: 0x00007ffff6b2e736 libcoreclr.so`CorUnix::CPalSynchronizationManager::WorkerThread(pArg=0x00005555555ff630) at synchmanager.cpp:1714
    frame #4: 0x00007ffff6b4d885 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x0000555555600350) at thread.cpp:1862
    frame #5: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #6: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
  thread #5, name = 'corerun'
    frame #0: 0x00007ffff6d94cb9 libc.so.6`__GI___poll(fds=0x00007fff70008ee0, nfds=1, timeout=-1) at poll.c:29
    frame #1: 0x00007ffff68a4053 libcoreclr.so`ipc_poll_fds(fds=0x00007fff70008ee0, nfds=1, timeout=4294967295) at ds-ipc-pal-socket.c:485
    frame #2: 0x00007ffff68a3e44 libcoreclr.so`ds_ipc_poll(poll_handles_data=0x00007fffeeb73b60, poll_handles_data_len=1, timeout_ms=4294967295, callback=(libcoreclr.so`server_warning_callback(char const*, unsigned int) at ds-server.c:110))(char const*, unsigned int)) at ds-ipc-pal-socket.c:1082
    frame #3: 0x00007ffff6a51ceb libcoreclr.so`ds_ipc_stream_factory_get_next_available_stream(callback=(libcoreclr.so`server_warning_callback(char const*, unsigned int) at ds-server.c:110))(char const*, unsigned int)) at ds-ipc.c:395
    frame #4: 0x00007ffff6a4f117 libcoreclr.so`server_thread(data=0x0000000000000000) at ds-server.c:127
    frame #5: 0x00007ffff6b4d885 libcoreclr.so`CorUnix::CPalThread::ThreadEntry(pvParam=0x000055555561e340) at thread.cpp:1862
    frame #6: 0x00007ffff79b76db libpthread.so.0`start_thread + 219
    frame #7: 0x00007ffff6da171f libc.so.6`__GI___clone at clone.S:95
josalem commented 3 years ago

It looks like you changed the geneventpipe.py script as well. I think the geneventpipe.py script fixed this same issue by redefining ResizeBuffer on top of the PAL: https://github.com/dotnet/runtime/blob/e0472a1d165e706a38ed0e32b59e72bd97466aed/src/coreclr/scripts/genEventPipe.py#L430-L453

Does this still happen if you forward declare DeleteBuffer in genlttngprovider.py like ResizeBuffer is?

am11 commented 3 years ago

CMIIW, but from the backtrace, it is hitting our corerun-specific placement delete[] operator:

frame #4: 0x00007ffff67c1625 libcoreclr.so`operator delete[](p=0x0000555555628e60) at clrhost_nodependencies.cpp:410

which we don't want, right? It is hitting with main branch, with DeleteBuffer in pal.h as well as forward declaration of DeleteBuffer approach too (without declaring DeleBuffer in pal.h).

We can try defining the ResizeBuffer in genlttngprovider.py like it is done in geneventpipe.py without DeleteBuffer. I'll try that next.

am11 commented 3 years ago

We can try defining the ResizeBuffer in genlttngprovider.py like it is done in geneventpipe.py without DeleteBuffer. I'll try that next.

That didn't work either because geneventpipe.py and genlttngprovider.py seem to have different ways of using PAL. Former seem to generate source which builds as a runtime component (using typedefs from PAL BYTE* and friends). Latter is producing source which is building as part of PAL, where it is allowed to mix PAL and STL headers and they conflict in mysterious ways. I also tried commenting out placement operator in clrhost_nodependencies, which gave other runtime errors (free(): invalid size). I am not sure if I have fully understood the dynamics involved in this case.

if you aren't planning to include that change in your PR, I'll go ahead and make it.

Didn't mean to block; I gave it a fair try, but couldn't make it work. So please go ahead. 😅

josalem commented 3 years ago

Didn't mean to block; I gave it a fair try

Thanks for giving it a try! You did all the things I would have done as well. Seems there may be some extra complexity here. Since this is only happening in debug, I'll try to get it fixed in 6, but this is definitely a candidate for being bumped to 7.

@janvorli does @am11's experience trigger any top-of-mind thoughts for what might be going on?

josalem commented 3 years ago

pre-emptively moving this to 7. I'll hopefully be able to get this, but the debug only nature of it lowers the priority.

am11 commented 3 years ago

related, just noticed that compiling runtime with gcc-11 gives a lots of these errors on current main branch:

  [ 25%] Building CXX object pal/src/CMakeFiles/coreclrpal.dir/misc/sysinfo.cpp.o
  [ 25%] Building CXX object pal/src/CMakeFiles/coreclrpal.dir/misc/time.cpp.o
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp: In function 'ULONG EventPipeWriteEventGCPerHeapHistory(LPCGUID, LPCGUID)':
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:492:22: error: 'void operator delete [](void*)' called on unallocated object 'stackBuffer' [-Werror=free-nonheap-object]
    492 |             delete[] buffer;
        |                      ^~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:480:10: note: declared here
    480 |     BYTE stackBuffer[32];
        |          ^~~~~~~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:499:18: error: 'void operator delete [](void*)' called on unallocated object 'stackBuffer' [-Werror=free-nonheap-object]
    499 |         delete[] buffer;
        |                  ^~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:480:10: note: declared here
    480 |     BYTE stackBuffer[32];
        |          ^~~~~~~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp: In function 'ULONG EventPipeWriteEventEEStartupStart(LPCGUID, LPCGUID)':
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1753:22: error: 'void operator delete [](void*)' called on unallocated object 'stackBuffer' [-Werror=free-nonheap-object]
   1753 |             delete[] buffer;
        |                      ^~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1741:10: note: declared here
   1741 |     BYTE stackBuffer[32];
        |          ^~~~~~~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1760:18: error: 'void operator delete [](void*)' called on unallocated object 'stackBuffer' [-Werror=free-nonheap-object]
   1760 |         delete[] buffer;
        |                  ^~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1741:10: note: declared here
   1741 |     BYTE stackBuffer[32];
        |          ^~~~~~~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp: In function 'ULONG EventPipeWriteEventEEStartupEnd(LPCGUID, LPCGUID)':
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1826:22: error: 'void operator delete [](void*)' called on unallocated object 'stackBuffer' [-Werror=free-nonheap-object]
   1826 |             delete[] buffer;
        |                      ^~~~~~
  /runtime/artifacts/obj/coreclr/Linux.x64.Debug/vm/eventing/eventpipe/eventpipe/dotnetruntimeprivate.cpp:1814:10: note: declared here
   1814 |     BYTE stackBuffer[32];
        |          ^~~~~~~~~~~

gcc 10 and 9 or clang don't have this analyzer. @janvorli, this seems like a good error to keep un-suppressed.


and a slightly related warning (from same family of new warnings I think):

  [ 28%] Building CXX object vm/wks/CMakeFiles/cee_wks_core.dir/__/codeversion.cpp.o
  [ 28%] Building CXX object vm/wks/CMakeFiles/cee_wks_core.dir/__/contractimpl.cpp.o
  /runtime/src/coreclr/vm/ceeload.cpp: In member function 'void Module::InitializeDynamicILCrst()':
  /runtime/src/coreclr/vm/ceeload.cpp:1662:16: error: 'void operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
   1662 |         delete pCrst;
        |                ^~~~~
  /runtime/src/coreclr/vm/ceeload.cpp:1658:97: note: returned from 'static void* Crst::operator new(size_t)'
   1658 |     Crst * pCrst = new Crst(CrstDynamicIL, CrstFlags(CRST_UNSAFE_ANYMODE | CRST_DEBUGGER_THREAD));
        |                                                                                                 ^
  /runtime/src/coreclr/vm/ceeload.cpp:1658:97: error: 'void operator delete(void*)' called on pointer returned from a mismatched allocation function [-Werror=mismatched-new-delete]
  /runtime/src/coreclr/vm/ceeload.cpp:1658:97: note: returned from 'static void* Crst::operator new(size_t)'
  [ 28%] Building CXX object vm/wks/CMakeFiles/cee_wks_core.dir/__/corhost.cpp.o
am11 commented 3 years ago

related gcc 11 warnings have been disabled: https://github.com/dotnet/runtime/blob/790cd89065746c3e27e95d657d301217dbb61021/eng/native/configurecompiler.cmake#L402-L404

AaronRobinsonMSFT commented 3 years ago

@am11 I believe this is the same or similar to https://github.com/dotnet/runtime/issues/57577.

am11 commented 3 years ago

@AaronRobinsonMSFT, yup, these seem to be similar reports of incorrect pairing of {de}allocators in different places. The good news is that the three gcc 11 warnings are highlighting all places under coreclr on Linux, so we don't have to wait for the next mismatch to be identified by someone. I think if we un-suppress these three warnings and fix all the reported instances, it will fix both issues and make allocations usages robust.

am11 commented 3 years ago

@josalem, while going through these compiler warnings, I noticed that in dotnetruntimeprivate.cpp we have generated functions like:

extern "C" ULONG EventPipeWriteEventExecExeEnd(
    LPCGUID ActivityId,
    LPCGUID RelatedActivityId)
{
    if (!EventPipeEventEnabledExecExeEnd())
        return ERROR_SUCCESS;

    BYTE stackBuffer[32];
    BYTE *buffer = stackBuffer;
    size_t offset = 0;
    size_t size = 32;
    bool fixedBuffer = true;
    bool success = true;

    if (!success)
    {
        if (!fixedBuffer)
            delete[] buffer;
        return ERROR_WRITE_FAULT;
    }

    EventPipeAdapter::WriteEvent(EventPipeEventExecExeEnd, (BYTE *)buffer, (unsigned int)offset, ActivityId, RelatedActivityId);

    if (!fixedBuffer)
        delete[] buffer;

    return ERROR_SUCCESS;
}

In this case if (!success) and if (!fixedBuffer) are dead blocks and we seem to have a memory leak as we never delete the buffer (because there was no call to WriteToBuffer. I am not sure what is the bug, either:

josalem commented 3 years ago

I think I'm missing what is leaking. The buffer is stack allocated here, no? I would really hope the compiler would remove the dead code blocks and reduce this to just the call to WriteEvent.

You're right though, those appear to be dead code.

am11 commented 3 years ago

I think I'm missing what is leaking.

Ah right, then why delete[] buffer at all? it's UB..

hope the compiler would remove the dead code blocks and reduce this to just the call to WriteEvent.

Yes compiler has enough mileage that it will produce fairly highly optimized code, but during the compilation different level of analysis are warning about various issues especially in exported methods, symbols of which we explicitly want to keep (tautological mistakes, dead code, deletion of unallocated buffer (which in grand scheme of things, tautologically never true but still incorrect in its own scope)) etc. is something we should fix. We have just suppressed all such warnings from all supported compilers, rather than fixing the source generator genEventPipe.py and friends, which is likely how the real bugs slipped away.

josalem commented 3 years ago

100% agree. Based on the structure of the generated code, those if statements should be something we know the answer to at generation time.

Ah right, then why delete[] buffer at all? it's UB..

Yeah... I saw that when I was reading the snippet. Luckily, it shouldn't hit the delete[] due to the pre-baked if statements, but still, it's troubling.

I think this is all pointing to the need to go through the generation scripts and update them.

janvorli commented 3 years ago

I have looked into fixing the delete[] / new discrepancy. I believe the problem is caused by the fact that the new (std::nothrow) is an operator that we never define ourselves, thus the code always calls the one from the standard libraries. However, we do define the global delete[] operator in the VM. And since VM is linked statically with PAL, the linker links the code with our global delete[]. The fix is to use malloc / free instead. That doesn't have the problem I've mentioned. In fact, I've just remembered that I had to remove couple of usages of delete / new from PAL due to this kind of a problem in the past.

janvorli commented 3 years ago

See the PR where I've removed usages of new / delete here: https://github.com/dotnet/coreclr/pull/1822

janvorli commented 3 years ago

I would still add the DeleteBuffer function though. I have also noticed that for some reason, the genEventPipe.py uses BYTE and the other char for the buffer. It seems it would make sense to unify them. And then there will be no need for the extra ResizeBuffer implementation in the genEventPipe.py generated code. It could use the same one from the eventproviderhelpers.cpp.

josalem commented 3 years ago

new (std::nothrow) is an operator that we never define ourselves

Is this a conscious decision or has it just never been needed before?

When I search for it in the runtime, I see lots of uses of new (nothrow). They seem to mostly be in the DAC and GC, but there are a few in VM as well.

janvorli commented 3 years ago

That is not std::nothrow, but rather our own "nothrow" constant defined here: https://github.com/dotnet/runtime/blob/c496700368804a6aae45ca98ecd846ba2aafd8eb/src/coreclr/utilcode/clrhost_nodependencies.cpp#L216 and declared here: https://github.com/dotnet/runtime/blob/c496700368804a6aae45ca98ecd846ba2aafd8eb/src/coreclr/inc/new.hpp#L17

As such, it is not available in the PAL.

davmason commented 2 years ago

@am11 @janvorli I am going through bugs, I don't see that a fix (at least not the one discussed here) was made, but when I attempt to recreate I do not see the same same error. I am setting DOTNET_EnableEventLog=1 and running a random C# app with a debug corerun.

Is there something I'm missing?

am11 commented 2 years ago

@davmason, thanks for testing it, great news! I forgot about this issue. :)

On linux-x64 machine, I cannot reproduce it with latest main branch. I quickly tested it with https://github.com/dotnet/runtime/commit/5bef921d41e1dc6832c61d2d6e7cae13dc35a2db (from top post; the state of main, when I originally found it) and reproduced on the same machine.

It is very possible that it has been fixed during the SDL compliance work by @AaronRobinsonMSFT and others (https://github.com/dotnet/runtime/issues/66154).

Closing.