AaronRobinsonMSFT / DNNE

Prototype native exports for a .NET Assembly.
MIT License

Parallel app calls into .NET via DNNE can stall during DLL load on first calls #119

Closed. mkrmasch closed this issue 2 years ago.

mkrmasch commented 2 years ago

I have an application that makes parallel calls into the DNNE/.NET unmanaged/managed stack, and I sometimes see the app stall and never return from an initial call into the C API. In the debugger I can at least see that the problem lies outside the managed code. I can also work around the issue by avoiding parallel calls the first time, which suggests a race condition when the DLL is loaded into memory while concurrent calls are in flight.

Does anyone have an idea whether this could be avoided in the unmanaged layer already?
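To illustrate the kind of unmanaged-layer guard this question is asking about, here is a minimal sketch of thread-safe lazy initialization with POSIX `pthread_once`: when several threads make their first call at the same time, exactly one runs the loader and the others wait until it has finished. This is an illustration under stated assumptions, not DNNE's code: `load_runtime_once`, `ensure_runtime_loaded`, `hypothetical_export`, and `simulate_parallel_first_calls` are hypothetical names, the "runtime load" is a stub that only counts invocations, and the thread does not say how the 1.0.31 fix was actually implemented. On Windows, `InitOnceExecuteOnce` plays the same role as `pthread_once`.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 64

/* Counts how many times the simulated runtime load actually ran. */
static atomic_int g_load_count = 0;
static pthread_once_t g_runtime_once = PTHREAD_ONCE_INIT;

/* Hypothetical stand-in for the expensive, not-thread-safe work of
   loading the .NET runtime on first use. Not a DNNE API. */
static void load_runtime_once(void)
{
    atomic_fetch_add(&g_load_count, 1);
}

/* Every export would call this first. pthread_once runs the loader on
   exactly one thread; concurrent callers block until it has returned. */
static void ensure_runtime_loaded(void)
{
    pthread_once(&g_runtime_once, load_runtime_once);
}

/* Hypothetical native export; the call into managed code would follow
   the guard. */
static void *hypothetical_export(void *unused)
{
    (void)unused;
    ensure_runtime_loaded();
    return NULL;
}

/* Starts up to MAX_THREADS threads that all make their first call at
   roughly the same time, then returns how often the loader ran. */
int simulate_parallel_first_calls(int nthreads)
{
    pthread_t threads[MAX_THREADS];
    int created = 0;

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    for (int i = 0; i < nthreads; ++i) {
        if (pthread_create(&threads[i], NULL, hypothetical_export, NULL) != 0)
            break;
        ++created;
    }
    for (int i = 0; i < created; ++i)
        pthread_join(threads[i], NULL);
    return atomic_load(&g_load_count);
}
```

Calling `simulate_parallel_first_calls(8)` should report that the stub loader ran exactly once, however many threads raced on the first call, and later calls leave the count unchanged.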

AaronRobinsonMSFT commented 2 years ago

@mkrmasch What version of DNNE are you using? This was discovered recently and fixed in https://github.com/AaronRobinsonMSFT/DNNE/pull/112 (1.0.30).

mkrmasch commented 2 years ago

We are actually using 1.0.30 and still hit this issue.

AaronRobinsonMSFT commented 2 years ago

@mkrmasch Well that is unfortunate. Can you please share the platform details you are using (OS, CPU) and the stacktrace captured in the debugger?

AaronRobinsonMSFT commented 2 years ago

> Does anyone have an idea if it would be possible to avoid that in the unmanaged layer already?

In the meantime, the CLR can be preloaded manually using the preload functions declared in dnne.h:

https://github.com/AaronRobinsonMSFT/DNNE/blob/9395005537941423963d788eb18d6cdf313c0a43/src/platform/dnne.h#L92-L104
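A minimal sketch of the manual-preload workaround, assuming the pattern is: call the preload function once on a single thread, check its result, and only then start the parallel callers. `stub_try_preload_runtime` is a hypothetical stand-in for `try_preload_runtime()` so that the example compiles without a DNNE-generated binary, `worker` and `run_workers_after_preload` are hypothetical names, and treating `0` as success is an assumption; check the return-value convention documented in `dnne.h`. Note that later in this thread the reporter found that preloading still hung under parallel calls on 1.0.30; the issue was resolved by upgrading to 1.0.31.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>

#define MAX_THREADS 64

static atomic_int g_preload_calls = 0;
static atomic_int g_export_calls = 0;

/* Hypothetical stand-in for try_preload_runtime() from dnne.h. A real
   application would link the DNNE-generated binary and call the real
   function; this stub only records that it ran. */
static int stub_try_preload_runtime(void)
{
    atomic_fetch_add(&g_preload_calls, 1);
    return 0; /* assumed: 0 means success */
}

/* Hypothetical worker that would call an exported function. */
static void *worker(void *unused)
{
    (void)unused;
    atomic_fetch_add(&g_export_calls, 1);
    return NULL;
}

/* Loads the runtime once on the calling thread and only then fans out
   to parallel callers. Returns the number of workers that ran, or -1
   if the preload failed. */
int run_workers_after_preload(int nthreads)
{
    pthread_t threads[MAX_THREADS];
    int created = 0;

    if (stub_try_preload_runtime() != 0)
        return -1; /* do not start parallel calls without a runtime */

    if (nthreads > MAX_THREADS)
        nthreads = MAX_THREADS;
    for (int i = 0; i < nthreads; ++i) {
        if (pthread_create(&threads[i], NULL, worker, NULL) != 0)
            break;
        ++created;
    }
    for (int i = 0; i < created; ++i)
        pthread_join(threads[i], NULL);
    return created;
}
```

The design choice is to move the one-time cost onto a single thread before any concurrency starts, so no export is ever the first call racing another first call.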

mkrmasch commented 2 years ago

Hey, it is actually hard to provide a useful stack trace: the app hangs inside the create call, and when I break in the debugger this is all I get: [screenshot of the debugger call stack, not preserved]

I tried to use DNNE_CALLTYPE try_preload_runtime(), but it is somewhat unstable and also causes the app to hang if I make parallel calls.

AaronRobinsonMSFT commented 2 years ago

@mkrmasch I am going to assume that mgcore_SH_19_0.dll is the binary generated by DNNE. However, I don't believe there is any call to those APIs in DNNE-generated code. This is an odd issue if that is the stack being observed. The PDB for the generated binary should be copied to the output, so can you use it to confirm that version 1.0.30 is being used? Perhaps this is an assert and that is how it is implemented on Win32, though that would surprise me.

AaronRobinsonMSFT commented 2 years ago

I will try to reproduce this locally today.

AaronRobinsonMSFT commented 2 years ago

@mkrmasch Yep, I found it. Thanks for reporting this. I will put out a new package shortly.

AaronRobinsonMSFT commented 2 years ago

@mkrmasch Package 1.0.31 has been published https://www.nuget.org/packages/DNNE/1.0.31. Thank you for reporting this issue.

mkrmasch commented 2 years ago

Hey @AaronRobinsonMSFT , that worked, thank you so much!