Open ah- opened 8 years ago
What do you think about the following two hunches?
When SafeKafkaHandle.ReleaseHandle is called via the SafeHandle finalizer, these can be called in random order. Hence a producer can be destroyed before a topic. Basically, given that finalization order is non deterministic, is it possible that it violates proper librdkafka termination? Finalizers are allowed to run for no more than 2 seconds. If destroying any handle takes more, can this cause any issues? If the above are risks, then i think we do need to explicitly Dispose. What do you think?
The first one sounds like the most likely explanation. Due to GC being random we might call librdkafka cleanup functions in the wrong order, which then causes the crash. Alternatively it might be a genuine librdkafka crash, I think we could find out by doing the same thing in C without the C# wrapper and see if it crashes there as well.
The C# wrapper objects should keep references that try to ensure that librdkafka is cleaned up in the right order. E.g. https://github.com/ah-/rdkafka-dotnet/blob/master/src/RdKafka/Internal/SafeKafkaHandle.cs#L97 keeps a reference to the Producer, which should ensure that the Producer only gets destroyed after all related Topics are gone.
If you have time to debug, I'd start by putting logging into the SafeKafkaHandle and so on and see if they do get called in the wrong order.
Good idea to create a separate issue.
Is the producer going to wait for outstanding requests to be transmitted and handled? I am not sure RdKafka.Handle.Dispose gets called during finalizations.
This page suggests that one should first wait for outstanding requests: https://github.com/edenhill/librdkafka/wiki/Proper-termination-sequence
That must be it! Handle should implement the full Dispose() pattern (https://msdn.microsoft.com/en-us/library/b1yfkh5e(v=vs.110).aspx) then I think this should stop happening.
Although, reading the docs it seems like it shouldn't crash if you don't wait, so maybe that's a librdkafka issue.
If i modify the sample above to keep references to the topic and producers and Disposing them after the outer loop (i accumulate two lists) then the application does not seem to crash (so far).
Will you be updating the library to do additional cleanup?
Update: if alternatively i accumulate the tasks returned by produce and wait on all of them before the topic+producer reference goes out of scope, it still crashes (eventually).
The other thing, is that i cannot reproduce the crash if i do not call Produce in Parallel.
Correction - just done that now. See next post.
If i add some logging to SafeTopicHandle.ReleaseHandle and SafeKafkaHandle.ReleaseHandle and actually run everything single threaded (so in the above sample code, replace Parallel.For(0, Count, _ => topic.Produce(null)) with for (var i=0;i<Count;++i) topic.Produce(null);) i see the following output before it crashes:
SafeKafkaHandle.ReleaseHandle does:
var n = Counter.Next();
Console.WriteLine($"\t\t{n} Enter.ReleaseHandle.Producer");
LibRdKafka.destroy(handle);
Console.WriteLine($"\t\t{n} Exit.ReleaseHandle.Producer");
SafeTopicHandle.ReleaseHandle does:
var n = Counter.Next();
Console.WriteLine($"\t\t\t{n} Enter.ReleaseHandle.SafeTopicHandle");
LibRdKafka.topic_destroy(handle);
Console.WriteLine($"\t\t\t{n} Middle.ReleaseHandle.SafeTopicHandle");
// See SafeKafkaHandle.Topic
kafkaHandle.DangerousRelease();
Console.WriteLine($"\t\t\t{n} Exit.ReleaseHandle.SafeTopicHandle");
static class Counter { private static int _next; public static int Next() => ++_next; }
The last producer to be freed never writes "Exit.ReleaseHandle.Producer". Does this look like a RdKafka issue?
Done: 0.
1 Enter.ReleaseHandle.SafeTopicHandle
1 Middle.ReleaseHandle.SafeTopicHandle
1 Exit.ReleaseHandle.SafeTopicHandle
Done: 1.
2 Enter.ReleaseHandle.SafeTopicHandle
2 Middle.ReleaseHandle.SafeTopicHandle
2 Exit.ReleaseHandle.SafeTopicHandle
Done: 2.
3 Enter.ReleaseHandle.SafeTopicHandle
3 Middle.ReleaseHandle.SafeTopicHandle
3 Exit.ReleaseHandle.SafeTopicHandle
Done: 3.
4 Enter.ReleaseHandle.SafeTopicHandle
4 Middle.ReleaseHandle.SafeTopicHandle
4 Exit.ReleaseHandle.SafeTopicHandle
Done: 4.
5 Enter.ReleaseHandle.SafeTopicHandle
5 Middle.ReleaseHandle.SafeTopicHandle
5 Exit.ReleaseHandle.SafeTopicHandle
Done: 5.
6 Enter.ReleaseHandle.SafeTopicHandle
6 Middle.ReleaseHandle.SafeTopicHandle
6 Exit.ReleaseHandle.SafeTopicHandle
Done: 6.
Done: 7.
7 Enter.ReleaseHandle.SafeTopicHandle
7 Middle.ReleaseHandle.SafeTopicHandle
7 Exit.ReleaseHandle.SafeTopicHandle
8 Enter.ReleaseHandle.SafeTopicHandle
8 Middle.ReleaseHandle.SafeTopicHandle
8 Exit.ReleaseHandle.SafeTopicHandle
Done: 8.
9 Enter.ReleaseHandle.SafeTopicHandle
9 Middle.ReleaseHandle.SafeTopicHandle
9 Exit.ReleaseHandle.SafeTopicHandle
Done: 9.
Done.
10 Enter.ReleaseHandle.Producer
10 Exit.ReleaseHandle.Producer
11 Enter.ReleaseHandle.Producer
11 Exit.ReleaseHandle.Producer
12 Enter.ReleaseHandle.Producer
12 Exit.ReleaseHandle.Producer
13 Enter.ReleaseHandle.Producer
13 Exit.ReleaseHandle.Producer
14 Enter.ReleaseHandle.Producer
14 Exit.ReleaseHandle.Producer
15 Enter.ReleaseHandle.Producer
15 Exit.ReleaseHandle.Producer
16 Enter.ReleaseHandle.Producer
16 Exit.ReleaseHandle.Producer
17 Enter.ReleaseHandle.Producer
17 Exit.ReleaseHandle.Producer
18 Enter.ReleaseHandle.Producer
18 Exit.ReleaseHandle.Producer
19 Enter.ReleaseHandle.SafeTopicHandle
19 Middle.ReleaseHandle.SafeTopicHandle
20 Enter.ReleaseHandle.Producer
Notice how there is no matching "19 ExitReleaseHandle.SafeTopicHandle" and "20 Exit.ReleaseHandle.Producer"
Do you think that Handle.StartCallbackTask can be doing a poll while the handle is finalized?
1.If I comment out the poll from the task then it doesn't crash. 2.Alternatively, if i just stop the poll task before letting the application exit (I added a new function called Stop) then it doesn't crash.
Unfortunately i cannot prove that poll is being called while finalized.
Eden does say that you should not call any other API while destroying: https://github.com/edenhill/librdkafka/issues/725
Thanks for all the debugging, this was really helpful!
I think this is fixed now in https://github.com/ah-/rdkafka-dotnet/commit/a0c53430a018753ef69c535ec5d126bbfc497bc3, by cancelling the polling task in the finalizer, which should guarantee it doesn't poll after shutting down the librdkafka objects.
I had a bit of a hard time reproducing the issue reliably, could you check if you still see the issue?
Thank you. Running it now looks like it doesn't crash. But then this happened :/
System.Runtime.InteropServices.SEHException: External component has thrown an exception.
at RdKafka.Internal.LibRdKafka.NativeMethods.rd_kafka_poll(IntPtr rk, IntPtr timeout_ms)
at RdKafka.Handle.<>c__DisplayClass27_0.
Not sure if this is RdKafka or librdkafka. Is it completely safe to call the CTS and task from the finalizer? Though non of the classes have finalizers, can Microsoft change that? Also maybe it is a good idea to Dispose the callbackTask and callbackCts in Dispose(bool)?
Might be something transient as it is running for a couple of hours now and all is fine.
The test is: 1.Compile the above sample to an exe. 2.A batch file runs the exe, waits for it to finish and runs another one in an infinite loop.
It has been running fine for a couple of hours now and that exception has yet to re-occur. Maybe the error above was a fluke?
What do you think about disposing the two instances of CTS and task? Also Microsoft probably wont add finalizers after the fact (as that would screw all existing finalizers).
I think the error was not a fluke. I believe that the finalization order was at play here. The Handle finalizer can run after the finalizers for SafeTopicHandle and SafeKafkaHanle.
A->B - indicates that A has a reference to B. {A,B} - indicates both A and B
The following object graph can be determined: Producer->~Handle->{CTS,Task,SafeKafkaHanle} (continued) {CTS,Task}->SafeKafkaHanle (continued) SafeKafkaHanle->SafeHandleZeroIsInvalid->~SafeHandle
Topic->SafeTopicHandle (continued) {Topic,SafeTopicHandle}->SafeKafkaHanle
Looks like the finalizer(~) for Handle can run after the finalizer for SafeHandle. If that happens then the callbackTask can execute one more time and access SafeKafkaHanle.
You're right about the issue, it's certainly real and has most likely to do with cleanup order.
I think it's because there's no guaranteed order in which the finalizers run. Using DangerousAddRef to express the dependency would be one possible solution, but I'm not yet sure if that's the best one.
I have somewhat limited time to spend on this right now, but I'll look into it as soon as I can. On what platform are you seeing the issue? .net framework (not core) on Windows?
I'm running this on Windows 7. .NET 4.6 is installed but i compile it for 4.5.2.
Yes unless all unmanaged handles are aggregated under one class (Handle, SafeTopicHandle, SafeKafkaHandle) then DangerousAddRef is a solution. In principle the code can be refactored under the hood such that there is only one owner of SafeTopicHandle[] and SafeKafkaHandle. The this type could do cleanup. But i wonder if for a large number of SafeTopicHandles the finalize will take more than 2 seconds (in which case .NET will not like that).
How about a rough solution based on this:
Add this first line to Handle.Dispose(bool)
//Check for null in case Init threw.
if (callbackTask==null || Interlocked.CompareExchange(ref stoppedCallbackTask, 1, 0) == 1) return;
Handle.Init does:
callbackTask = StartCallbackTask(callbackCts.Token);
handle.stopCallbackTask = () => Dispose(false);
SafeKafkaHandle.Topic does:
topicHandle.kafkaHandle = this;
topicHandle.stopCallbackTask = stopCallbackTask;
See https://github.com/ah-/rdkafka-dotnet/pull/49
Running
sometimes crashes with "ntdll!NtWaitForSingleObject" and "A heap has been corrupted."