Crash when shutting down without disposing

ah- commented 8 years ago

See https://github.com/ah-/rdkafka-dotnet/pull/49

Running

private static void Runner()
        {
            const int tries = 10;
            for (var i = 0; i < tries; ++i)
            {
                var producer = new Producer(new Config
                {
                    ["compression.codec"] = "snappy",
                    ["heartbeat.interval.ms"] = "10000",
                    ["session.timeout.ms"] = "30000",
                    ["topic.metadata.refresh.sparse"] = "true"
                }, broker);
                var topic = producer.Topic("Abc", new TopicConfig
                {
                    ["request.required.acks"] = "-1",
                    ["message.timeout.ms"] = "60000"
                });
                Parallel.For(0, Count, _ => topic.Produce(null));
                Console.WriteLine($"Done: {i}.");
            }
            Console.WriteLine("Done.");
        }

sometimes crashes with "ntdll!NtWaitForSingleObject" and "A heap has been corrupted."

ah- commented 8 years ago

What do you think about the following two hunches?

When SafeKafkaHandle.ReleaseHandle is called via the SafeHandle finalizer, these can be called in random order. Hence a producer can be destroyed before a topic. Basically, given that finalization order is non deterministic, is it possible that it violates proper librdkafka termination? Finalizers are allowed to run for no more than 2 seconds. If destroying any handle takes more, can this cause any issues? If the above are risks, then i think we do need to explicitly Dispose. What do you think?

The first one sounds like the most likely explanation. Due to GC being random we might call librdkafka cleanup functions in the wrong order, which then causes the crash. Alternatively it might be a genuine librdkafka crash, I think we could find out by doing the same thing in C without the C# wrapper and see if it crashes there as well.

The C# wrapper objects should keep references that try to ensure that librdkafka is cleaned up in the right order. E.g. https://github.com/ah-/rdkafka-dotnet/blob/master/src/RdKafka/Internal/SafeKafkaHandle.cs#L97 keeps a reference to the Producer, which should ensure that the Producer only gets destroyed after all related Topics are gone.

If you have time to debug, I'd start by putting logging into the SafeKafkaHandle and so on and see if they do get called in the wrong order.

MaximGurschi commented 8 years ago

Good idea to create a separate issue.

Is the producer going to wait for outstanding requests to be transmitted and handled? I am not sure RdKafka.Handle.Dispose gets called during finalizations.

This page suggests that one should first wait for outstanding requests: https://github.com/edenhill/librdkafka/wiki/Proper-termination-sequence

ah- commented 8 years ago

That must be it! Handle should implement the full Dispose() pattern (https://msdn.microsoft.com/en-us/library/b1yfkh5e(v=vs.110).aspx) then I think this should stop happening.

Although, reading the docs it seems like it shouldn't crash if you don't wait, so maybe that's a librdkafka issue.

MaximGurschi commented 8 years ago

If i modify the sample above to keep references to the topic and producers and Disposing them after the outer loop (i accumulate two lists) then the application does not seem to crash (so far).

Will you be updating the library to do additional cleanup?

Update: if alternatively i accumulate the tasks returned by produce and wait on all of them before the topic+producer reference goes out of scope, it still crashes (eventually).

MaximGurschi commented 8 years ago

The other thing, is that i cannot reproduce the crash if i do not call Produce in Parallel.

Correction - just done that now. See next post.

MaximGurschi commented 8 years ago

If i add some logging to SafeTopicHandle.ReleaseHandle and SafeKafkaHandle.ReleaseHandle and actually run everything single threaded (so in the above sample code, replace Parallel.For(0, Count, _ => topic.Produce(null)) with for (var i=0;i<Count;++i) topic.Produce(null);) i see the following output before it crashes:

SafeKafkaHandle.ReleaseHandle does:

                    var n = Counter.Next();
                    Console.WriteLine($"\t\t{n} Enter.ReleaseHandle.Producer");
                    LibRdKafka.destroy(handle);
                    Console.WriteLine($"\t\t{n} Exit.ReleaseHandle.Producer");

SafeTopicHandle.ReleaseHandle does:

                    var n = Counter.Next();
                    Console.WriteLine($"\t\t\t{n} Enter.ReleaseHandle.SafeTopicHandle");
                    LibRdKafka.topic_destroy(handle);
                    Console.WriteLine($"\t\t\t{n} Middle.ReleaseHandle.SafeTopicHandle");
                    // See SafeKafkaHandle.Topic
                    kafkaHandle.DangerousRelease();
                    Console.WriteLine($"\t\t\t{n} Exit.ReleaseHandle.SafeTopicHandle");

static class Counter { private static int _next; public static int Next() => ++_next; }

The last producer to be freed never writes "Exit.ReleaseHandle.Producer". Does this look like a RdKafka issue?

Done: 0.
                        1 Enter.ReleaseHandle.SafeTopicHandle
                        1 Middle.ReleaseHandle.SafeTopicHandle
                        1 Exit.ReleaseHandle.SafeTopicHandle
Done: 1.
                        2 Enter.ReleaseHandle.SafeTopicHandle
                        2 Middle.ReleaseHandle.SafeTopicHandle
                        2 Exit.ReleaseHandle.SafeTopicHandle
Done: 2.
                        3 Enter.ReleaseHandle.SafeTopicHandle
                        3 Middle.ReleaseHandle.SafeTopicHandle
                        3 Exit.ReleaseHandle.SafeTopicHandle
Done: 3.
                        4 Enter.ReleaseHandle.SafeTopicHandle
                        4 Middle.ReleaseHandle.SafeTopicHandle
                        4 Exit.ReleaseHandle.SafeTopicHandle
Done: 4.
                        5 Enter.ReleaseHandle.SafeTopicHandle
                        5 Middle.ReleaseHandle.SafeTopicHandle
                        5 Exit.ReleaseHandle.SafeTopicHandle
Done: 5.
                        6 Enter.ReleaseHandle.SafeTopicHandle
                        6 Middle.ReleaseHandle.SafeTopicHandle
                        6 Exit.ReleaseHandle.SafeTopicHandle
Done: 6.
Done: 7.
                        7 Enter.ReleaseHandle.SafeTopicHandle
                        7 Middle.ReleaseHandle.SafeTopicHandle
                        7 Exit.ReleaseHandle.SafeTopicHandle
                        8 Enter.ReleaseHandle.SafeTopicHandle
                        8 Middle.ReleaseHandle.SafeTopicHandle
                        8 Exit.ReleaseHandle.SafeTopicHandle
Done: 8.
                        9 Enter.ReleaseHandle.SafeTopicHandle
                        9 Middle.ReleaseHandle.SafeTopicHandle
                        9 Exit.ReleaseHandle.SafeTopicHandle
Done: 9.
Done.
                10 Enter.ReleaseHandle.Producer
                10 Exit.ReleaseHandle.Producer
                11 Enter.ReleaseHandle.Producer
                11 Exit.ReleaseHandle.Producer
                12 Enter.ReleaseHandle.Producer
                12 Exit.ReleaseHandle.Producer
                13 Enter.ReleaseHandle.Producer
                13 Exit.ReleaseHandle.Producer
                14 Enter.ReleaseHandle.Producer
                14 Exit.ReleaseHandle.Producer
                15 Enter.ReleaseHandle.Producer
                15 Exit.ReleaseHandle.Producer
                16 Enter.ReleaseHandle.Producer
                16 Exit.ReleaseHandle.Producer
                17 Enter.ReleaseHandle.Producer
                17 Exit.ReleaseHandle.Producer
                18 Enter.ReleaseHandle.Producer
                18 Exit.ReleaseHandle.Producer
                        19 Enter.ReleaseHandle.SafeTopicHandle
                        19 Middle.ReleaseHandle.SafeTopicHandle
                20 Enter.ReleaseHandle.Producer

MaximGurschi commented 8 years ago

Notice how there is no matching "19 ExitReleaseHandle.SafeTopicHandle" and "20 Exit.ReleaseHandle.Producer"

MaximGurschi commented 8 years ago

Do you think that Handle.StartCallbackTask can be doing a poll while the handle is finalized?

1.If I comment out the poll from the task then it doesn't crash. 2.Alternatively, if i just stop the poll task before letting the application exit (I added a new function called Stop) then it doesn't crash.

Unfortunately i cannot prove that poll is being called while finalized.

Eden does say that you should not call any other API while destroying: https://github.com/edenhill/librdkafka/issues/725

ah- commented 8 years ago

Thanks for all the debugging, this was really helpful!

I think this is fixed now in https://github.com/ah-/rdkafka-dotnet/commit/a0c53430a018753ef69c535ec5d126bbfc497bc3, by cancelling the polling task in the finalizer, which should guarantee it doesn't poll after shutting down the librdkafka objects.

I had a bit of a hard time reproducing the issue reliably, could you check if you still see the issue?

MaximGurschi commented 8 years ago

Thank you. Running it now looks like it doesn't crash. But then this happened :/

System.Runtime.InteropServices.SEHException: External component has thrown an exception. at RdKafka.Internal.LibRdKafka.NativeMethods.rd_kafka_poll(IntPtr rk, IntPtr timeout_ms) at RdKafka.Handle.<>c__DisplayClass27_0.b__0() at System.Threading.Tasks.Task.Execute() --- End of inner exception stack trace --- at System.Threading.Tasks.Task.ThrowIfExceptional(Boolean includeTaskCanceledExceptions) at System.Threading.Tasks.Task.Wait(Int32 millisecondsTimeout, CancellationToken cancellationToken) at RdKafka.Handle.Dispose(Boolean disposing) at RdKafka.Handle.Finalize()

Not sure if this is RdKafka or librdkafka. Is it completely safe to call the CTS and task from the finalizer? Though non of the classes have finalizers, can Microsoft change that? Also maybe it is a good idea to Dispose the callbackTask and callbackCts in Dispose(bool)?

MaximGurschi commented 8 years ago

Might be something transient as it is running for a couple of hours now and all is fine.

The test is: 1.Compile the above sample to an exe. 2.A batch file runs the exe, waits for it to finish and runs another one in an infinite loop.

MaximGurschi commented 8 years ago

It has been running fine for a couple of hours now and that exception has yet to re-occur. Maybe the error above was a fluke?

What do you think about disposing the two instances of CTS and task? Also Microsoft probably wont add finalizers after the fact (as that would screw all existing finalizers).

MaximGurschi commented 8 years ago

I think the error was not a fluke. I believe that the finalization order was at play here. The Handle finalizer can run after the finalizers for SafeTopicHandle and SafeKafkaHanle.

A->B - indicates that A has a reference to B. {A,B} - indicates both A and B

The following object graph can be determined: Producer->~Handle->{CTS,Task,SafeKafkaHanle} (continued) {CTS,Task}->SafeKafkaHanle (continued) SafeKafkaHanle->SafeHandleZeroIsInvalid->~SafeHandle

Topic->SafeTopicHandle (continued) {Topic,SafeTopicHandle}->SafeKafkaHanle

Looks like the finalizer(~) for Handle can run after the finalizer for SafeHandle. If that happens then the callbackTask can execute one more time and access SafeKafkaHanle.

ah- commented 8 years ago

You're right about the issue, it's certainly real and has most likely to do with cleanup order.

I think it's because there's no guaranteed order in which the finalizers run. Using DangerousAddRef to express the dependency would be one possible solution, but I'm not yet sure if that's the best one.

I have somewhat limited time to spend on this right now, but I'll look into it as soon as I can. On what platform are you seeing the issue? .net framework (not core) on Windows?

MaximGurschi commented 8 years ago

I'm running this on Windows 7. .NET 4.6 is installed but i compile it for 4.5.2.

Yes unless all unmanaged handles are aggregated under one class (Handle, SafeTopicHandle, SafeKafkaHandle) then DangerousAddRef is a solution. In principle the code can be refactored under the hood such that there is only one owner of SafeTopicHandle[] and SafeKafkaHandle. The this type could do cleanup. But i wonder if for a large number of SafeTopicHandles the finalize will take more than 2 seconds (in which case .NET will not like that).

How about a rough solution based on this:

Drop the Handle destructor

Add this first line to Handle.Dispose(bool)

    //Check for null in case Init threw.
    if (callbackTask==null || Interlocked.CompareExchange(ref stoppedCallbackTask, 1, 0) == 1) return;

Handle.Init does:

   callbackTask = StartCallbackTask(callbackCts.Token);
   handle.stopCallbackTask = () => Dispose(false);

SafeKafkaHandle.Topic does:

   topicHandle.kafkaHandle = this;
   topicHandle.stopCallbackTask = stopCallbackTask;

ReleaseHandle of both SafeKafkaHandle and SafeKafkaHandle call stopCallbackTask?.Invoke() first.

ah- / rdkafka-dotnet

Crash when shutting down without disposing #50