MicrosoftResearch / Naiad

The Naiad system provides fast incremental and iterative computation for data-parallel workloads
http://microsoftresearch.github.io/Naiad/
Apache License 2.0
515 stars 88 forks source link

Exception in auto-generated serialization code #20

Open utaal opened 9 years ago

utaal commented 9 years ago

Hi,

The following exception occasionally takes down our computation. We don't really know how to reproduce it in isolation as it happens only when running on large input datasets and after a few minutes of execution. I am not yet sure that our computation is fully deterministic when run multiple times on the same input (we're verifying this now) so the fact that this happens occasionally may be due to non-deterministic behaviour in our code. "ParsedInputPono" (renamed) is a plain-old-NET-object with a few public fields (of type long, int, string) and a constructor that only copies the arguments to the respective fields. We're running a debug build on linux with mono (I'll try to rerun with --debug to get line numbers) on about 16 machines with 4 workers each (-t 4). Can you help us out?

Thank you.

00:07:58.2570129, Graph 0 failed on scheduler 3 with exception:
System.NullReferenceException: Object reference not set to an instance of an object
  at Microsoft.Research.Naiad.Serialization.AutoGenerated.ParsedInputPono.TrySerializeMany (Microsoft.Research.Naiad.Serialization.SubArray`1& destination, ArraySegment`1 values) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Serialization.SendBufferPage.WriteElements[ParsedInputPono] (NaiadSerialization`1 serializer, ArraySegment`1 elements) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Serialization.AutoSerializedMessageEncoder`2[ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].WriteElements (ArraySegment`1 records, Int32 srcVertexId) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Serialization.AutoSerializedMessageEncoder`2[ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].Write (ArraySegment`1 records, Int32 srcVertexId) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.Channels.RemoteMailbox`2[ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].Send (Message`2 message, ReturnAddress from) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.Channels.BufferingPostbox`2[ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].FlushBuffer (Int32 index) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.Channels.BufferingPostbox`2[ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].InternalSend (Message`2 records) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.Channels.BufferingPostbox`2[ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].Send (Message`2 records) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.VertexOutputBuffer`2[ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].Send (Message`2 message) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.VertexOutputBufferPerTime`2[ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].SendBuffer () [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.VertexOutputBufferPerTime`2[ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].Send (ParsedInputPono record) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Frameworks.Lindi.ExtensionMethods+SelectManyVertex`3[System.String,ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].OnReceive (Message`2 message) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.StandardVertices.UnaryVertex`3[System.String,ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch].<MakeStage>m__0 (Message`2 message, Microsoft.Research.Naiad.Dataflow.StandardVertices.UnaryVertex`3 vertex) [0x00000] in <filename unknown>:0 
  at (wrapper delegate-invoke) System.Action`2<Microsoft.Research.Naiad.Dataflow.Message`2<string, Microsoft.Research.Naiad.Dataflow.Epoch>, Microsoft.Research.Naiad.Dataflow.StandardVertices.UnaryVertex`3<string, ParsedInputPono, Microsoft.Research.Naiad.Dataflow.Epoch>>:invoke_void_T1_T2 (Microsoft.Research.Naiad.Dataflow.Message`2<string, Microsoft.Research.Naiad.Dataflow.Epoch>,Microsoft.Research.Naiad.Dataflow.StandardVertices.UnaryVertex`3<string, ParsedInputPono, Microsoft.Research.Naiad.Dataflow.Epoch>)
  at Microsoft.Research.Naiad.Dataflow.Stage`2+<NewInput>c__AnonStorey2`1+<NewInput>c__AnonStorey3[Microsoft.Research.Naiad.Dataflow.StandardVertices.UnaryVertex`3[System.String,ParsedInputPono,Microsoft.Research.Naiad.Dataflow.Epoch],Microsoft.Research.Naiad.Dataflow.Epoch,System.String].<>m__0 (Message`2 m) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.ActionReceiver`2+<ActionReceiver>c__AnonStorey0[System.String,Microsoft.Research.Naiad.Dataflow.Epoch].<>m__0 (Message`2 m, ReturnAddress u) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.ActionReceiver`2[System.String,Microsoft.Research.Naiad.Dataflow.Epoch].OnReceive (Message`2 message, ReturnAddress from) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.Channels.PipelineChannel`2+Fiber[System.String,Microsoft.Research.Naiad.Dataflow.Epoch].Send (Message`2 records) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.VertexOutputBuffer`2[System.String,Microsoft.Research.Naiad.Dataflow.Epoch].Send (Message`2 message) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.VertexOutputBufferPerTime`2[System.String,Microsoft.Research.Naiad.Dataflow.Epoch].SendBuffer () [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.VertexOutputBufferPerTime`2[System.String,Microsoft.Research.Naiad.Dataflow.Epoch].Send (System.String record) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Dataflow.StreamingInputVertex`1[System.String].PerformAction (WorkItem workItem) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Scheduling.Scheduler+WorkItem.Run () [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Scheduling.Scheduler.Schedule (WorkItem workItem) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Scheduling.Scheduler.RunWorkItem (Int32 graphId) [0x00000] in <filename unknown>:0 
  at Microsoft.Research.Naiad.Scheduling.Scheduler.RunNotification (Int32 computationIndex) [0x00000] in <filename unknown>:0 
frankmcsherry commented 9 years ago

The code that seems to be crashing is auto-generated serialization code. Naiad should be producing this code as a file as a byproduct of the compilation process, and if you can snag a copy for us to check out that would help a bunch. Although the filenames will be a bit junky, if you can find the one with the corresponding method ParsedInputPono.TrySerializeMany(), that would be super helpful!

frankmcsherry commented 9 years ago

And just to double-check, is the "plain old .NET object" a struct or a class? A class may cause problems, as I believe the generated code just stuffs the fields with deserialized values, and doesn't go out of its way to ensure that the destination is allocated.

utaal commented 9 years ago

I've checked, the "plain old .NET object" is a struct. I'm not familiar with how compile-time code generation works in .net: can you please give an hint on where (how) to find the generated code? After recompiling (either with mono or Visual Studio Community) I've greped for TrySerializeMany in the entire project and the only matches were binary files (pdbs and dlls), documentation (xmls) and the generation code (AutoSerialization.cs).

frankmcsherry commented 9 years ago

So I'm trying to figure this out. On OS X it seems that it uses somewhat randomized temporary folder names, in /var/folders/, rather than in /tmp/ where I expected. If you make sure to set

Logging.LogLevel = LoggingLevel.Debug;
Logging.LogStyle = LoggingStyle.Console;

Then it should produce some output looking like:

12/9/2014 2:06:39 PM, 10527, Compiling codegen for Microsoft.Research.Naiad.Dataflow.IterationIn`1[[Microsoft.Research.Naiad.Dataflow.Epoch, Microsoft.Research.Naiad, Version=0.4.2.0, Culture=neutral, PublicKeyToken=null]]
12/9/2014 2:06:39 PM, 10729, Assembly: /var/folders/jw/smpnp67x2456_grxdp910tmm0000gn/T/g4bui8ua.u65/DebugObjects/3c803c63.dll

and if you pull out the directory from that, in this case

/var/folders/jw/smpnp67x2456_grxdp910tmm0000gn/T/g4bui8ua.u65/DebugObjects/

you should be able to see the .cs files it has generated. I did this with release_0.4, as a caveat, but I think things should be pretty similar (although the serialization code has changed, I think our logging of it probably hasn't).

utaal commented 9 years ago

Thanks for your help, I did not think about looking in /tmp or /var. It looks like the generator no longer logs the path of the generated objects. However I've managed to snatch a copy from our test machines (running Linux) from /tmp/.../DebugSource. This gist (https://gist.github.com/utaal/3de8ccb84ba215a76759) contains a copy of the original struct and the generated serialization code (I renamed the struct and the fields because it is possible the names could leak some information that we're not allowed to share - I'm probably being overcautious here, so if the rename gets in the way while trying to figure this out, just let me know and I'll try to get you a verbatim copy).

I just noticed there are unsafe section in the generated code: I should probably also mention we're experiencing occasional segfaults of the mono runtime (that we initially ascribed to bugs in mono itself).

frankmcsherry commented 9 years ago

That's a bunch of code! So, the only objects that might be null dereferences in TrySerializeMany (upon inspection) are the string fields of the struct (assuming we haven't passed a SubArray in with a null array; this could be possible, but surprising). It seems that the accesses to .Length, which are the things that would assert, are all guarded by testing that the string is non-null.

The two things I can think of are:

  1. We're sending in a bad subarray. That should be easy for us to test for / disconfirm.
  2. The String fields are being mutated. Nothing in C# prevents this, and if there is any reason why someone else might have a finger on the array, that could lead to a String being swapped out for null mid-serialization. If there is a bug in our buffer pooling code, or some buffer pooling that you folks are doing, that could lead to this issue. We could protect against it by capturing the String field before doing the null tests and serialization, but it would still be a bug.

I'll chat with Michael about a good way to work on this. I'm not allowed (I think) to contribute to the repo, for reasons of MSFT not accepting IP from non-MSFT folks. But, we'll try to get you some patched code with some diagnostics somehow.

michaelisard commented 9 years ago

I am happy to help. There is an implicit contract that you don’t mutate a buffer (or object) after you have sent it in a Naiad channel, and I suppose it’s conceivable your code is violating that. However if you are using structs, and just passing them to a VertexOutputBufferForTime object, it’s hard to see how it could be your fault since the struct will be copied, not passed by reference. As Frank suggests, this might be a Naiad bug…

If I were using Visual Studio for this I would ask it to break on thrown exceptions, and run the program single threaded (-t 1) in the debugger, at which point it would break at the time of the dereference in the generated code. I am afraid I have never used Mono: is there any way to do something similar?

If the exception happens when you run with a single process and a single thread, that definitely eliminates some possible bugs with mutating objects. Is that the case, or does it only happen when you run multi-threaded?

From: Frank McSherry [mailto:notifications@github.com] Sent: Wednesday, December 10, 2014 10:06 AM To: MicrosoftResearch/Naiad Subject: Re: [Naiad] Exception in auto-generated serialization code (#20)

That's a bunch of code! So, the only objects that might be null dereferences in TrySerializeMany (upon inspection) are the string fields of the struct (assuming we haven't passed a SubArray in with a null array; this could be possible, but surprising). It seems that the accesses to .Length, which are the things that would assert, are all guarded by testing that the string is non-null.

The two things I can think of are:

  1. We're sending in a bad subarray. That should be easy for us to test for / disconfirm.
  2. The String fields are being mutated. Nothing in C# prevents this, and if there is any reason why someone else might have a finger on the array, that could lead to a String being swapped out for null mid-serialization. If there is a bug in our buffer pooling code, or some buffer pooling that you folks are doing, that could lead to this issue. We could protect against it by capturing the String field before doing the null tests and serialization, but it would still be a bug.

I'll chat with Michael about a good way to work on this. I'm not allowed (I think) to contribute to the repo, for reasons of MSFT not accepting IP from non-MSFT folks. But, we'll try to get you some patched code with some diagnostics somehow.

— Reply to this email directly or view it on GitHubhttps://github.com/MicrosoftResearch/Naiad/issues/20#issuecomment-66495187.

michaelisard commented 9 years ago

Sorry, as Frank points out you definitely won’t have the bug single-process single-thread, since it won’t do serialization in that case… However 2-process single-thread each vs 2-process 2-thread each is still an interesting distinction.

From: Michael Isard Sent: Wednesday, December 10, 2014 10:12 AM To: 'MicrosoftResearch/Naiad'; MicrosoftResearch/Naiad Subject: RE: [Naiad] Exception in auto-generated serialization code (#20)

I am happy to help. There is an implicit contract that you don’t mutate a buffer (or object) after you have sent it in a Naiad channel, and I suppose it’s conceivable your code is violating that. However if you are using structs, and just passing them to a VertexOutputBufferForTime object, it’s hard to see how it could be your fault since the struct will be copied, not passed by reference. As Frank suggests, this might be a Naiad bug…

If I were using Visual Studio for this I would ask it to break on thrown exceptions, and run the program single threaded (-t 1) in the debugger, at which point it would break at the time of the dereference in the generated code. I am afraid I have never used Mono: is there any way to do something similar?

If the exception happens when you run with a single process and a single thread, that definitely eliminates some possible bugs with mutating objects. Is that the case, or does it only happen when you run multi-threaded?

From: Frank McSherry [mailto:notifications@github.com] Sent: Wednesday, December 10, 2014 10:06 AM To: MicrosoftResearch/Naiad Subject: Re: [Naiad] Exception in auto-generated serialization code (#20)

That's a bunch of code! So, the only objects that might be null dereferences in TrySerializeMany (upon inspection) are the string fields of the struct (assuming we haven't passed a SubArray in with a null array; this could be possible, but surprising). It seems that the accesses to .Length, which are the things that would assert, are all guarded by testing that the string is non-null.

The two things I can think of are:

  1. We're sending in a bad subarray. That should be easy for us to test for / disconfirm.
  2. The String fields are being mutated. Nothing in C# prevents this, and if there is any reason why someone else might have a finger on the array, that could lead to a String being swapped out for null mid-serialization. If there is a bug in our buffer pooling code, or some buffer pooling that you folks are doing, that could lead to this issue. We could protect against it by capturing the String field before doing the null tests and serialization, but it would still be a bug.

I'll chat with Michael about a good way to work on this. I'm not allowed (I think) to contribute to the repo, for reasons of MSFT not accepting IP from non-MSFT folks. But, we'll try to get you some patched code with some diagnostics somehow.

— Reply to this email directly or view it on GitHubhttps://github.com/MicrosoftResearch/Naiad/issues/20#issuecomment-66495187.

frankmcsherry commented 9 years ago

In the meantime, I'll start a review of the places we use buffer pooling!

frankmcsherry commented 9 years ago

So a quick look at our buffer pooling suggests:

  1. Just about all of the pooling is thread-local. This sort of pooling would be unlikely to result in races.
  2. However, there is a rebalancing that happens, lines 306-310 and lines 320-324 in BufferPool.cs. You could try commenting those out, at which point I think a race on typed buffers should be "impossible", since the pooling would be strictly thread-local at that point. The lines of code above look pretty harmless, but ... concurrency is hard and stuff. And, if we did stash a second copy of an array, the rebalancing would open up races, so could be a problem here (and you could comment out the rebalance to test).
  3. You could try editing Message's Allocate and Release methods to just allocate an array and drop it on the floor, respectively. These would be lines 74 and 90 in Channel.cs. This would rule out buffer pooling as a source of the null pointer dereference.

Another helpful bit of information would be to build the program with debug information, and run it with --debug flag, which the mono folks suggest will reveal line numbers (http://www.mono-project.com/docs/debug+profile/debug/). I don't know that this works for auto-generated code, but it would be good to check out.

I'll keep looking at how we're using the pooling to see if anything leaps out at me, but it's mostly sane looking so far, without additional clues.

frankmcsherry commented 9 years ago

Additional information: there is a counterpart to Michael's implicit contract about not changing something once you've sent it, which is that you should not hold on to a Message<S, T> or its .payload after OnRecv returns. Naiad assumes that it owns the message and will go off and use it for something else (including stashing it in a pool). Unlike messing with a buffer once it has been sent, this is much easier to do. No idea if it is what is going on, but I thought I would throw it out there.

utaal commented 9 years ago

Thank you both for your help. Regarding the contract:

it’s conceivable your code is violating that. However if you are using structs, and just passing them to a VertexOutputBufferForTime object, it’s hard to see how it could be your fault since the struct will be copied, not passed by reference. As Frank suggests, this might be a Naiad bug.

This struct is only used at the very beginning of the computation, as follows:

csvLineInput.SelectMany (csvLine => RenamedStruct.createAndSetNull(csvLine))
                    .NextStep(...)

where csvLineInput is a BatchedDataSource<string> and SelectMany is Naiad's Lindi.SelectMany.

there is a counterpart to Michael's implicit contract about not changing something once you've sent it, which is that you should not hold on to a Message<S, T> or its .payload after OnRecv returns. Naiad assumes that it owns the message and will go off and use it for something else (including stashing it in a pool).

NextStep.OnRecv is in fact modifying all the payloads and then storing them in a List for later use (they will be combined and flushed in a later OnNotify). We have a List<RenamedStruct> items and we run something equivalent to:

            for (int i = 0; i < message.length; ++i) {
                message.payload[i].field17 = message.time.epoch;
                items.Add(message.payload[i]);
            }

While I believe items.Add(message.payload[i]); should be fine (as the struct is copied when passed as an argument), message.payload[i].field17 = message.time.epoch; is most likely an issue. I'll change our code so we copy the struct before modifying it.

If I were using Visual Studio for this I would ask it to break on thrown exceptions, and run the program single threaded (-t 1) in the debugger, at which point it would break at the time of the dereference in the generated code. I am afraid I have never used Mono: is there any way to do something similar?

Another helpful bit of information would be to build the program with debug information, and run it with --debug flag, which the mono folks suggest will reveal line numbers (http://www.mono-project.com/docs/debug+profile/debug/).

The Exception does not happen on local test runs, for which we use a reduced input dataset and where debugging is feasible. It took a few tries for us to be able to run with some stability with the full input dataset (we're pushing around 50k input lines every second, with a pretty heavy computation) and it is still unclear whether the system is able to sustain the load for an extended period of time. If avoiding messing with the payload does not solve the issue, I'll try and run with --debug and see if we can stay up long enough to see the error again.

Thank you again for your help. I know I haven't addressed all your comments, I'll get back to you tomorrow morning (we're in UTC+1).

michaelisard commented 9 years ago

In fact, just modifying payload[i].field17 shouldn’t be an issue: the memory won’t be reused until after OnRecv returns so the danger is holding on to the memory after that, not within the call, and as you say copying into the list seems like it should prevent problems.

It’s pretty mysterious to me what could be happening. Looking back at your original stack trace it’s the send from SelectMany that is crashing, so I don’t think it’s likely to be a problem with NextStep. And SelectMany has been used a lot :) It would certainly be very helpful to know what line is generating the null dereference. It looks as if the problem should be triggered by a program that just does

csvLineInput.SelectMany (csvLine => RenamedStruct.createAndSetNull(csvLine)).Distinct();

so it might be worth testing with that just to reduce the sources of error.

From: Andrea Lattuada [mailto:notifications@github.com] Sent: Wednesday, December 10, 2014 12:15 PM To: MicrosoftResearch/Naiad Cc: Michael Isard Subject: Re: [Naiad] Exception in auto-generated serialization code (#20)

Thank you both for your help. Regarding the contract:

it’s conceivable your code is violating that. However if you are using structs, and just passing them to a VertexOutputBufferForTime object, it’s hard to see how it could be your fault since the struct will be copied, not passed by reference. As Frank suggests, this might be a Naiad bug.

This struct is only used at the very beginning of the computation, as follows:

csvLineInput.SelectMany (csvLine => RenamedStruct.createAndSetNull(csvLine))

                .NextStep(...)

where csvLineInput is a BatchedDataSource and SelectMany is Naiad's Lindi.SelectMany.

there is a counterpart to Michael's implicit contract about not changing something once you've sent it, which is that you should not hold on to a Message<S, T> or its .payload after OnRecv returns. Naiad assumes that it owns the message and will go off and use it for something else (including stashing it in a pool).

NextStep.OnRecv is in fact modifying all the payloads and then storing them in a List for later use (they will be combined and flushed in a later OnNotify). We have a List items and we run something equivalent to:

        for (int i = 0; i < message.length; ++i) {

            message.payload[i].field17 = message.time.epoch;

            items.Add(message.payload[i]);

        }

While I believe items.Add(message.payload[i]); should be fine (as the struct is copied when passed as an argument), message.payload[i].field17 = message.time.epoch; is most likely an issue. I'll change our code so we copy the struct before modifying it.

If I were using Visual Studio for this I would ask it to break on thrown exceptions, and run the program single threaded (-t 1) in the debugger, at which point it would break at the time of the dereference in the generated code. I am afraid I have never used Mono: is there any way to do something similar?

Another helpful bit of information would be to build the program with debug information, and run it with --debug flag, which the mono folks suggest will reveal line numbers (http://www.mono-project.com/docs/debug+profile/debug/).

The Exception does not happen on local test runs, for which we use a reduced input dataset and debugging is feasible. It took a few tries for us to be able to run with some stability with the full input dataset (we're pushing around 50k input lines every second, with a pretty heavy computation) and it is still unclear whether the system is able to sustain the load for an extended period of time. If avoiding messing with the payload does not solve the issue, I'll try and run with --debug and see if we can stay up long enough to see the error again.

Thank you again for your help. I know I haven't addressed all your comments, I'll get back to you tomorrow morning (we're in UTC+1).

— Reply to this email directly or view it on GitHubhttps://github.com/MicrosoftResearch/Naiad/issues/20#issuecomment-66516167.

utaal commented 9 years ago

In fact, copying RenamedStruct before modifying it doesn't solve the issue. I've managed to run the system with --debug and I have a trace with line numbers:

Unhandled Exception:
System.Exception: Error during Naiad execution ---> System.NullReferenceException: Object reference not set to an instance of an object
  at Microsoft.Research.Naiad.Serialization.AutoGenerated.RenamedStruct.TrySerializeMany (Microsoft.Research.Naiad.Serialization.SubArray1 values) [0x00582] in /tmp/2o0xdqc3.yx2/DebugObjects/288a7e0f.0.cs:562
  at Microsoft.Research.Naiad.Serialization.SendBufferPage.WriteElements[RenamedStruct] (NaiadSerialization1 elements) [0x00018] in /Users/utaal/Src/ETH/dsl/Naiad/Channels/IOHelper.cs:233
  at Microsoft.Research.Naiad.Serialization.AutoSerializedMessageEncoder1 records, Int32 srcVertexId) [0x00028] in /Users/utaal/Src/ETH/dsl/Naiad/Channels/IOHelper.cs:1744
  at Microsoft.Research.Naiad.Serialization.AutoSerializedMessageEncoder1 records, Int32 srcVertexId) [0x0000d] in /Users/utaal/Src/ETH/dsl/Naiad/Channels/IOHelper.cs:1730
  at Microsoft.Research.Naiad.Dataflow.Channels.RemoteMailbox2 message, ReturnAddress from) [0x00044] in /Users/utaal/Src/ETH/dsl/Naiad/Channels/MessageDelivery.cs:681
  at Microsoft.Research.Naiad.Dataflow.Channels.BufferingPostbox2[DCModel.RenamedStruct,Microsoft.Research.Naiad.Dataflow.Epoch].FlushBuffers () [0x0000a] in /Users/utaal/Src/ETH/dsl/Naiad/Channels/MessageDelivery.cs:260
  at Microsoft.Research.Naiad.Dataflow.Channels.BufferingPostbox2[DCModel.RenamedStruct,Microsoft.Research.Naiad.Dataflow.Epoch].Flush () [0x00073] in /Users/utaal/Src/ETH/dsl/Naiad/Frameworks/StandardVertices.cs:242
  at Microsoft.Research.Naiad.Dataflow.VertexOutputBuffer2[System.String,Microsoft.Research.Naiad.Dataflow.Epoch].Flush () [0x00018] in /Users/utaal/Src/ETH/dsl/Naiad/Dataflow/Endpoints.cs:250
  at Microsoft.Research.Naiad.Dataflow.Channels.PipelineChannel2[System.String,Microsoft.Research.Naiad.Dataflow.Epoch].Flush () [0x00073] in /Users/utaal/Src/ETH/dsl/Naiad/Frameworks/StandardVertices.cs:242
  at Microsoft.Research.Naiad.Dataflow.VertexOutputBuffer1[System.String].PerformAction (WorkItem workItem) [0x002e0] in /Users/utaal/Src/ETH/dsl/Naiad/Dataflow/Input.cs:510
  at Microsoft.Research.Naiad.Scheduling.Scheduler+WorkItem.Run () [0x0000d] in /Users/utaal/Src/ETH/dsl/Naiad/Runtime/Scheduling/Scheduler.cs:126
  at Microsoft.Research.Naiad.Scheduling.Scheduler.Schedule (WorkItem workItem) [0x00043] in /Users/utaal/Src/ETH/dsl/Naiad/Runtime/Scheduling/Scheduler.cs:185
  at Microsoft.Research.Naiad.Scheduling.Scheduler.RunWorkItem (Int32 graphId) [0x001f1] in /Users/utaal/Src/ETH/dsl/Naiad/Runtime/Scheduling/Scheduler.cs:447
  at Microsoft.Research.Naiad.Scheduling.Scheduler.RunNotification (Int32 computationIndex) [0x00011] in /Users/utaal/Src/ETH/dsl/Naiad/Runtime/Scheduling/Scheduler.cs:387

Unfortunately mono exited cleanly and /tmp/2o0xdqc3.yx2/DebugObjects/288a7e0f.0.cs was deleted, however I checked that the generated files we have in /tmp for RenamedStruct are identical, so I'm pretty sure that the line causing the issue is 562, https://gist.github.com/utaal/3de8ccb84ba215a76759#file-generated-renamedstruct-cs-L562 .

Note that this code is successfully (de)serializing tens of thousands of structs before failing and the failure does not seem deterministic.

frankmcsherry commented 9 years ago

Thanks, this is very helpful.

In the interest of you making progress in the meantime, I recommend disabling the buffer pooling by changing the lines in the Message struct's allocate and release methods. If we're right that it is a buffer pooling issue (and, it seems likely, given that I can't see why that could would assert otherwise) this should avoid it for you at the cost of spending a bit more time playing with the heap. I'll see if I can get some better info about what might be the cause, or some more diagnostics to put in there to help us pinpoint the root cause.

frankmcsherry commented 9 years ago

Actually, I'm sorry, I've changed my mind. :)

Reading the code more closely, it seems that the first thing that it does around the loop is make a copy of the struct. So, even if there was a buffer pooling issue, at this point we would have read some struct and it should be immutable for the rest of the loop.

The only thing I can think otherwise is unsafe code, and it is totally possible that some of our unsafe stuff has glitches, and accidentally stomps on memory it shouldn't. Are you still seeing roughly as many mono segfaults as before?

michaelisard commented 9 years ago

A few things are still confusing me. I don’t understand Mono or precisely how its exceptions are generated, but naively a NullReferenceException at that line could only be triggered if value.field11 is null. The only way that could happen given the control flow is if another thread is mutating value.field11. But since value is stack allocated, it’s not even heap corruption… It is conceivable that there is a memory error that repeatedly trashes a thread stack in the same place but never causes other crashes, but it seems quite hard to construct.

Of course it may be that the Mono exception doesn’t exactly mean what I think. I am surprised by the stack trace you included: did you delete some lines for obfuscation purposes? Otherwise there must have been some optimizations (the PipelineChannel holding a string calls straight into the VertexOutputBuffer holding a RenamedStruct) which might be misleading us. Is this a debug or release build?

My next step would be to simplify as much as possible. Can you try the program that just goes straight from the SelectMany to a Distinct and see if you can trigger the error? Otherwise, are there any other parts of the program that do anything that ought to make us suspicious, that might be causing memory corruption that shows up here? Do you use any serialized objects anywhere or are they all structs? Do you happen to use a struct containing a short anywhere? (Frank just found what looks like a bug in the short serialization code.)

Also, I would definitely try running –t 1 in each process: if you can cause a crash that way it rules out a whole bunch of potential bugs.

From: Frank McSherry [mailto:notifications@github.com] Sent: Thursday, December 11, 2014 7:58 AM To: MicrosoftResearch/Naiad Cc: Michael Isard Subject: Re: [Naiad] Exception in auto-generated serialization code (#20)

Actually, I'm sorry, I've changed my mind. :)

Reading the code more closely, it seems that the first thing that it does around the loop is make a copy of the structhttps://gist.github.com/utaal/3de8ccb84ba215a76759#file-generated-renamedstruct-cs-L416. So, even if there was a buffer pooling issue, at this point we would have read some struct and it should be immutable for the rest of the loop.

The only thing I can think otherwise is unsafe code, and it is totally possible that some of our unsafe stuff has glitches, and accidentally stomps on memory it shouldn't. Are you still seeing roughly as many mono segfaults as before?

— Reply to this email directly or view it on GitHubhttps://github.com/MicrosoftResearch/Naiad/issues/20#issuecomment-66639922.

utaal commented 9 years ago

The only thing I can think otherwise is unsafe code, and it is totally possible that some of our unsafe stuff has glitches, and accidentally stomps on memory it shouldn't. Are you still seeing roughly as many mono segfaults as before?

Our jobs last around 20 minutes and we can't parallelize, so I can't really say I have a good sample size but yes - mono is still randomly segfaulting.

Of course it may be that the Mono exception doesn’t exactly mean what I think. I am surprised by the stack trace you included: did you delete some lines for obfuscation purposes? Otherwise there must have been some optimizations (the PipelineChannel holding a string calls straight into the VertexOutputBuffer holding a RenamedStruct) which might be misleading us. Is this a debug or release build?

I've double-checked and the stack trace is verbatim with the exception of the name of the struct. The trace is then repeated in the output with the header 00:10:28.8248451, Cancelling execution of graph 0, due to exception: and followed by 00:10:28.8879019, Broadcasting graph failure message. This is a debug build ran with --debug.

My next step would be to simplify as much as possible. Can you try the program that just goes straight from the SelectMany to a Distinct and see if you can trigger the error? Also, I would definitely try running –t 1 in each process: if you can cause a crash that way it rules out a whole bunch of potential bugs.

Working on these now.

utaal commented 9 years ago

Running with csvLineInput.SelectMany (csvLine => RenamedStruct.createAndSetNull(csvLine)).Distinct().Subscribe(...) and -t 1 resulted in a SIGSEGV:

Stacktrace:

Native stacktrace:

        mono() [0x4b310c]
        mono() [0x50ac5f]
        mono() [0x423427]
        /lib/x86_64-linux-gnu/libpthread.so.0(+0xf030) [0x7f50acc65030]

Debug info from gdb:

Mono support loaded.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
[New Thread 0x7f50a42be700 (LWP 2435)]
[New Thread 0x7f50a42ff700 (LWP 2434)]
[New Thread 0x7f509edff700 (LWP 2432)]
[New Thread 0x7f509f7ff700 (LWP 2431)]
[New Thread 0x7f50a44d5700 (LWP 2430)]
[New Thread 0x7f509fbff700 (LWP 2429)]
[New Thread 0x7f50a46d6700 (LWP 2428)]
[New Thread 0x7f50a48d7700 (LWP 2427)]
[New Thread 0x7f50a57d7700 (LWP 2426)]
/tmp/buildd/gdb-7.4.1+dfsg/gdb/dwarf2-frame.c:683: internal-error: Unknown CFI encountered.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Quit this debugging session? (y or n) [answered Y; input not from terminal]
/tmp/buildd/gdb-7.4.1+dfsg/gdb/dwarf2-frame.c:683: internal-error: Unknown CFI encountered.
A problem internal to GDB has been detected,
further debugging may prove unreliable.
Create a core file of GDB? (y or n) [answered Y; input not from terminal]

=================================================================
Got a SIGSEGV while executing native code. This usually indicates
a fatal error in the mono runtime or one of the native libraries
used by your application.
=================================================================

I'll try running again to see if I can get a non-native Exception.

utaal commented 9 years ago

One more run (-t 1) resulted in a new inexplicable NullPointerException, this time in our code: https://gist.github.com/utaal/ae01eaa91ef6bccaba51 Note that field14, https://gist.github.com/utaal/3de8ccb84ba215a76759#file-renamedstruct-cs-L45, should never be null as https://gist.github.com/utaal/3de8ccb84ba215a76759#file-renamedstruct-cs-L149-L163 would fail with an array out-of-bounds error if string[] values didn't contain enough values. We're still only running csvLineInput.SelectMany (csvLine => RenamedStruct.createAndSetNull(csvLine)).Distinct().Subscribe(...).

We have access to a few Windows VMs, I'll try and see if we can run this on .net to potentially rule out a bug somewhere in mono.

michaelisard commented 9 years ago

If you could get the problem to manifest in a Windows VM that would definitely help, since at this point the program is simple enough that we should be able to repro it with purely synthetic data and I can debug from my end.

From: Andrea Lattuada [mailto:notifications@github.com] Sent: Thursday, December 11, 2014 3:00 PM To: MicrosoftResearch/Naiad Cc: Michael Isard Subject: Re: [Naiad] Exception in auto-generated serialization code (#20)

One more run (-t 1) resulted in a new inexplicable NullPointerException, this time in our code: https://gist.github.com/utaal/ae01eaa91ef6bccaba51 Note that field14, https://gist.github.com/utaal/3de8ccb84ba215a76759#file-renamedstruct-cs-L45, should never be null as https://gist.github.com/utaal/3de8ccb84ba215a76759#file-renamedstruct-cs-L149-L163 would fail with an array out-of-bounds error if string[] values didn't contain enough values. We're still only running csvLineInput.SelectMany (csvLine => RenamedStruct.createAndSetNull(csvLine)).Distinct().Subscribe(...).

We have access to a few Windows VMs, I'll try and see if we can run this on .net to potentially rule out a bug somewhere in mono.

— Reply to this email directly or view it on GitHubhttps://github.com/MicrosoftResearch/Naiad/issues/20#issuecomment-66705187.