WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0

Segfault after "this will never happen" failure. #2237

Open rachelblasucci opened 6 years ago

rachelblasucci commented 6 years ago

Following on from #2236 (and also potentially #2232, #2234, as these all create recovery files in /tmp).

3) Now, modify the code. Update the name of the reverse function to, e.g., reverse2 (both here and in application_setup):

@wallaroo.computation(name="reverse")
def reverse2(data):
    return data[::-1]

4) Re-run the application twice. The first run will fail with the same "invalid start byte" error. The second run will produce a segfault, with this output:

****CLUSTERING MODE is active****
****This is an enterprise feature. You may need to obtain a paid usage agreement to use this feature in production. See the Wallaroo Community License at https://github.com/WallarooLabs/wallaroo/blob/master/LICENSE.md for details, and also please visit the page at http://www.wallaroolabs.com/pricing****
****AUTOSCALE MODE is active****
****This is an enterprise feature. You may need to obtain a paid usage agreement to use this feature in production. See the Wallaroo Community License at https://github.com/WallarooLabs/wallaroo/blob/master/LICENSE.md for details, and also please visit the page at http://www.wallaroolabs.com/pricing****
||| Resilience directory: /tmp|||
Single worker topology
Recovering from recovery files!
Set up external channel listener on 127.0.0.1:5050
Running as Initializer...
recover_worker_names: initializer
Restarting a listener ...

---------------------------------------------------------
|v|v|v|Initializing Local Topology|v|v|v|

Segmentation fault

JONBRWN commented 6 years ago

@rachelreese Were there any other changes you made to reverse to cause the segfault? Following the instructions from #2236 and this issue, I could not get the "invalid start byte" error after recovering from the files produced by the application Fail in #2236. I noticed that the output in #2236 has the following GRAPH: "| reverse2New source | |reverse2New||" --> "reverse2New sink", which suggests that there were some other changes, possibly beyond just the names.

JONBRWN commented 6 years ago

After getting a reproducible segfault, I added the following to startup.pony's _setup_shutdown_handler function:

ifdef not "resilience" then
  SignalHandler(WallarooShutdownHandler(c, r, a), Sig.segv())
end

The program then appeared to enter an infinite loop, repeatedly re-running the code block that causes the segfault, so the SignalHandler never entered the apply function of the WallarooShutdownHandler.
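
For comparison, the same kind of hang can be reproduced outside Wallaroo. The sketch below is Python, not Pony, and says nothing definitive about how Pony's SignalHandler is implemented. It assumes a POSIX system and relies on the behaviour described in the Python signal documentation: a handler for a synchronous fault such as SIGSEGV returns to the faulting C code, which raises the same signal again. The names CHILD, on_segv and run_child are made up for this illustration.

```python
import subprocess
import sys
import textwrap

# Child program: install a Python-level SIGSEGV handler, then fault in C code.
CHILD = textwrap.dedent("""
    import ctypes
    import signal

    def on_segv(signum, frame):
        # Stand-in for a shutdown handler. Python defers this call until the
        # interpreter regains control, which is not expected to happen here.
        print("handler ran", flush=True)

    signal.signal(signal.SIGSEGV, on_segv)
    ctypes.string_at(0)  # dereference a null pointer inside C code
""")


def run_child(timeout_s=2.0):
    """Run the child; return ("hung" or "exited", captured stdout)."""
    try:
        done = subprocess.run(
            [sys.executable, "-c", CHILD],
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired as exc:
        # The child was still spinning on the fault when the timeout expired.
        out = exc.stdout or b""
        return "hung", out.decode(errors="replace")
    return "exited", done.stdout.decode(errors="replace")


if __name__ == "__main__":
    status, out = run_child()
    print(status, "| handler ran:", "handler ran" in out)
```

If the Pony runtime delivers signals to the handler asynchronously in a similar way, that would be consistent with the apply function never being reached, but that is an assumption rather than something confirmed in this thread.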

The segfault occurs in the @pony_deserialise function, specifically in pony_deserialise_offset.
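
As a rough illustration of why deserialising state written by an older version of the code can fail after a rename, here is a small Python sketch. It is not Wallaroo's recovery format: pickle, the module name app, and the helpers make_app, rename_reverse and try_restore are stand-ins chosen for the example. Pickle stores a function as a reference to its module and name, so the saved bytes stop resolving once the name changes.

```python
import pickle
import sys
import types


def make_app():
    """Create a throwaway module standing in for the application file."""
    app = types.ModuleType("app")  # hypothetical module name
    exec("def reverse(data):\n    return data[::-1]\n", app.__dict__)
    sys.modules["app"] = app  # make it importable so pickle can find it
    return app


def rename_reverse(app):
    """Simulate step 3: the function is now called reverse2."""
    app.reverse2 = app.reverse
    del app.reverse


def try_restore(blob):
    """Return the restored object, or the error raised while restoring."""
    try:
        return pickle.loads(blob)
    except AttributeError as exc:
        return exc


if __name__ == "__main__":
    app = make_app()
    saved = pickle.dumps(app.reverse)  # plays the role of a recovery file
    print(try_restore(saved)("abc"))   # old name still resolves
    rename_reverse(app)
    print(repr(try_restore(saved)))    # stale bytes no longer resolve
```

Python reports the mismatch as an exception. A native deserialiser that trusts the layout of its input has no such check, which is one way stale recovery files could end in a crash instead of an error message.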

rachelblasucci commented 6 years ago

Hmm. It did take me several tries to reproduce it. Once I could reproduce it, I went back to the original code and pared things down until only these steps were left, but that doesn't mean the system as a whole had reset itself (e.g. the recovery files, and anything else that might stick around like that). I can make a point of trying again on a fresh system, to see if there are additional steps.