Intermittent test failures

neilconway commented 12 years ago

test_fg_crash_shutdown_cb(CallbackTest):
RuntimeError: eventmachine not initialized: evma_open_datagram_socket
    /Users/neilc/.rvm/gems/ruby-1.9.3-p125/gems/eventmachine-0.12.10/lib/eventmachine.rb:915:in `open_udp_socket'
    /Users/neilc/.rvm/gems/ruby-1.9.3-p125/gems/eventmachine-0.12.10/lib/eventmachine.rb:915:in `open_datagram_socket'
    /Users/neilc/Dropbox/School/code/bud/test/tc_execmodes.rb:164:in `test_fg_crash_shutdown_cb'

Doesn't reproduce consistently, unfortunately.

sriram-srinivasan commented 12 years ago

On Mar 16, 2012, at 4:33 AM, Neil Conway wrote:

test_fg_crash_shutdown_cb(CallbackTest):
RuntimeError: eventmachine not initialized: evma_open_datagram_socket
   /Users/neilc/.rvm/gems/ruby-1.9.3-p125/gems/eventmachine-0.12.10/lib/eventmachine.rb:915:in `open_udp_socket'
   /Users/neilc/.rvm/gems/ruby-1.9.3-p125/gems/eventmachine-0.12.10/lib/eventmachine.rb:915:in `open_datagram_socket'
   /Users/neilc/Dropbox/School/code/bud/test/tc_execmodes.rb:164:in `test_fg_crash_shutdown_cb'

It reproduces consistently with a given seed.

The error is somewhere in the fork code, because the actual thing it is testing works fine. This is the behavior of a process to shut down cleanly. I ran AckOnBoot separately and it works just fine. So, there is some eventmachine related weirdness while forking. Is it possible for you to look into it?

--sriram.

p.s. My network connection is agonizingly slow at the moment, and I am unable to even load github (strangely blocks on gravatar, of all things!)

neilconway commented 12 years ago

Yep, looking into this now (thought I had fixed it, but some problems still seem to crop up occasionally). Test cases for correct signal handling are tricky :-\

neilconway commented 12 years ago

Okay, I pushed a few fixes for this that should address the problems -- let me know if you can repro more errors. I see this error in tc_channel:

  1) Failure:
test_loopback_tick(LoopbackTests) [/Users/neilc/Dropbox/School/code/bud/test/tc_channel.rb:375]:
Expected: [["foo", 1]]
  Actual: []

The test seems dubious (hard-coded sleep) -- I'll look into it in a few hours, unless you fix it first :)

sriram-srinivasan commented 12 years ago

On Mar 16, 2012, at 6:31 AM, Neil Conway wrote:

Okay, I pushed a few fixes for this that should address the problems -- let me know if you can repro more errors. I see this error in tc_channel:
 1) Failure:
test_loopback_tick(LoopbackTests) [/Users/neilc/Dropbox/School/code/bud/test/tc_channel.rb:375]:
Expected: [["foo", 1]]
 Actual: []
The test seems dubious (hard-coded sleep) -- I'll look into it in a few hours, unless you fix it first :)

I'll take a look at it as well in a few hours.

It fails reliably on running ts_bud --seed 42893 (but not just tc_channel). Clearly it is a dependency on some leftover state.

--s

sriram-srinivasan commented 12 years ago

1) Failure: test_loopback_tick(LoopbackTests) [/Users/neilc/Dropbox/School/code/bud/test/tc_channel.rb:375]: Expected: [["foo", 1]] Actual: []

This one's stumped me so far. Earlier it was passing if run by itself or as part of tc_channel. Then I saw that it regularly failed when run as part of ts_bud, so I started winnowing other tests to see if there was some cross test interference.

Now I can get it to fail with just tc_channel (with seed 42893), and once it starts failing, it fails reliably every single time. WIth BUD_DEBUG=1, I see that it sends the first message out on the socket, but doesn't receive anything. I'd introduced the sleep assuming a race condition, but 0.5 seconds ought to be enough for a few thousand messages!

--s

neilconway commented 12 years ago

Indeed; even waiting 2 seconds doesn't seem to be enough. The kernel is entitled to drop UDP datagrams at will (even over the loopback interface) -- that's one possible explanation.

neilconway commented 12 years ago

Other non-deterministic failures I've seen: very rarely, tc_halt hangs (in test_halt) -- I suspect a race between shutdown and callback registration. I've also seen this error in tc_callback:

  1) Error:
test_delta(CallbackTest):
Bud::ShutdownWithCallbacksError: Bud instance shutdown before sync_callback completed
    /Users/neilc/Dropbox/School/code/bud/lib/bud.rb:778:in `sync_callback'
    /Users/neilc/Dropbox/School/code/bud/lib/bud.rb:787:in `delta'
    /Users/neilc/Dropbox/School/code/bud/test/tc_callback.rb:137:in `block in test_delta'
    /Users/neilc/.rvm/rubies/ruby-1.9.3-p125/lib/ruby/1.9.1/timeout.rb:68:in `timeout'
    /Users/neilc/Dropbox/School/code/bud/test/tc_callback.rb:137:in `test_delta'

but it also only happens very rarely.

sriram-srinivasan commented 12 years ago

I'm quite sure that this is a race condition.

The EM loop runs in a separate thread, and bud.receive_data is called in that thread. There is no mutex between the test thread that calls tick and checks for inbound messages, and the EM thread. That is the reason there is a problem when tick separately from the test thread, but never when the test runs automatically after run_bg. The fix isn't a simple one like wrapping a mutex around bud.@inbound, because the EM thread calls tick_internal too.

We should go back to using EM without threads.

neilconway commented 12 years ago

Assuming you're talking about the test_loopback_tick() -- I don't see the problem. EM may invoke receive_data at any time, but that just inserts into @inbound. When running in non-async mode (manual ticks), contents of @inbound is only consumed when a tick fires -- so the user code can still safely examine Bud state once the tick() call returns.

neilconway commented 12 years ago

BTW, after fixing the weird behavior in the signal handling thread (032bf1d7412a6548340db9a5fd088e9e678e1619, https://github.com/bloom-lang/bud/commit/13c21537b3af97f0e4a355ed44a110a72b355de4), I can no longer reproduce the test_loopback_tick failure. The test_delta happens still happens but very rarely; I haven't had a chance to track it down yet.

neilconway commented 11 years ago

BTW, this seems to occur from time to time with the latest code. Seems to happen more often with MRI 1.8.

  1) Failure:
test_halt(TestHalt) [./tc_halt.rb:25]:
Bud::ShutdownWithCallbacksError expected but nothing was raised.

bloom-lang / bud

Intermittent test failures #263