WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0

All workers crash when any non-initializer worker recovers after initializer recovered #954

Closed nisanharamati closed 7 years ago

nisanharamati commented 7 years ago

After the initializer worker has recovered from a crash, if any of the other workers crashes and recovers, this leads to a cluster-wide failure.

To reproduce:

  1. build sequence_window
    cd testing/correctness/apps/sequence_window
    stable env ponyc -d -D resilience
    mkdir res-data
  2. start giles receiver
    ../../../../giles/receiver/receiver --ponythreads=1 --ponynoblock \
    --ponypinasio -l 127.0.0.1:5555
  3. start initializer
    ./sequence_window -i 127.0.0.1:7000 -o 127.0.0.1:5555 -m 127.0.0.1:5001 \
    --ponythreads=4 --ponypinasio --ponynoblock -c 127.0.0.1:12500 \
    -d 127.0.0.1:12501 -r res-data -w 2 -n worker1 -t
  4. start worker
    ./sequence_window -i 127.0.0.1:7000 -o 127.0.0.1:5555 -m 127.0.0.1:5001 \
    --ponythreads=4 --ponypinasio --ponynoblock -c 127.0.0.1:12500 \
    -d 127.0.0.1:12501 -r res-data -w 2 -n worker2
  5. start giles sender
    ../../../../giles/sender/sender -h 127.0.0.1:7000 -s 1 -i 50_000_000 \
    --ponythreads=1 -y -g 12 -w -u -m 1000
  6. ctrl-c initializer
  7. restart initializer
  8. ctrl-c and restart giles sender (data validity doesn't matter here, so it doesn't matter if we restart the sequence data from 0)
  9. note that the application resumes properly
  10. ctrl-c worker
  11. restart worker
  12. everything crashes.
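
The steps above could be scripted roughly as follows. This is a sketch, not something that was run against this bug: it assumes it is started from testing/correctness/apps/sequence_window after the build in step 1, and `DRY_RUN`, `SETTLE`, and all function names are hypothetical helpers, not part of Wallaroo or giles. Only the flags already shown in the steps are used. By default it just prints the commands in order.

```shell
#!/bin/sh
# Sketch only: scripts steps 2-12 above.
# DRY_RUN, SETTLE and every function name here are hypothetical helpers.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = run them
SETTLE="${SETTLE:-5}"     # assumed pause in seconds between steps

# Command line for a sequence_window process (steps 3 and 4).
# $1 = worker name, $2 = optional extra flag (-t, as used in step 3).
worker_cmd() {
  echo "./sequence_window -i 127.0.0.1:7000 -o 127.0.0.1:5555" \
    "-m 127.0.0.1:5001 --ponythreads=4 --ponypinasio --ponynoblock" \
    "-c 127.0.0.1:12500 -d 127.0.0.1:12501 -r res-data -w 2" \
    "-n $1${2:+ $2}"
}

receiver_cmd() {
  echo "../../../../giles/receiver/receiver --ponythreads=1 --ponynoblock" \
    "--ponypinasio -l 127.0.0.1:5555"
}

sender_cmd() {
  echo "../../../../giles/sender/sender -h 127.0.0.1:7000 -s 1" \
    "-i 50_000_000 --ponythreads=1 -y -g 12 -w -u -m 1000"
}

# Start a command in the background and remember its pid under a label.
# $1 = label, $2 = command line.
launch() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "[$1] $2"
  else
    $2 &                  # unquoted on purpose so the string splits into words
    eval "PID_$1=$!"
  fi
}

# Stand-in for ctrl-c. A non-interactive shell starts background jobs with
# SIGINT ignored, so SIGTERM is sent instead; this only approximates ctrl-c.
stop() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "[$1] ctrl-c"
  else
    eval "pid=\$PID_$1"
    kill -TERM "$pid"
    wait "$pid" || true
  fi
}

pause() {
  [ "$DRY_RUN" = 1 ] || sleep "$SETTLE"
}

reproduce() {
  launch receiver "$(receiver_cmd)"               # step 2
  launch initializer "$(worker_cmd worker1 -t)"   # step 3
  launch worker "$(worker_cmd worker2)"           # step 4
  launch sender "$(sender_cmd)"                   # step 5
  pause
  stop initializer                                # step 6
  launch initializer "$(worker_cmd worker1 -t)"   # step 7
  stop sender                                     # step 8
  launch sender "$(sender_cmd)"
  pause                                           # step 9: check it resumed
  stop worker                                     # step 10
  launch worker "$(worker_cmd worker2)"           # step 11
  pause                                           # step 12: watch the output
}

reproduce
```

With `DRY_RUN=0` the processes are left running in the background when the script returns, so their output can be inspected; they have to be cleaned up by hand.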

SeanTAllen commented 7 years ago

What exactly does "crashes" mean here, @nisanharamati?

I get:

Sent control message to worker2
initializerdata : unable to listenThis should never happen: failure in /Users/sean/code/sendence/wallaroo/lib/wallaroo/data_channel/data_channel_tcp.pony at line 97

on the initializer, but the worker is still up. After that I end up in a loop where nothing can start up again: as I restart them, first one process and then the other fails while attempting recovery. This error ping-pongs from one worker to the other as I restart them:

Recovery error while calling method on Waiting for Boundary Counts Phase
Try restarting the process.
This should never happen: failure in /Users/sean/code/sendence/wallaroo/lib/wallaroo/recovery/recovery_replayer.pony at line 71