WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0

Worker freezes with 'Muting DataChannel' #3118

Closed by slfritchie 4 years ago

slfritchie commented 4 years ago

Is this a bug, feature request, or feedback?

Bug

What is the current behavior?

In a 2-worker cluster, a modest workload sent to initializer intermittently causes worker1 to freeze after it prints a series of 15 or more 'Muting DataChannel' messages. Work continues on initializer until the data channel(s) from initializer -> worker1 apply back pressure to initializer.

What is the expected behavior?

No freeze

What OS and version of Wallaroo are you using?

Ubuntu Bionic/18.04 LTS + Wallaroo @ commit 35d2038

Steps to reproduce?

  1. Use the instructions in #3117 to set up a virtual machine with 1 or 2 CPUs and at least 4GB RAM.
    • When building Machida3, you may want to use: env PONYCFLAGS="--verbose=1 --debug -Dresilience -Dtrace -Dcheckpoint_trace -Didentify_routing_ids" make
  2. Use the CSV file at https://gist.githubusercontent.com/slfritchie/065bb9325d1844c581067e90b9dae542/raw/3a33c348527d0c449bf7f6c449bd9ce4969a77ce/3118.csv as the input to the recipe below.
vagrant@ubuntu-bionic:/build2$ reset.sh
WARNING: all useful state files are deleted by this script!

vagrant@ubuntu-bionic:/build2$ start-cluster.sh 2
WARNING: all useful state files are deleted by this script!
Worker initializer: port = 7107
Worker worker1: port = 7117
Success

vagrant@ubuntu-bionic:/build2$ for i in `seq 1 3600`; do /bin/echo -n . ; cat /path/to/3118.csv  | ./frame-text-lines.py | nc -w 1 localhost 7100; done
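
For readers without the helper script: frame-text-lines.py is not included in this report, so the following is only a minimal sketch of what such a framing step might look like. It assumes each input line is sent as one message prefixed by a 4-byte big-endian length header (the convention used by Wallaroo's Python decoder examples) and that the line terminator is not part of the message. The function names are hypothetical.

```python
import struct


def frame_line(line: bytes) -> bytes:
    """Prefix one text line with a 4-byte big-endian length header.

    Assumption: the header format is ">I" and the trailing newline is
    stripped before framing. The real frame-text-lines.py may differ.
    """
    payload = line.rstrip(b"\r\n")
    return struct.pack(">I", len(payload)) + payload


def frame_lines(data: bytes) -> bytes:
    """Frame every non-blank line in `data` and concatenate the frames."""
    return b"".join(frame_line(l) for l in data.splitlines() if l.strip())
```

Under those assumptions, writing `frame_lines(open("/path/to/3118.csv", "rb").read())` to a TCP connection on localhost:7100 would correspond roughly to one iteration of the loop above.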

The bug may take up to an hour to manifest. Full logs are at http://wallaroolabs-dev.s3.amazonaws.com/logs/logs.1583818189.tar.gz. From /tmp/wallaroo.1:

1583816848.074484,_CheckpointEventLogPhase: check_completion() with 40 checkpointed and 51 total
1583816848.077920,Muting DataChannel
1583816848.077938,Muting DataChannel
1583816848.077948,Muting DataChannel
1583816848.077956,Muting DataChannel
1583816848.077964,Muting DataChannel
1583816848.077971,Muting DataChannel
1583816848.077979,Muting DataChannel
1583816848.077987,Muting DataChannel
1583816848.077995,Muting DataChannel
1583816848.078003,Muting DataChannel
1583816848.078010,Muting DataChannel
1583816848.078018,Muting DataChannel
1583816848.078025,Muting DataChannel
1583816848.078033,Muting DataChannel
1583816848.078041,Muting DataChannel
1583816848.078049,Muting DataChannel
1583816848.078057,Muting DataChannel
1583816848.078065,Muting DataChannel
1583816848.078072,Muting DataChannel
1583816848.078080,Muting DataChannel
1583816848.078088,Muting DataChannel
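
Since the freeze can take up to an hour to appear, it may help to scan the log for the burst automatically. The following is a rough sketch, not part of Wallaroo: it assumes every log line has the form `epoch_seconds,message` as in the excerpt above, and it reports each run of at least 15 consecutive 'Muting DataChannel' lines. The function name and the `min_run` parameter are hypothetical.

```python
from typing import Iterable, List, Tuple


def find_bursts(
    lines: Iterable[str],
    message: str = "Muting DataChannel",
    min_run: int = 15,
) -> List[Tuple[float, float, int]]:
    """Return (first_ts, last_ts, count) for each run of at least
    `min_run` consecutive lines whose message equals `message`."""
    bursts: List[Tuple[float, float, int]] = []
    run: List[float] = []  # timestamps of the current consecutive run

    def flush() -> None:
        # Record the run if it is long enough, then start a new one.
        if len(run) >= min_run:
            bursts.append((run[0], run[-1], len(run)))
        run.clear()

    for raw in lines:
        ts_text, sep, msg = raw.rstrip("\n").partition(",")
        try:
            ts = float(ts_text) if sep else None
        except ValueError:
            ts = None
        if ts is not None and msg == message:
            run.append(ts)
        else:
            # Any other line (or an unparseable one) ends the run.
            flush()
    flush()
    return bursts
```

For example, `find_bursts(open("/tmp/wallaroo.1"))` would return one tuple per burst. A burst that is the last thing in the log, with no later lines, matches the freeze described in this report.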