Yelp / pyleus

Pyleus is a Python framework for developing and launching Storm topologies.
Apache License 2.0

Bolt stops/quits after calling `fail` on a tuple #141

Closed imcom closed 9 years ago

imcom commented 9 years ago

I am testing the exclamation_bolt example with a small modification, shown below.

I added a second bolt after the existing one; it fails the tuple depending on the word length:

        if len(tup.values[0]) % 2 == 0:
            log.debug('try to fail tup and see what\'s happening')
            self.fail(tup)

        word = tup.values[0] + "+++"

What I've been observing is that when fail() fires, the bolt continues with the rest of the code and afterwards it simply vanishes or stops. I never see the second bolt running again.

here is the yaml definition:

topology:

    - spout:
        name: words
        module: exclamation_topology.test_word_spout
        parallelism_hint: 1

    - bolt:
        name: exclaim1
        module: exclamation_topology.exclamation_bolt
        parallelism_hint: 1
        groupings:
            - shuffle_grouping: words

    - bolt:
        name: exclaim2
        module: exclamation_topology.random_fail_bolt
        parallelism_hint: 1
        groupings:
            - shuffle_grouping: exclaim1

Either I have the fail mechanism all wrong, or this is a critical bug in pyleus. Please help me work it out.

Thanks in advance

imcom commented 9 years ago

Could anyone look into this severe bug? Whenever fail(tup) is called, the calling bolt stops and does not run again.

poros commented 9 years ago

It's a bit difficult to debug without seeing the code of the new bolt and the traceback of the error (if any).

My guess is that you are using SimpleBolt (because you copied and modified the code from the exclamation_bolt in the example), but you are trying to fail a tuple by yourself.

SimpleBolt will automatically ack any tuple which didn't trigger an exception during the execution of process_tuple() https://github.com/Yelp/pyleus/blob/develop/pyleus/storm/bolt.py#L177

The fail() method does not stop the execution of process_tuple(), but just sends a message to let Storm know that the tuple has been failed. https://github.com/Yelp/pyleus/blob/develop/pyleus/storm/bolt.py#L70
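To make the interaction concrete, here is a minimal sketch that runs without Storm or pyleus. `SimpleBoltSketch`, `run_once`, `Tup` and the `sent` list are hypothetical stand-ins, not the pyleus API; the only behaviour modelled is the one described above (ack after `process_tuple()` returns without raising, and `fail()` not interrupting execution).

```python
from collections import namedtuple

# Hypothetical tuple type carrying only the field this sketch needs.
Tup = namedtuple("Tup", ["values"])


class SimpleBoltSketch(object):
    """Stand-in for an auto-acking bolt; NOT the real pyleus SimpleBolt."""

    def __init__(self):
        self.sent = []  # messages that would be sent to Storm

    def ack(self, tup):
        self.sent.append("ack")

    def fail(self, tup):
        self.sent.append("fail")

    def process_tuple(self, tup):
        raise NotImplementedError

    def run_once(self, tup):
        # Assumed wrapper: ack is reached only if process_tuple() did not raise.
        self.process_tuple(tup)
        self.ack(tup)


class FailWithoutReturnBolt(SimpleBoltSketch):
    def process_tuple(self, tup):
        if len(tup.values[0]) % 2 == 0:
            self.fail(tup)  # only records a message; execution continues
        # the rest of the method still runs, then the wrapper acks


bolt = FailWithoutReturnBolt()
bolt.run_once(Tup(values=["even"]))  # "even" has 4 characters
print(bolt.sent)
```

With an even-length word the same tuple ends up with both a fail and an ack recorded.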

This means you are telling Storm that the tuple is both failed and acked. This is probably causing havoc in such a way that your bolt hangs indefinitely. If I am correct, inherit from Bolt instead of SimpleBolt and fail/ack the tuple yourself.
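A rough sketch of what that could look like follows. The `Bolt` class here is a recording stand-in so the example runs on its own; in a real topology it would presumably be `from pyleus.storm import Bolt` instead. `Tup`, the `calls` list and the class name `RandomFailBolt` are assumptions for illustration, and I have not run this against an actual cluster.

```python
from collections import namedtuple


# Stand-in for pyleus.storm.Bolt so the sketch runs without Storm.
# It only records calls; the real class sends messages to Storm.
class Bolt(object):
    def __init__(self):
        self.calls = []

    def ack(self, tup):
        self.calls.append(("ack", tup.id))

    def fail(self, tup):
        self.calls.append(("fail", tup.id))

    def emit(self, values, anchors=None):
        self.calls.append(("emit", values))


# Hypothetical tuple type with only the fields used below.
Tup = namedtuple("Tup", ["id", "values"])


class RandomFailBolt(Bolt):
    OUTPUT_FIELDS = ["word"]

    def process_tuple(self, tup):
        word = tup.values[0]
        if len(word) % 2 == 0:
            # Fail and return immediately so nothing later acks this tuple.
            self.fail(tup)
            return
        self.emit((word + "+++",), anchors=[tup])
        # No automatic ack in this variant, so ack explicitly.
        self.ack(tup)


bolt = RandomFailBolt()
bolt.process_tuple(Tup(id="1", values=["even"]))  # even length: failed only
bolt.process_tuple(Tup(id="2", values=["odd"]))   # odd length: emitted, acked
print(bolt.calls)
```

The point of the early `return` is that each tuple gets exactly one of fail or ack, never both.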

As I said at the beginning, this is just a guess, though.

imcom commented 9 years ago

It turns out that Pyleus bolts are very sensitive to the receive/send buffer sizes. Since I added the options below to the topology, with the values shown, I have not seen the bolts hang:

    topology.executor.receive.buffer.size: 16384
    topology.executor.send.buffer.size: 16384
    topology.transfer.buffer.size: 32

That said, @poros's explanation is most likely also true, and there may be a deeper issue: failing and acking the same tuple, or failing and then returning from process_tuple() in SimpleBolt, may also cause strange behaviour.