WallarooLabs / wally

Distributed Stream Processing
https://www.wallaroolabs.com
Apache License 2.0
1.48k stars 68 forks source link

Machida3 segfaults when `combine` returns `None` #2725

Open pzel opened 5 years ago

pzel commented 5 years ago

Is this a bug, feature request, or feedback?

Bug / UI improvemnt.

What is the current behavior?

I have the following Aggregation class in an app that uses the new windowing API:

class MetricsAggregation(wallaroo.Aggregation):
    def initial_accumulator(self):
        return AggregatedMetric.identity()

    def update(self, metric, agg):
        agg.add(metric)

    def combine(self, m1, m2):
        AggregatedMetric.identity().add(m1).add(m2)

    def output(self, key, agg):
        return (key, agg.min, agg.max, sum(agg.latencies), agg.timestamp)

When this app runs, machida3 segfaults silently without printing anything to stdout/err. Examining the core file, I see:

Using host libthread_db library "/usr/lib/libthread_db.so.1".
Core was generated by `machida3 --ponythreads 1 --application-module metrics --cluster-initializer --i'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x000055af7da9f4e0 in py_decref (o=0x0)
    at /home/p/w/wallaroo/machida3/cpp/python-wallaroo.c:211
211       Py_DECREF(o);
[Current thread is 1 (Thread 0x7f63d5787700 (LWP 25748))]
(gdb) bt
#0  0x000055af7da9f4e0 in py_decref (o=0x0)
    at /home/p/w/wallaroo/machida3/cpp/python-wallaroo.c:211
#1  0x000055af7d806956 in Machida_val_dec_ref_oo (this=0x55af7de462d0, o=0x0)
    at machida.pony:870
#2  0x000055af7d8a004e in PyState_box__final_o (this=0x7f63d4bc04e0)
    at machida.pony:175
#3  0x000055af7daa6c76 in sweep_small ()
#4  0x000055af7daa77c6 in ponyint_heap_endgc ()
#5  0x000055af7daa0824 in ponyint_actor_run ()
#6  0x000055af7daa3053 in run_thread ()
#7  0x00007f63e72caf93 in start_thread (arg=<optimized out>)
    at pthread_create.c:486
#8  0x00007f63e66297cf in clone ()
    at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95
(gdb) Quit

This dump is not really helpful in diagnosing my problem. After some trial and error, I realize that the underlying issue is that I've forgotten to return from my `combine' function:

    def combine(self, m1, m2):
        AggregatedMetric.identity().add(m1).add(m2)

should be:

    def combine(self, m1, m2):
        return AggregatedMetric.identity().add(m1).add(m2)

The 'bug' being that nothing in my development experience with the app pointed me in the direction of this fix.

What is the expected behavior?

It would be nice if Wallaroo errored out and provided an explanation of what's expected, like:

ERROR: The function AggregatedMetric.combine did not return an Aggregate object. To address this, ensure that this function returns a new Aggregate.

We could additionally leverage this check internally to make sure that the object being returned from combine is not pointer-equal to either one of the arguments that were passed in.

What OS and version of Wallaroo are you using?

Linux tmr.local 4.19.8_1 #1 SMP PREEMPT Sat Dec 8 19:05:10 UTC 2018 x86_64 GNU/Linux

Wallaro branch metrics-app

Steps to reproduce?

  1. Check out wallaroo branch metrics-app
  2. cd examples/python/alerts_windowed
  3. Make start-app
  4. Confirm that the machida3 process running metrics.py has crashed
  5. Confirm that there is no relevant log message in log/metrics.log
  6. If you have core dumps enabled, confirm that machida3 has dumped core
pzel commented 5 years ago

Discussion summary: 1) Address the None issue with an appropriate error message.

2) Break out the correctness-checking as a separate topic (separate issue). Some ideas: a) "paranoid" mode (like Valgrind) that stores data and performs ongoing run-time validation (e.g. interleaving inputs & comparing results); b) Look at Python3 type annotations and try quick-checking with that info; c) Opt-in pre-flight correctness check.

Point 2. above is not urgent.