exponentially / extreme

Elixir Adapter for EventStore
MIT License
130 stars 31 forks source link

Bad argument in arithmetic expression in Extreme.slice_content/3 #50

Closed janpieper closed 6 years ago

janpieper commented 6 years ago

While processing many of events per second, we are randomly running into a bad argument in arithmetic expression error.

** (EXIT) an exception was raised:
        ** (ArithmeticError) bad argument in arithmetic expression
           (extreme) lib/extreme.ex:432: Extreme.slice_content/3
           (extreme) lib/extreme.ex:425: Extreme.process_package/2
           (extreme) lib/extreme.ex:407: Extreme.handle_info/2
           (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
           (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
           (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
            (elixir) lib/gen_server.ex:774: GenServer.call/3

We've used the Logger.debug from inside Extreme.slice_content to see that data that is coming in:

2018-03-09 12:09:14.407 [error] We have unfinished message of length nil(2): <<36, 0>>
burmajam commented 6 years ago

While searching logs in our environments, I couldn't find such message. It will be hard to reproduce. Can you please give me any clue how to do it. What is order of "processing many of events per second"? It would be great if you could reproduce it in a test. State.should_receive = nil is set in few places and that is situation you are facing with. We've been processing with listeners something around 20k events today and didn't have any exceptions

timbuchwaldt commented 6 years ago

We see this when pushing ~5.000 events in a short timespan to 11 persistent subscriptions, so in the order of 60.000 events in a few minutes flowing through the system.

This is also seemingly not bound to a specific event or anything, sometimes it happens - sometimes it doesn't.

We also saw a different behaviour that might be connected (not sure yet though): Sometimes Eventstore logs a heartbeat problem on one of our subscriptions which kind of smells like a full inbox in one of the subscription processes

janpieper commented 6 years ago

The ArithmeticError occurs after a command has failed with :aggregate_execution_failed:

19:37:02.654 [warn] Example.ProcessManagers.ExampleProcessManager has requested to stop: :aggregate_execution_failed
19:37:02.655 [debug] Extreme event store subscription "process_manager_a" down due to: :shutdown
19:37:02.655 [warn] Subscription to EventStore is down. Will retry in 1000 ms.
19:37:02.655 [debug] Extreme event store subscription "projection_a" down due to: :shutdown
19:37:02.655 [debug] Extreme event store subscription "process_manager_b" down due to: :shutdown
19:37:02.655 [debug] Extreme event store subscription "projection_b" down due to: :shutdown
19:37:02.655 [debug] Extreme event store subscription "projection_c" down due to: :shutdown
19:37:02.655 [debug] Extreme event store subscription "process_manager_c" down due to: :shutdown
19:37:02.656 [debug] Extreme event store subscription "projection_d" down due to: :shutdown
19:37:02.656 [debug] Extreme event store subscription "projection_e" down due to: :shutdown
19:37:02.656 [debug] Extreme event store subscription "projection_f" down due to: :shutdown
19:37:02.656 [debug] Extreme event store subscription "projection_g" down due to: :shutdown
19:37:02.656 [info] Connecting Extreme to eventstore-service.dev:1113
19:37:02.656 [debug] Extreme event store subscription "projection_h" down due to: :shutdown
19:37:02.657 [debug] Locating aggregate process for `Example.Aggregates.Example` with UUID "example-1781b6ad-ddef-5e62-ac16-853eb6e49a5a"
19:37:02.659 [error] GenServer Commanded.EventStore.Adapters.Extreme.EventStore terminating
** (ArithmeticError) bad argument in arithmetic expression
    (extreme) lib/extreme.ex:432: Extreme.slice_content/3
    (extreme) lib/extreme.ex:425: Extreme.process_package/2
    (extreme) lib/extreme.ex:407: Extreme.handle_info/2
    (stdlib) gen_server.erl:616: :gen_server.try_dispatch/4
    (stdlib) gen_server.erl:686: :gen_server.handle_msg/6
    (stdlib) proc_lib.erl:247: :proc_lib.init_p_do_apply/3
mjaric commented 6 years ago

Hi all,

@burmajam If I'm not mistaken, there is one code branch which is never executed and could be the reson why this occur.

If you check this line and look into code, you will see that such match {:unfinished_message, expected_message_length, data} will never be generated in code by procees_message function. Also the 3rd function process_content will take everything, BUT it resets state.should_receive to nil, which is exactly what @janpieper described at the bottom of issue report.

2018-03-09 12:09:14.407 [error] We have unfinished message of length nil(2): <<36, 0>>

So there is still part of message but should_receive is nil.

This is issue where received tcp package was just part of protobuf message. But what is interesting here and odds are a bit low, is probably that extreme received partial message as TCP packages more than once in row, so after reset on line extreme.ex:453 it finally hit this buggy case.

janpieper commented 6 years ago

We were able to reproduce the ArithmeticError in our codebase and @timbuchwaldt created a PR which fixes the issue for us.