Closed tpitale closed 5 years ago
You say the stage is crashing, do you have the error report? Stacktrace and what not? Also, what is your GenStage version? Thank you.
gen_stage v 0.14.1
The error is:
GenServer :commands_consumer terminating
** (stop) no connection
Last message: {:DOWN, #Reference<0.2662159242.2400714754.68107>, :process, {:commands_notifier, :"node_name@ip_address"}, :noconnection}
Right. So everything is happening "according to the plan". The default subscription mode is permanent, which means that if the subscription breaks, whether because of a disconnection, a producer crash, etc., then the consumer will terminate too.
The big question here is what are you expecting to happen then.
I guess I'm fine with letting the Supervisor restart my ConsumerSupervisor … except that it logs this error every time.
If I had an opportunity to log my own message and track a stat I'd be happy. It seemed like the way to do that was to prevent the crash of the ConsumerSupervisor when I know why it's happening (noconnection).
I really can't see anything we could do here. If the supervisor is going down with a crash, then it is going to log it. :(
I guess I'm not sure why the ConsumerSupervisor has to crash when I know why it is crashing.
This remote node connection handling, the need to re-connect, seems like a common pattern.
Can I try a different subscription mode? Like, if I switch to temporary, will the underlying subscription try to reconnect to the remote node?
I guess I'm not sure why the ConsumerSupervisor has to crash when I know why it is crashing.
Exactly so its supervisor can pick it up and restart it.
There are some cases where "let it crash" points you towards things you want to fix, i.e. the error was caused by a bug in the logic and you want to avoid it in the future. But there are also operational cases, like the one above, where crashing the process and starting it all over again is exactly what you want. In this case the crash is desired and therefore the logging will accompany it.
If you want, you can change the mode to temporary when you do the subscription, which means that you expect the producer to crash and the consumer will continue running in those cases. But resubscription is not automatic, so you have to detect those crashes yourself and reconnect the consumer to the producer. It feels like you would be reimplementing the supervisor, but it is your call.
So, if I change it to temporary, would I receive that {:DOWN …} message? Or would I get the handle_cancel?
Thanks so much for your help, by the way!
No, it is a supervisor. Generally speaking, you can implement only the init callback of a supervisor (since allowing custom logic in the other callbacks of a supervisor could make it less reliable). So you would need a separate process that would also monitor the producer, receive the DOWN and resubscribe.
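A minimal sketch of such a separate process, using only the standard library. ProducerWatcher, the on_up callback and the retry interval are hypothetical names chosen for illustration, not part of GenStage. The sketch assumes the ConsumerSupervisor does not list the producer under subscribe_to in its own init, so that the watcher is the only place where subscriptions are created.

```elixir
defmodule ProducerWatcher do
  # Hypothetical helper, not part of GenStage: monitors a producer and
  # runs `on_up` each time the producer is (or becomes) available.
  use GenServer
  require Logger

  @retry_ms 1_000

  # opts: [producer: pid | name | {name, node}, on_up: (-> any)]
  def start_link(opts), do: GenServer.start_link(__MODULE__, Map.new(opts))

  @impl true
  def init(%{producer: _, on_up: _} = state) do
    send(self(), :watch)
    {:ok, state}
  end

  @impl true
  def handle_info(:watch, state) do
    case GenServer.whereis(state.producer) do
      nil ->
        # Producer is not registered yet: try again later.
        Process.send_after(self(), :watch, @retry_ms)

      target ->
        # Our own monitor, independent of the one GenStage sets up.
        Process.monitor(target)
        state.on_up.()
    end

    {:noreply, state}
  end

  def handle_info({:DOWN, _ref, :process, _object, reason}, state) do
    # This is the place to log a custom message and bump a stat.
    Logger.warning("producer down (#{inspect(reason)}), will resubscribe")
    Process.send_after(self(), :watch, @retry_ms)
    {:noreply, state}
  end
end
```

In this setup on_up could be something like fn -> GenStage.async_subscribe(consumer_supervisor, to: producer, cancel: :temporary) end, where consumer_supervisor and producer are your own names. The temporary mode matters: with the default permanent mode the ConsumerSupervisor would still terminate when the producer goes away.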
If your only concern is the message being logged, can't you ignore or discard it in your log processing pipeline? Maybe you can even add some metadata to make it easier to filter?
Okay, so that point is one that someone made in Slack. The remote subscriber should just be a ConsumerProducer and the ConsumerSupervisor can subscribe to that producer, locally.
It's nice to know that Supervisors should not handle other messages. Thanks!
@josevalim I changed this to be a Consumer only, without being a supervisor. I still can't seem to use handle_info on the DOWN message. Is that expected behavior, too?
Yes, because the DOWN message comes from a monitor set up by GenStage. If you want to hook into it, then you need to implement handle_cancel. You can also set up your own monitor.
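A rough sketch of the handle_cancel route for a plain consumer, assuming it subscribes with cancel: :temporary so that it survives the producer going away. CommandsNotifier, CommandsConsumer, the :resubscribe message and the retry interval are illustrative names, and the Mix.install line (Elixir 1.12+, gen_stage 1.x rather than the 0.14.1 used in this thread) is only there to make the script standalone; in a Mix project you would drop it.

```elixir
# Assumption: standalone script. In a Mix project, remove this line.
Mix.install([{:gen_stage, "~> 1.0"}])

defmodule CommandsNotifier do
  # Hypothetical stand-in for the remote producer; emits no events.
  use GenStage

  def start(name), do: GenStage.start(__MODULE__, :ok, name: name)

  @impl true
  def init(:ok), do: {:producer, :ok}

  @impl true
  def handle_demand(_demand, state), do: {:noreply, [], state}
end

defmodule CommandsConsumer do
  use GenStage
  require Logger

  @retry_ms 1_000

  def start_link(producer), do: GenStage.start_link(__MODULE__, producer)

  @impl true
  def init(producer) do
    # :temporary keeps this consumer alive when the subscription is cancelled.
    {:consumer, producer, subscribe_to: [{producer, cancel: :temporary}]}
  end

  @impl true
  def handle_events(_events, _from, producer), do: {:noreply, [], producer}

  @impl true
  def handle_cancel({_kind, reason}, _from, producer) do
    # Called instead of handle_info for the DOWN coming from GenStage's monitor.
    Logger.warning("lost #{inspect(producer)}: #{inspect(reason)}")
    Process.send_after(self(), :resubscribe, @retry_ms)
    {:noreply, [], producer}
  end

  @impl true
  def handle_info(:resubscribe, producer) do
    # Check first: as far as I can tell, a :temporary subscribe to a name
    # that is not registered is skipped without calling handle_cancel,
    # which would otherwise end the retry loop.
    if GenServer.whereis(producer) do
      GenStage.async_subscribe(self(), to: producer, cancel: :temporary)
    else
      Process.send_after(self(), :resubscribe, @retry_ms)
    end

    {:noreply, [], producer}
  end
end
```

Resubscription is not automatic in temporary mode, which is why the sketch schedules its own retry. For a disconnected remote node the reason passed to handle_cancel should be :noconnection, matching the error report above.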
Btw, since this is not an issue with GenStage, I will go ahead and close the issue again.
I can't seem to prevent my ConsumerSupervisor (subscribing to a remote producer) from crashing when the last message is {:DOWN, _ref, :process, _thing, :noconnection}.
I've tried adding a handle_info for the message and returning {:noreply, state} so that I can log a warning and increment a stat.
The ConsumerSupervisor is subscribing to a remote node which seems to be the source of the :noconnection reason.
Looking at this code here: https://github.com/elixir-lang/gen_stage/blob/a16cd70c280e41029cf397d468ab08be24142379/lib/gen_stage.ex#L1866 I can't determine if this applies in the case of a ConsumerSupervisor.
I would expect either handle_cancel to be called https://github.com/elixir-lang/gen_stage/blob/a16cd70c280e41029cf397d468ab08be24142379/lib/consumer_supervisor.ex#L406 or perhaps for the handle_info message to be passed to my module.
Thanks for any info/help!