Open engelsanchez opened 10 years ago
I forgot to add a sample entry from the crash log. Here it is: https://gist.github.com/engelsanchez/1fa788ec68cfec4f3e98
Also, please consider renaming riak_cs_block_server:start_block_servers to riak_cs_block_server:start_link_block_servers, or something else that indicates it links to the processes it starts. Grepping for 'link' in the get_fsm only uncovered the manifest_fsm pid.
Good idea @Vagabond.
The get FSM may be linked to a manifest FSM and a block reader process (potentially many readers, but currently only one). An abnormal exit from the manifest FSM is handled, but an abnormal exit from the block reader process is not. A large number of crashes were seen at a customer site where both the manifest FSM and the reader were exiting, and the reader's EXIT message arrived first, crashing the get FSM. This is the place where it could be handled:
https://github.com/basho/riak_cs/blob/bb2ddcff99fbf48469262cae33d8765c630dc5c2/src/riak_cs_get_fsm.erl#L368
In this particular case, the reader process crashed because it tried to use the riak client after the client had already stopped. That is probably a separate bug that needs to be filed, but more details on how it can happen need to be dug up first.
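
To illustrate the EXIT handling described above, here is a minimal standalone sketch. It is not the riak_cs code: the module name exit_sketch, the helper functions, and the exit reasons are all made up for illustration, and a plain process with trap_exit stands in for the get FSM. The point is only that the owner matches EXIT messages from both linked pids, so it does not matter which one arrives first.

```erlang
-module(exit_sketch).
-export([run/0]).

%% Hypothetical stand-in for the get FSM. Everything here is illustrative
%% and assumed; none of these names exist in riak_cs.
run() ->
    Parent = self(),
    %% Run the owner in its own process so the caller's trap_exit flag
    %% is left untouched.
    Owner = spawn(fun() -> Parent ! {self(), owner()} end),
    receive
        {Owner, Result} -> Result
    end.

owner() ->
    %% Turn exit signals from linked processes into {'EXIT', Pid, Reason}
    %% messages instead of dying with them.
    process_flag(trap_exit, true),
    %% Stand-ins for the manifest FSM and the block reader.
    Manifest = spawn_link(fun() -> receive stop -> exit(manifest_failed) end end),
    Reader = spawn_link(fun() -> receive stop -> exit(reader_failed) end end),
    %% Ask both to exit abnormally; the arrival order of the two EXIT
    %% messages is not guaranteed.
    Reader ! stop,
    Manifest ! stop,
    collect(Manifest, Reader, []).

%% Stop once both exits have been seen; sort so the result does not
%% depend on arrival order.
collect(_Manifest, _Reader, Seen) when length(Seen) =:= 2 ->
    lists:sort(Seen);
collect(Manifest, Reader, Seen) ->
    receive
        %% A clause for each linked pid, so neither EXIT is unexpected.
        {'EXIT', Manifest, Reason} ->
            collect(Manifest, Reader, [{manifest, Reason} | Seen]);
        {'EXIT', Reader, Reason} ->
            collect(Manifest, Reader, [{reader, Reason} | Seen])
    end.
```

Calling exit_sketch:run() returns [{manifest,manifest_failed},{reader,reader_failed}] whichever EXIT message is delivered first. In the real get FSM the equivalent would presumably be an extra clause for the reader pid next to the existing manifest FSM handling, but that is an assumption about the fix rather than a description of it.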