EGA-archive / LocalEGA

A federated storage for sensitive data
http://localega.readthedocs.io
Apache License 2.0

Shovels and federated queues regular disconnections #104

Closed silverdaz closed 1 year ago

silverdaz commented 4 years ago

Even though we are using a non-zero heartbeat, the shovel disconnects very regularly. The federated queue does not seem to disconnect as often.

It might be a misconfiguration, but we get crash reports in the logs. I'm reporting it so we can have a look at it.

This is an example of a shovel crash report:

mq                | ** Reason for termination ==
mq                | ** {socket_error,timeout}
mq                | 2020-05-27 11:10:03.302 [error] <0.2977.0> CRASH REPORT Process <0.2977.0> with 0 neighbours exited with reason: {socket_error,timeout} in gen_server:handle_common_reply/8 line 726
mq                | 2020-05-27 11:10:03.304 [error] <0.2976.0> Supervisor {<0.2976.0>,amqp_connection_type_sup} had child main_reader started with amqp_main_reader:start_link({sslsocket,{gen_tcp,#Port<0.35467>,tls_connection,undefined},<0.2978.0>}, <0.2977.0>, <0.2980.0>, {method,rabbit_framing_amqp_0_9_1}, <<"client 192.168.224.5:35047 -> 192.168.224.3:5671">>) at <0.2982.0> exit with reason {socket_error,timeout} in context child_terminated
mq                | 2020-05-27 11:10:03.307 [error] <0.2976.0> Supervisor {<0.2976.0>,amqp_connection_type_sup} had child main_reader started with amqp_main_reader:start_link({sslsocket,{gen_tcp,#Port<0.35467>,tls_connection,undefined},<0.2978.0>}, <0.2977.0>, <0.2980.0>, {method,rabbit_framing_amqp_0_9_1}, <<"client 192.168.224.5:35047 -> 192.168.224.3:5671">>) at <0.2982.0> exit with reason reached_max_restart_intensity in context shutdown
mq                | 2020-05-27 11:10:03.308 [info] <0.2961.0> terminating static worker with {outbound_conn_died,{socket_error,timeout}}
mq                | 2020-05-27 11:10:03.310 [error] <0.2975.0> Supervisor {<0.2975.0>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.2976.0>, {amqp_params_network,<<"legatest">>,<<"legatest">>,<<"lega">>,"cega-mq",5671,2047,0,60,60000,[{...},...],...}) at <0.2977.0> exit with reason {socket_error,timeout} in context child_terminated
mq                | 2020-05-27 11:10:03.311 [error] <0.2975.0> Supervisor {<0.2975.0>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.2976.0>, {amqp_params_network,<<"legatest">>,<<"legatest">>,<<"lega">>,"cega-mq",5671,2047,0,60,60000,[{...},...],...}) at <0.2977.0> exit with reason reached_max_restart_intensity in context shutdown
mq                | 2020-05-27 11:10:03.326 [error] <0.2961.0> ** Generic server <0.2961.0> terminating
mq                | ** Last message in was {'EXIT',<0.2977.0>,{socket_error,timeout}}
mq                | ** When Server state == {state,undefined,undefined,undefined,undefined,to_cega,static,#{ack_mode => on_confirm,dest => #{add_forward_headers => false,add_timestamp_header => false,current => {<0.2977.0>,<0.2991.0>,<<"amqps://cega-mq:5671/lega">>},fields_fun => #Fun<rabbit_amqp091_shovel.17.14964843>,module => rabbit_amqp091_shovel,props_fun => #Fun<rabbit_amqp091_shovel.17.14964843>,resource_decl => #Fun<rabbit_amqp091_shovel.22.14964843>,unacked => #{},uris => ["amqps://legatest:legatest@cega-mq:5671/lega?heartbeat=60&connection_attempts=30&retry_delay=10&server_name_indication=cega-mq&verify=verify_peer&fail_if_no_peer_cert=true&cacertfile=/etc/rabbitmq/CA.cert&certfile=/etc/rabbitmq/ssl.cert&keyfile=/etc/rabbitmq/ssl.key"]},name => to_cega,reconnect_delay => 5,shovel_type => static,source => #{current => {<0.2964.0>,<0.2971.0>,<<"amqp://">>},delete_after => never,module => rabbit_amqp091_shovel,prefetch_count => 10,queue => <<>>,remaining => unlimited,remaining_unacked => unlimited,resource_decl => #Fun<rabbit_amqp091_shovel.22.14964843>,uris => ["amqp://"]}},undefined,undefined,undefined,undefined,undefined}
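
To double-check which connection settings the shovel is actually using, the upstream URI from the state dump above can be split into its parts. This is only a minimal sketch using Python's standard library; the helper name `connection_settings` is made up for illustration and is not part of LocalEGA or RabbitMQ.

```python
from urllib.parse import urlsplit, parse_qs

# Upstream URI copied from the shovel state in the crash report above.
URI = (
    "amqps://legatest:legatest@cega-mq:5671/lega"
    "?heartbeat=60&connection_attempts=30&retry_delay=10"
    "&server_name_indication=cega-mq&verify=verify_peer"
    "&fail_if_no_peer_cert=true&cacertfile=/etc/rabbitmq/CA.cert"
    "&certfile=/etc/rabbitmq/ssl.cert&keyfile=/etc/rabbitmq/ssl.key"
)


def connection_settings(uri):
    """Hypothetical helper: return host, port, vhost and query parameters."""
    parts = urlsplit(uri)
    # parse_qs returns a list per key; each key appears once here.
    params = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return {
        "host": parts.hostname,
        "port": parts.port,
        "vhost": parts.path.lstrip("/"),
        "params": params,
    }


settings = connection_settings(URI)
print(settings["host"], settings["port"], settings["vhost"])
for key in ("heartbeat", "connection_attempts", "retry_delay"):
    print(key, "=", settings["params"].get(key))
```

The same check applies to the federation upstream, since it uses an identical URI.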

And this is an example of a federated queue crash report:


mq                | ** Reason for termination ==
mq                | ** {socket_error,timeout}
mq                | 2020-05-27 11:13:13.147 [error] <0.3108.0> CRASH REPORT Process <0.3108.0> with 0 neighbours exited with reason: {socket_error,timeout} in gen_server:handle_common_reply/8 line 726
mq                | 2020-05-27 11:13:13.150 [error] <0.3107.0> Supervisor {<0.3107.0>,amqp_connection_type_sup} had child main_reader started with amqp_main_reader:start_link({sslsocket,{gen_tcp,#Port<0.35739>,tls_connection,undefined},<0.3109.0>}, <0.3108.0>, <0.3111.0>, {method,rabbit_framing_amqp_0_9_1}, <<"client 192.168.224.5:50147 -> 192.168.224.3:5671">>) at <0.3113.0> exit with reason {socket_error,timeout} in context child_terminated
mq                | 2020-05-27 11:13:13.157 [error] <0.3107.0> Supervisor {<0.3107.0>,amqp_connection_type_sup} had child main_reader started with amqp_main_reader:start_link({sslsocket,{gen_tcp,#Port<0.35739>,tls_connection,undefined},<0.3109.0>}, <0.3108.0>, <0.3111.0>, {method,rabbit_framing_amqp_0_9_1}, <<"client 192.168.224.5:50147 -> 192.168.224.3:5671">>) at <0.3113.0> exit with reason reached_max_restart_intensity in context shutdown
mq                | 2020-05-27 11:13:13.159 [error] <0.3106.0> Supervisor {<0.3106.0>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.3107.0>, {amqp_params_network,<<"legatest">>,<<"legatest">>,<<"lega">>,"cega-mq",5671,2047,0,60,60000,[{...},...],...}) at <0.3108.0> exit with reason {socket_error,timeout} in context child_terminated
mq                | 2020-05-27 11:13:13.161 [error] <0.3106.0> Supervisor {<0.3106.0>,amqp_connection_sup} had child connection started with amqp_gen_connection:start_link(<0.3107.0>, {amqp_params_network,<<"legatest">>,<<"legatest">>,<<"lega">>,"cega-mq",5671,2047,0,60,60000,[{...},...],...}) at <0.3108.0> exit with reason reached_max_restart_intensity in context shutdown
mq                | 2020-05-27 11:13:13.164 [error] <0.3092.0> ** Generic server <0.3092.0> terminating
mq                | ** Last message in was {'DOWN',#Ref<0.212453901.2469658625.127060>,process,<0.3122.0>,shutdown}
mq                | ** When Server state == {state,{amqqueue,{resource,<<"/">>,queue,<<"from_cega">>},true,false,none,[],<0.465.0>,[],[],[],[{vhost,<<"/">>},{name,<<"from_cega">>},{pattern,<<"from_cega">>},{'apply-to',<<"queues">>},{definition,[{<<"federation-upstream">>,<<"from_cega">>}]},{priority,0}],undefined,[],[rabbit_federation_queue],live,0,[],<<"/">>,#{user => <<"rmq-internal">>}},true,<0.3108.0>,<0.3122.0>,<0.3095.0>,<0.3102.0>,{upstream,[<<"amqps://legatest:legatest@cega-mq:5671/lega?heartbeat=60&connection_attempts=30&retry_delay=10&server_name_indication=cega-mq&verify=verify_peer&fail_if_no_peer_cert=true&cacertfile=/etc/rabbitmq/CA.cert&certfile=/etc/rabbitmq/ssl.cert&keyfile=/etc/rabbitmq/ssl.key">>],<<"from_cega">>,<<"v1.files">>,1000,1,5,none,none,false,'on-confirm',none,<<"from_cega">>,false},{upstream_params,<<"amqps://legatest:legatest@cega-mq:5671/lega?heartbeat=60&connection_attempts=30&retry_delay=10&server_name_indication=cega-mq&verify=verify_peer&fail_if_no_peer_cert=true&cacertfile=/etc/rabbitmq/CA.cert&certfile=/etc/rabbitmq/ssl.cert&keyfile=/etc/rabbitmq/ssl.key">>,{amqp_params_network,<<"legatest">>,<<"legatest">>,<<"lega">>,"cega-mq",5671,2047,0,60,60000,[{server_name_indication,"cega-mq"},{fail_if_no_peer_cert,true},{verify,verify_peer},{keyfile,"/etc/rabbitmq/ssl.key"},{certfile,"/etc/rabbitmq/ssl.cert"},{cacertfile,"/etc/rabbitmq/CA.cert"}],[#Fun<amqp_uri.12.95251685>,#Fun<amqp_uri.12.95251685>],[],[]},{amqqueue,{resource,<<"lega">>,queue,<<"v1.files">>},true,false,none,[],<0.465.0>,[],[],[],[{vhost,<<"/">>},{name,<<"from_cega">>},{pattern,<<"from_cega">>},{'apply-to',<<"queues">>},{definition,[{<<"federation-upstream">>,<<"from_cega">>}]},{priority,0}],undefined,[],[rabbit_federation_queue],live,0,[],<<"/">>,#{user => <<"rmq-internal">>}},<<"amqps://cega-mq:5671/lega">>,[{<<"uri">>,longstr,<<"amqps://cega-mq:5671/lega">>},{<<"queue">>,longstr,<<"from_cega">>}]},{0,nil}}
mq                | ** Reason for termination ==
mq                | ** {upstream_channel_down,shutdown}
mq                | 2020-05-27 11:13:13.164 [error] <0.3092.0> CRASH REPORT Process <0.3092.0> with 0 neighbours exited with reason: {upstream_channel_down,shutdown} in gen_server2:terminate/3 line 1166
mq                | 2020-05-27 11:13:13.168 [error] <0.467.0> Supervisor {<0.467.0>,rabbit_federation_link_sup} had child {upstream,[<<"amqps://legatest:legatest@cega-mq:5671/lega?heartbeat=60&connection_attempts=30&retry_delay=10&server_name_indication=cega-mq&verify=verify_peer&fail_if_no_peer_cert=true&cacertfile=/etc/rabbitmq/CA.cert&certfile=/etc/rabbitmq/ssl.cert&keyfile=/etc/rabbitmq/ssl.key">>],
mq                |           <<"from_cega">>,<<"v1.files">>,1000,1,5,none,none,false,
mq                |           'on-confirm',none,<<"from_cega">>,false} started with rabbit_federation_queue_link:start_link({{upstream,[<<"amqps://legatest:legatest@cega-mq:5671/lega?heartbeat=60&connection_attempts=30&r...">>],...},...}) at <0.3092.0> exit with reason {upstream_channel_down,shutdown} in context child_terminated
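
To see how often each kind of disconnection happens, the `CRASH REPORT` lines can be tallied by exit reason. The following is a rough sketch, assuming the log format shown above (docker-compose output from the `mq` container); the function name `tally_crashes` and the sample list are illustrative only, with the sample lines copied from the reports in this issue.

```python
import re
from collections import Counter

# Captures the exit reason of a "CRASH REPORT" line, e.g. {socket_error,timeout}.
CRASH_RE = re.compile(
    r"CRASH REPORT Process <[\d.]+> .*?exited with reason: (\{.*?\}) in "
)


def tally_crashes(lines):
    """Count crash reports per exit reason; other log lines are ignored."""
    counts = Counter()
    for line in lines:
        match = CRASH_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Lines copied from the crash reports above.
SAMPLE = [
    "mq | 2020-05-27 11:10:03.302 [error] <0.2977.0> CRASH REPORT Process <0.2977.0> "
    "with 0 neighbours exited with reason: {socket_error,timeout} "
    "in gen_server:handle_common_reply/8 line 726",
    "mq | 2020-05-27 11:10:03.308 [info] <0.2961.0> terminating static worker "
    "with {outbound_conn_died,{socket_error,timeout}}",
    "mq | 2020-05-27 11:13:13.147 [error] <0.3108.0> CRASH REPORT Process <0.3108.0> "
    "with 0 neighbours exited with reason: {socket_error,timeout} "
    "in gen_server:handle_common_reply/8 line 726",
    "mq | 2020-05-27 11:13:13.164 [error] <0.3092.0> CRASH REPORT Process <0.3092.0> "
    "with 0 neighbours exited with reason: {upstream_channel_down,shutdown} "
    "in gen_server2:terminate/3 line 1166",
]

counts = tally_crashes(SAMPLE)
for reason, count in counts.most_common():
    print(count, reason)
```

Running it over a full log file (for example by passing an open file object instead of `SAMPLE`) would give a rough picture of how regular the disconnections are.
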
silverdaz commented 1 year ago

This seems to be solved in later versions of RabbitMQ. Closing the issue.