basho / riak_core

Distributed systems infrastructure used by Riak.
Apache License 2.0

Folsom error during riak_core shutdown. #388

Closed · lukebakken closed this issue 1 year ago

lukebakken commented 11 years ago

From the console.log of a Riak node: it appears the ETS table identifier is invalid because the table no longer exists, which causes the badarg error.

2013-09-13 14:47:10.353 [info] <0.723.0>@riak_kv_app:stop:187 Stopped  application riak_kv.
2013-09-13 14:47:10.433 [info] <0.208.0>@riak_core_app:prep_stop:101 Stopping application riak_core - disabling web services.
2013-09-13 14:47:11.085 [error] <0.15418.4619> gen_server <0.15418.4619> terminated with reason: bad argument in call to ets:select_delete(156574810873, [{{{'$1','_'},'_'},[{'<','$1',1379101570}],[true]}]) in folsom_sample_slide_uniform:trim/2 line 70
2013-09-13 14:47:11.119 [error] <0.15418.4619> CRASH REPORT Process <0.15418.4619> with 0 neighbours exited with reason: bad argument in call to ets:select_delete(156574810873, [{{{'$1','_'},'_'},[{'<','$1',1379101570}],[true]}]) in folsom_sample_slide_uniform:trim/2 line 70 in gen_server:terminate/6 line 747
2013-09-13 14:47:11.134 [error] <0.15738.4619> gen_server <0.15738.4619> terminated with reason: bad argument in call to ets:select_delete(156576285205, [{{{'$1','_'},'_'},[{is_integer,'$1'},{'<','$1',1379101570}],[true]}]) in folsom_metrics_spiral:trim/2 line 64
2013-09-13 14:47:11.164 [error] <0.15738.4619> CRASH REPORT Process <0.15738.4619> with 0 neighbours exited with reason: bad argument in call to ets:select_delete(156576285205, [{{{'$1','_'},'_'},[{is_integer,'$1'},{'<','$1',1379101570}],[true]}]) in folsom_metrics_spiral:trim/2 line 64 in gen_server:terminate/6 line 747
2013-09-13 14:47:11.212 [error] <0.15849.4619> gen_server <0.15849.4619> terminated with reason: bad argument in call to ets:select_delete(156576711030, [{{{'$1','_'},'_'},[{'<','$1',1379101570}],[true]}]) in folsom_sample_slide_uniform:trim/2 line 70
2013-09-13 14:47:11.235 [error] <0.15849.4619> CRASH REPORT Process <0.15849.4619> with 0 neighbours exited with reason: bad argument in call to ets:select_delete(156576711030, [{{{'$1','_'},'_'},[{'<','$1',1379101570}],[true]}]) in folsom_sample_slide_uniform:trim/2 line 70 in gen_server:terminate/6 line 747
2013-09-13 14:47:11.272 [error] <0.15502.4619> gen_server <0.15502.4619> terminated with reason: bad argument in call to ets:select_delete(156575204034, [{{{'$1','_'},'_'},[{is_integer,'$1'},{'<','$1',1379101570}],[true]}]) in folsom_metrics_spiral:trim/2 line 64
2013-09-13 14:47:11.287 [error] <0.15502.4619> CRASH REPORT Process <0.15502.4619> with 0 neighbours exited with reason: bad argument in call to ets:select_delete(156575204034, [{{{'$1','_'},'_'},[{is_integer,'$1'},{'<','$1',1379101570}],[true]}]) in folsom_metrics_spiral:trim/2 line 64 in gen_server:terminate/6 line 747
2013-09-13 14:47:11.321 [error] <0.232.0> Supervisor folsom_sample_slide_sup had child undefined started with folsom_sample_slide_server:start_link(folsom_sample_slide_uniform, 156574810873, 60) at <0.15418.4619> exit with reason bad argument in call to ets:select_delete(156574810873, [{{{'$1','_'},'_'},[{'<','$1',1379101570}],[true]}]) in folsom_sample_slide_uniform:trim/2 line 70 in context child_terminated
2013-09-13 14:47:11.346 [error] <0.232.0> Supervisor folsom_sample_slide_sup had child undefined started with folsom_sample_slide_server:start_link(folsom_metrics_spiral, 156576285205, 60) at <0.15738.4619> exit with reason bad argument in call to ets:select_delete(156576285205, [{{{'$1','_'},'_'},[{is_integer,'$1'},{'<','$1',1379101570}],[true]}]) in folsom_metrics_spiral:trim/2 line 64 in context child_terminated
2013-09-13 14:47:11.380 [error] <0.232.0> Supervisor folsom_sample_slide_sup had child undefined started with folsom_sample_slide_server:start_link(folsom_sample_slide_uniform, 156576711030, 60) at <0.15849.4619> exit with reason bad argument in call to ets:select_delete(156576711030, [{{{'$1','_'},'_'},[{'<','$1',1379101570}],[true]}]) in folsom_sample_slide_uniform:trim/2 line 70 in context child_terminated
2013-09-13 14:47:11.410 [error] <0.232.0> Supervisor folsom_sample_slide_sup had child undefined started with folsom_sample_slide_server:start_link(folsom_metrics_spiral, 156575204034, 60) at <0.15502.4619> exit with reason bad argument in call to ets:select_delete(156575204034, [{{{'$1','_'},'_'},[{is_integer,'$1'},{'<','$1',1379101570}],[true]}]) in folsom_metrics_spiral:trim/2 line 64 in context child_terminated
2013-09-13 14:47:11.448 [error] <0.232.0> Supervisor folsom_sample_slide_sup had child undefined started with folsom_sample_slide_server:start_link(folsom_metrics_spiral, 156575204034, 60) at <0.15502.4619> exit with reason reached_max_restart_intensity in context shutdown
2013-09-13 14:47:12.686 [error] <0.231.0> Supervisor folsom_sup had child folsom_metrics_histogram_ets started with folsom_metrics_histogram_ets:start_link() at <0.234.0> exit with reason shutdown in context shutdown_error
2013-09-13 14:47:12.775 [error] <0.231.0> Supervisor folsom_sup had child folsom_sample_slide_sup started with folsom_sample_slide_sup:start_link() at <0.232.0> exit with reason shutdown in context shutdown_error
2013-09-13 14:47:55.679 [info] <0.208.0>@riak_core_app:stop:110 Stopped  application riak_core.
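
For what it's worth, the failure is easy to reproduce in a shell; an ETS table identifier becomes invalid the moment the table is deleted (hypothetical table name, trimmed match spec):

    %% Any ETS call on a deleted table raises badarg, which is exactly
    %% what the trim processes hit during shutdown.
    Tab = ets:new(demo, [public]),
    true = ets:delete(Tab),
    %% Fails with "bad argument in call to ets:select_delete/2".
    ets:select_delete(Tab, [{{{'$1','_'},'_'},[{'<','$1',0}],[true]}]).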

See support ticket #5992

slfritchie commented 11 years ago

Perhaps @jaredmorrow, @Vagabond, and/or @russelldb have opinions on this? Would this issue be addressed by:

  1. merging the upstream branch in order to get https://github.com/boundary/folsom/pull/60, and then...
  2. using the new safely_notify* functions to avoid this harmless-but-customer-confusing race?
russelldb commented 11 years ago

I don't think that change would help: these errors come from processes that were started to trim the ETS tables every N seconds and now find their tables gone.

Is there a way they can monitor the existence of the table and just stop when the table is deleted?
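
Something like this, maybe; a minimal sketch only, with made-up state fields and a made-up trim message (the real folsom_sample_slide_server is shaped differently):

    %% Hypothetical state record for a trim server.
    -record(state, {sample_mod, reservoir, window}).

    %% Stop cleanly once the ETS table is gone instead of crashing
    %% with badarg during shutdown.
    handle_info(trim, #state{sample_mod = Mod, reservoir = Tab, window = Window} = State) ->
        case ets:info(Tab) of
            undefined ->
                {stop, normal, State};
            _Info ->
                Mod:trim(Tab, Window),
                erlang:send_after(timer:seconds(Window), self(), trim),
                {noreply, State}
        end.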

slfritchie commented 11 years ago

Hrm ... just putting a 'catch' around the trim call at https://github.com/basho/folsom/blob/master/src/folsom_sample_slide_server.erl#L61 seems a bit ugly: it could mask bugs that someone might like to know about.
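
For reference, a minimal sketch of that option, with a hypothetical wrapper name; even when narrowed to badarg it still hides failures a live table could raise:

    %% Ignores the badarg raised when the table no longer exists, at
    %% the cost of also hiding any other badarg from Mod:trim/2.
    safe_trim(Mod, Tab, Window) ->
        try
            Mod:trim(Tab, Window)
        catch
            error:badarg -> ok
        end.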

Another option would be to have all folsom tables owned by a single owner proc. Then the OTP application startup & shutdown ordering could take care of races like this one. The cost would be having to ask a single proc to create new ETS tables for you ... which might introduce a bottleneck that other Folsom users (or even Riak) might not appreciate?
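
A rough sketch of that single-owner idea, with a purely hypothetical module name and API:

    %% Hypothetical single-owner process: all folsom ETS tables are
    %% created by, and live and die with, this gen_server, so OTP
    %% application shutdown ordering controls when they disappear.
    -module(folsom_table_owner).
    -behaviour(gen_server).
    -export([start_link/0, new_table/2]).
    -export([init/1, handle_call/3, handle_cast/2, handle_info/2,
             terminate/2, code_change/3]).

    start_link() ->
        gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

    %% Every table creation funnels through one process, which is the
    %% potential bottleneck mentioned above.
    new_table(Name, Opts) ->
        gen_server:call(?MODULE, {new_table, Name, Opts}).

    init([]) ->
        {ok, []}.

    handle_call({new_table, Name, Opts}, _From, State) ->
        {reply, ets:new(Name, [public | Opts]), State}.

    handle_cast(_Msg, State) -> {noreply, State}.
    handle_info(_Msg, State) -> {noreply, State}.
    terminate(_Reason, _State) -> ok.
    code_change(_OldVsn, State, _Extra) -> {ok, State}.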

Another option would be to use the 'heir' property of the ETS tables to pass them to a folsom OTP app proc to be the owner. I don't recall which version of Erlang/OTP introduced the 'heir' thingie, sorry.
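
A sketch of what the heir handoff might look like, with hypothetical function and table names:

    %% Create a reservoir table whose heir is some long-lived folsom
    %% process. If the creating process exits, the table is not
    %% deleted; the heir becomes the new owner and receives an
    %% {'ETS-TRANSFER', Tid, FromPid, HeirData} message.
    new_reservoir(HeirPid) ->
        ets:new(slide_reservoir, [public, {heir, HeirPid, slide_reservoir}]).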

@joewilliams Thoughts?

russelldb commented 11 years ago

Yes, I think that is the right thing to do. It has been mentioned a few times on the folsom repo issue list too. So far, I haven't had time to do this work.

joewilliams commented 11 years ago

Seems reasonable. I'm in the same boat: I haven't had time to contribute to folsom recently to take care of things like this.

jrwest commented 10 years ago

Marking as milestone 2.0.1 since this seems to be an existing issue and we don't have a planned fix for 2.0-RC.

joewilliams commented 1 year ago

I got notified of this issue closing; may I recommend using https://github.com/folsom-project/folsom instead. We've made a number of fixes, improvements, etc. since Riak last synced with folsom, which might fix this. I no longer have access to the original folsom repo, so all the work goes into the new one now.

lukebakken commented 1 year ago

Thanks @joewilliams. I'm just closing issues to clean up my https://github.com/issues feed.

martinsumner commented 1 year ago

@joewilliams - Riak is now using folsom-project/folsom 1.0.0 in the latest release; thank you for the ongoing updates.

@lukebakken - apologies that you ended up with all those dead issues in your feed; happy for them to be closed, as I can't imagine anyone will get round to them. Riak is not quite dead yet, though: new releases are still being cut, and there are still some very big installations in active use. We're doing what we can to keep it alive!