basho / machi

Machi file store
Apache License 2.0
122 stars 23 forks source link

client error when chain UPI=[] (unavailable), also perhaps file descriptor/other leak? #37

Open slfritchie opened 8 years ago

slfritchie commented 8 years ago

When I was at dinner, running env EQC_TIMEOUT=3600 rebar skip_deps=true -v eunit suites=machi_ap_repair_eqc tests=prop_repair_par_test_ on the commit b5005c35263e79389c42b1808fce1171b44f4fb3 (branch ss-repair-with-partition-simulator) with the https://gist.github.com/slfritchie/12e40859a08d5e4a678a patch applied, I saw this when I returned.

At a minimum, the machi_cr_client should do something less silly when the chain is not available.

Also, there may be a resource leak (e.g. file descriptor) that caused this test to fail after about 52 iterations (estimate, based on test output).

.  Got stable chain: [{a,{1113,[a,b,c],[],[]}},{b,{1113,[a,b,c],[],[]}},{c,{1113,[a,b,c],[],[]}}]
==== Start post operations, stabilize and confirm results
TODO: Using ?WORST_PROJ, chain is not available <0.31552.2>
TODO: Using ?WORST_PROJ, chain is not available <0.31538.2>
  NOT YET stable chain: [{a,{1117,[c],[a,b],[]}},{b,{1117,[c],[a,b],[]}},{c,{1117,[c],[a,b],[]}}]
  NOT YET stable chain: [{a,{1117,[c],[a,b],[]}},{b,{1117,[c],[a,b],[]}},{c,{1117,[c],[a,b],[]}}]
  NOT YET stable chain: [{a,{1117,[c],[a,b],[]}},{b,{1117,[c],[a,b],[]}},{c,{1117,[c],[a,b],[]}}]
MissingFileSummary [{<<"pre^3ef5a24e-3833-4ce3-9b32-e24aaf968795^1">>,
                     {1094,[a,b]}},
                    {<<"pre^69061121-5f50-4610-ac0e-4d9df67e249d^2">>,
                     {1044,[]}},
                    {<<"pre^b6be4085-b1a4-44b7-93a2-7f907acf9c95^1">>,
                     {1084,[b,c]}}]
Make repair directives: ... done
Out-of-sync data for FLU a: 0.1 MBytes
Out-of-sync data for FLU b: 0.1 MBytes
Out-of-sync data for FLU c: 0.1 MBytes
Execute repair directives: .. done
  Got stable chain: [{a,{1119,[c,a,b],[],[]}},{b,{1119,[c,a,b],[],[]}},{c,{1119,[c,a,b],[],[]}}]
  Written=10, DATALOSS=0, Acceptable=5
  Failed=0, Critical=0

.  Got stable chain: [{a,{1113,[a,b,c],[],[]}},{b,{1113,[a,b,c],[],[]}},{c,{1113,[a,b,c],[],[]}}]
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32479.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32447.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32447.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32447.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32447.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32447.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32447.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32447.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32447.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32447.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32487.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32487.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32487.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32487.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32487.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32487.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32487.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32487.2>
==== Start post operations, stabilize and confirm results
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32455.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
  NOT YET stable chain: [{a,{1126,[c],[b,a],[]}},{b,{1126,[c],[b,a],[]}},{c,{1126,[c],[b,a],[]}}]
TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
MissingFileSummary [{<<"pre^58d8ff5c-2371-4e5e-85f8-4fca1f44eedf^1">>,
                     {1034,[]}},
                    {<<"pre^8e67a7c2-570d-41e6-bf48-9405bfd777c2^1">>,
                     {1054,[c]}},
                    {<<"pre^979fad11-8e0e-4674-af5f-58aa3e5be461^2">>,
                     {1034,[a]}},
                    {<<"pre^e7670123-4c9e-41a3-8c20-93518dca6e01^1">>,
                     {1064,[a,b]}}]
Make repair directives: .... done
Out-of-sync data for FLU a: 0.1 MBytes
Out-of-sync data for FLU b: 0.1 MBytes
Out-of-sync data for FLU c: 0.1 MBytes
Execute repair directives: ... done
  Got stable chain: [{a,{1128,[c,b,a],[],[]}},{b,{1128,[c,b,a],[],[]}},{c,{1128,[c,b,a],[],[]}}]
  Written=8, DATALOSS=0, Acceptable=6
  Failed=0, Critical=0

.TODO: Using ?WORST_PROJ, chain is not available <0.32463.2>
TODO: Using ?WORST_PROJ, chain is not available <0.32483.2>
*failed*
in function eqc:quickcheck/1 (../src/eqc.erl, line 1270)
in call from machi_ap_repair_eqc:'-prop_repair_par_test_/0-fun-1-'/3 (test/machi_ap_repair_eqc.erl, line 91)
**exit:{{badmatch,[]},
 [{machi_cr_client,do_append_head2,7,
                   [{file,"src/machi_cr_client.erl"},{line,330}]},
  {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,607}]},
  {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,639}]},
  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]}
  output:<<"Starting Quviq QuickCheck version 1.35.2
   (compiled at {{2015,6,23},{16,47,22}})
Licence for Basho reserved until {{2015,10,31},{2,57,8}}
">>
slfritchie commented 8 years ago

More info, when running with the SASL app running & logging errors.

=CRASH REPORT==== 30-Oct-2015::21:41:22 ===
  crasher:
    initial call: machi_cr_client:init/1
    pid: <0.1933.4>
    registered_name: []
    exception exit: {{badmatch,[]},
                     [{machi_cr_client,do_append_head2,7,
                          [{file,"src/machi_cr_client.erl"},{line,330}]},
                      {gen_server,try_handle_call,4,
                          [{file,"gen_server.erl"},{line,607}]},
                      {gen_server,handle_msg,5,
                          [{file,"gen_server.erl"},{line,639}]},
                      {proc_lib,init_p_do_apply,3,
                          [{file,"proc_lib.erl"},{line,237}]}]}
      in function  gen_server:terminate/7 (gen_server.erl, line 804)
[....]
    neighbour: [{pid,<0.182.0>},  %%% SLF note that this is a really "old" PID compared to the crasher
                  {registered_name,[]},
                  {initial_call,{erlang,apply,2}},
                  {current_function,{gen,do_call,4}},
                  {ancestors,[]},
                  {messages,
                   [{#Ref<0.0.0.8917>,{error,partition}},
[... a few hundred messages more ...]
                  {links,
                   [<0.31733.1>,<0.15066.2>,<0.22243.3>,<0.28288.3>,
                    <0.31857.3>,<0.4687.4>,<0.4779.4>,<0.4867.4>,<0.4907.4>,
[... very roughly, about 100,000 other process links ...]
                  {dictionary,
                   [{{memo,eqc_statem,non_interfering,
                      {machi_ap_repair_eqc,
                       {state,4,
                        {1445,935441,287549},
                        false,
                        [a,b,c,d],
                        [{a,a_chmgr},{b,b_chmgr},{c,c_chmgr},{d,d_chmgr}],
                        [{a,<0.8424.1>},
                         {b,<0.8425.1>},
                         {c,<0.8427.1>},
                         {d,<0.8429.1>}],
                        [{a,<0.8430.1>},
                         {b,<0.8437.1>},
                         {c,<0.8444.1>},
                         {d,<0.8451.1>},
                         {a,<0.8458.1>},
                         {b,<0.8465.1>},
                         {c,<0.8472.1>},
                         {d,<0.8479.1>},
                         {a,<0.8486.1>},
                         {b,<0.8493.1>},
                         {c,<0.8500.1>},
                         {d,<0.8507.1>}]},
[... are really super enormous process dictionary continues ...]
slfritchie commented 8 years ago

At the start of each QuickCheck test case, I check the # of procs and the proc limit:

.process_count = 3563 of 262144,
.process_count = 3544 of 262144,
.process_count = 4792 of 262144,
[...]
.process_count = 20874 of 262144,
.process_count = 21020 of 262144,
.process_count = 20879 of 262144,
.process_count = 20950 of 262144,
.process_count = 21060 of 262144,
.process_count = 21308 of 262144,
[...]
shino commented 8 years ago

One apparent point of process leak is that I forgot to stop some client processes in #state of EQC side :see_no_evil: