opie4624 opened 12 years ago
I've been seeing some weirdness like this off and on too. I started a cluster with 3 nodes, tried replicating a ~1GB database, and had the process fail several times. It would always make some amount of progress, but ultimately it died on me each time.
If someone can tell me what to grep for, I'll search my logs, but here are a couple of snippets that I found. Unfortunately I didn't run these tests in a very methodical way, and I've been having trouble reproducing these issues, so I'll try the scattershot approach in an effort to provide something valuable by accident :)
$ apt-cache show bigcouch
Package: bigcouch
Version: 0.4.0-1
Architecture: i386
Maintainer: Cloudant Support <support@cloudant.com>
Installed-Size: 35708
Depends: libc6 (>= 2.4), libcurl3-gnutls (>= 7.16.2-1), libgcc1 (>= 1:4.1.1), libicu44 (>= 4.4.1-1), libssl1.0.0 (>= 1.0.0), libstdc++6 (>= 4.1.1), libtinfo5 (>= 5.6+20070908), libmozjs185-cloudant, runit
Conflicts: couchdb
Filename: dists/oneiric/main/binary-i386/bigcouch_0.4.0-1_i386.deb
Size: 20362136
MD5sum: eb73e37d2513671a6ada29a6601a94d7
SHA1: fcd72360d49b8c124bbbdc5d7779bfbe0c89e30e
SHA256: 99d76b076e3a24bd49d841139e830c151e9f4fd1a7788ef5e2e5e5f0b95a725f
Section: misc
Priority: extra
Homepage: http://cloudant.com/
Description: BigCouch is a dynamo-style implementation of Apache CouchDB
I tried accessing a _conflicts view; this was returned to my browser (not from the logs; Chrome had this document returned to it):
{"error":"{badarg,[{erlang,port_command,
[#Port<8867.236416>,
[91,
[34,<<"map_doc">>,34],
44,
[123,
[34,<<"_id">>,34],
58,
[34,<<"9f3c9b75961cda4d862dd97bc3bf5b97">>,34],
44,
[34,<<"_rev">>,34],
58,
[34,<<"1-aa0667b938e101213a028b30414afdff">>,34],
44,
<snip>
93],
125],
93,10]]},
{couch_os_process,writeline,2},
{couch_os_process,writejson,2},
{couch_os_process,handle_call,3},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}","reason":"{gen_server,call,
[<8867.1698.124>,
{prompt,[<<"map_doc">>,
{[{<<"_id">>,<<"9f3c9b75961cda4d862dd97bc3bf5b97">>},
{<<"_rev">>,<<"1-aa0667b938e101213a028b30414afdff">>},
<snip>
{<<"type">>,<<"text/html">>}]}]}]}]}]}]},
infinity]}"}
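For context, the view I was hitting is just a generic conflicts-surfacing design doc, something along these lines (a sketch of the usual pattern, not my exact design doc):

```json
{
  "_id": "_design/conflicts",
  "views": {
    "conflicts": {
      "map": "function(doc) { if (doc._conflicts) { emit(doc._id, doc._conflicts); } }"
    }
  }
}
```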
Looking through /opt/bigcouch/var/log/bigcouch.log
I've been seeing a ton of other types of errors as well. They look like timeouts, but I can confirm that I did not intentionally take these nodes down.
I do know that these machines had very high CPU load (in the 5.0-10.0 range) for a 1-core, 1-CPU box, which may have caused them to become unresponsive. One of my 3 nodes did seem to actually die (sv status bigcouch reported that it was no longer running, and box load went to 0.0).
The strange thing to me is that the node reporting the errors below (domU-12-31-*****) was the local node that I was directly querying via curl, and not a remote node (and not the node that crashed!).
[Wed, 07 Mar 2012 16:57:41 GMT] [error] [<0.14949.21>] [--------] couch_rep_httpc retry post http://127.0.0.1:5985/requests/_bulk_docs in 0.5 seconds due to {error, req_timedout}
[Wed, 07 Mar 2012 16:57:45 GMT] [error] [emulator] [--------] Error in process <0.10793.23> on node 'bigcouch@domU-12-31-*****.compute-1.internal' with exit value: {{badmatch,{error,timeout}},[{couch_db,'-load_validation_funs/1-fun-1-',1}]}
[Wed, 07 Mar 2012 16:57:45 GMT] [error] [<0.5842.23>] [0e03c93d] could not load validation funs {{badmatch,
{error,
timeout}},
[{couch_db,
'-load_validation_funs/1-fun-1-',
1}]}
I was trying to replicate a database from Cloudant, so you'll see that in these errors. (I was also running this node on port 5985 instead of 5984 because I had haproxy on the standard port; might this have been causing my issues?)
[Wed, 07 Mar 2012 17:06:02 GMT] [info] [<0.18687.29>] [c6ad0c74] 127.0.0.1 127.0.0.1:5985 POST /requests/_bulk_docs 201 ok 21992
[Wed, 07 Mar 2012 17:06:24 GMT] [error] [<0.14781.21>] [--------] changes loop, process <0.14782.21> died with reason normal
[Wed, 07 Mar 2012 17:06:26 GMT] [error] [<0.14781.21>] [--------] ** Generic server <0.14781.21> terminating
** Last message in was {'EXIT',<0.14782.21>,normal}
** When Server state == {state,nil,nil,
[nil,
{http_db,
"http://*****:*****@*****.cloudant.com/scrape-requests-experimental/",
[],"_changes",
[{"User-Agent","CouchDB/1.1.1"},
{"Accept","application/json"}],
[{"style",all_docs},
{"heartbeat",10000},
{"since",
<<"11713-g1AAAADzeJzLYWBgYMlgTmFQSElKzi9KdUhJMtLLTS3KLElMT9VLzskvTUnMK9HLSy3JAapkSmRIsv___39WEgMDzzxUbSZ4tCU5AMmkeqhOrjriLcxjAZIMDUAKqHk_SDf3a-Ltheg-ANENsXtXFgCgXlAu">>},
{"feed",normal}],
get,nil,
[{stream_to,{<0.14781.21>,once}},
{stream_to,{<0.14781.21>,once}},
{response_format,binary}],
10,500,<0.21884.29>},
<<"11713-g1AAAADzeJzLYWBgYMlgTmFQSElKzi9KdUhJMtLLTS3KLElMT9VLzskvTUnMK9HLSy3JAapkSmRIsv___39WEgMDzzxUbSZ4tCU5AMmkeqhOrjriLcxjAZIMDUAKqHk_SDf3a-Ltheg-ANENsXtXFgCgXlAu">>,
[{<<"target">>,
{[{<<"url">>,
<<"http://127.0.0.1:5985/requests">>},
{<<"headers">>,
{[{<<"authorization">>,<<>>},
{<<"cookie">>,<<>>}]}}]}},
{<<"source">>,
<<"http://*****:*****@*****.cloudant.com/scrape-requests-experimental">>}]],
<<"11713-g1AAAADzeJzLYWBgYMlgTmFQSElKzi9KdUhJMtLLTS3KLElMT9VLzskvTUnMK9HLSy3JAapkSmRIsv___39WEgMDzzxUbSZ4tCU5AMmkeqhOrjriLcxjAZIMDUAKqHk_SDf3a-Ltheg-ANENsXtXFgCgXlAu
<0.21884.29>,
{1331,139984,622071},
false,255,<<>>,nil,
{[{[{<<"seq">>,
<<"11713-g1AAAADzeJzLYWBgYMlgTmFQSElKzi9KdUhJMtLLTS3KLElMT9VLzskvTUnMK9HLSy3JAapkSmRIsv___39WEgMDzzxUbSZ4tCU5AMmkeqhOrjriLcxjAZIMDUAKqHk_SDf3a-Ltheg-ANENsXtXFgCgXlAu">>},
{<<"id">>,
<<"3eb108cb2abcad43fa4e55f6464c1d8f">>},
{<<"changes">>,
[{[{<<"rev">>,
<<"1-2549e39a15d1ccde6625327e3d1b1f36">>}]}]}]},
<snip>
[{[{<<"seq">>,
<<"11459-g1AAAADzeJzLYWBgYMlgTmFQSElKzi9KdUhJMtLLTS3KLElMT9VLzskvTUnMK9HLSy3JAapkSmRIsv___39WEgMDTyiqNhM82pIcgGRSPVQnVwDxFuaxAEmGBiAF1LwfpJu7j3h7IboPQHRD7J6QBQA9rk8w">>},
{<<"id">>,
<<"b47942e6fe771dad981272a2423d12d4">>},
{<<"changes">>,
[{[{<<"rev">>,
<<"1-73761c067e8ada316d50e4668f3e2ce9">>}]}]}]}]},
nil}
** Reason for termination ==
** {<0.14782.21>,normal}
Which came coupled with:
[Wed, 07 Mar 2012 17:06:28 GMT] [error] [<0.14781.21>] [--------] {error_report,<0.85.0>,
{<0.14781.21>,crash_report,
[[{initial_call,
{couch_rep_changes_feed,init,
['Argument__1']}},
{pid,<0.14781.21>},
{registered_name,[]},
{error_info,
{exit,
{<0.14782.21>,normal},
[{gen_server,terminate,6},
{proc_lib,init_p_do_apply,3}]}},
{ancestors,
[<0.14274.21>,couch_rep_sup,
couch_primary_services,
couch_server_sup,<0.86.0>]},
{messages,
[{ibrowse_async_response,
{1331,139984,622071},
<<"{\"results\":[\n">>},
{ibrowse_async_response,
{1331,139984,622071},
<<"{\"seq\":\"11714-g1AAAADzeJzLYWBgYMlgTmFQSElKzi9KdUhJMtLLTS3KLElMT9VLzskvTUnMK9HLSy3JAapkSmRIsv___39WEgMDzzxUbSZ4tCU5AMmkeqhOrnriLcxjAZIMDUAKqHk_SDf3a-Ltheg-ANENsXtXFgCg3FAv\",\"id\":\"34aaf4477f01f800453c4683dfd1e875\",\"changes\":[{\"rev\":\"1-16eaea597400450fc527da6193f99271\"}]}">>},
{ibrowse_async_response,
{1331,139984,622071},
<<",\r\n{\"seq\":\"11715-g1AAAADzeJzLYWBgYMlgTmFQSElKzi9KdUhJMtLLTS3KLElMT9VLzskvTUnMK9HLSy3JAapkSmRIsv___39WEgMDzzxUbSZ4tCU5AMmkeqhOrgbiLcxjAZIMDUAKqHk_SDf3a-Ltheg-ANENsXtXFgChWlAw\",\"id\":\"a232fdb860718fb7ca9468f42449b772\",\"changes\":[{\"rev\":\"1-d50a8d6fdd0371a8ef53648e115e9a2b\"}]}">>},
<snip>
{ibrowse_async_response,
{1331,139984,622071},
<<",\r\n{\"seq\":\"11721-g1AAAADzeJzLYWBgYMlgTmFQSElKzi9KdUhJMtLLTS3">>},
{ibrowse_async_response,
{1331,139984,622071},
{error,closing_on_request}}]},
{links,[<0.14274.21>]},
{dictionary,
[{timeout,
{1331140015239976,
#Ref<0.0.11.105174>}}]},
{trap_exit,true},
{status,running},
{heap_size,46368},
{stack_size,24},
{reductions,6246576}],
[]]}}
and
[Wed, 07 Mar 2012 17:06:28 GMT] [error] [<0.14274.21>] [--------] {error_report,<0.85.0>,
{<0.14274.21>,crash_report,
[[{initial_call,
{couch_rep,init,['Argument__1']}},
{pid,<0.14274.21>},
{registered_name,[]},
{error_info,
{exit,
{<0.14782.21>,normal},
[{gen_server,terminate,6},
{proc_lib,init_p_do_apply,3}]}},
{ancestors,
[couch_rep_sup,couch_primary_services,
couch_server_sup,<0.86.0>]},
{messages,[{update_stats,docs_read,1}]},
{links,
[<0.14947.21>,<0.14949.21>,<0.95.0>,
<0.14945.21>]},
{dictionary,
[{task_status_update,
{{1331,139223,99367},0}}]},
{trap_exit,true},
{status,running},
{heap_size,6765},
{stack_size,24},
{reductions,78642}],
[{neighbour,
[{pid,<0.14949.21>},
{registered_name,[]},
{initial_call,{erlang,apply,2}},
{current_function,
{mochijson2,json_bin_is_safe,1}},
{ancestors,[]},
{messages,[]},
{links,[<0.14274.21>]},
{dictionary,[]},
{trap_exit,false},
{status,runnable},
{heap_size,832040},
{stack_size,60},
{reductions,76725392}]}]]}}
Sorry this post was so long; as I mentioned, I am not exactly sure what caused the crashes or how to reproduce them. If there's some way for me to be more helpful, I'm happy to do it!
In doing some debugging, I just tried doing a replication from one shard of the db on the backend (5986) port to a stand-alone, vanilla couchdb instance.
What I'm seeing is quite interesting. This is a push replication, from nodeA (the backend of bigcouch) to nodeB (vanilla couchdb).
(note: I've trimmed the Headers lines from these logs and obfuscated IPs)
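The replication was kicked off by POSTing to /_replicate on nodeA's backend port, with a body roughly like this (reconstructed from the shard and target names in the logs below; the target IP is obfuscated the same way):

```json
{
  "source": "shards/20000000-3fffffff/stream-data.1318037979",
  "target": "http://74..192:5984/stream-data/"
}
```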
NodeA's log:
[Fri, 09 Mar 2012 02:56:28 GMT] [debug] [<0.28284.0>] [--------] 'POST' /_replicate {1,1} from "74..192"
[Fri, 09 Mar 2012 02:56:28 GMT] [debug] [<0.28284.0>] [--------] OAuth Params: []
[Fri, 09 Mar 2012 02:56:28 GMT] [debug] [<0.28308.0>] [--------] found a replication log for shards/20000000-3fffffff/stream-data.1318037979
[Fri, 09 Mar 2012 02:56:28 GMT] [debug] [<0.28308.0>] [--------] found a replication log for http://74..192:5984/stream-data/
[Fri, 09 Mar 2012 02:56:28 GMT] [info] [<0.28284.0>] [--------] starting replication "0640402dc1595119b9ff16234d433df6" at <0.28308.0>
[Fri, 09 Mar 2012 02:56:32 GMT] [debug] [<0.28328.0>] [--------] missing_revs updating committed seq to 4233
[Fri, 09 Mar 2012 02:56:34 GMT] [debug] [<0.28429.0>] [--------] source doesn't need a full commit
[Fri, 09 Mar 2012 02:56:34 GMT] [info] [<0.28308.0>] [--------] recording a checkpoint for shards/20000000-3fffffff/stream-data.1318037979 -> http://74..192:5984/stream-data/ at source update_seq 4233
[Fri, 09 Mar 2012 02:56:57 GMT] [debug] [<0.28328.0>] [--------] missing_revs updating committed seq to 4566
[Fri, 09 Mar 2012 02:57:02 GMT] [debug] [<0.28926.0>] [--------] source doesn't need a full commit
[Fri, 09 Mar 2012 02:57:02 GMT] [info] [<0.28308.0>] [--------] recording a checkpoint for shards/20000000-3fffffff/stream-data.1318037979 -> http://74..192:5984/stream-data/ at source update_seq 4566
[Fri, 09 Mar 2012 02:57:33 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc retry post http://74..192:5984/stream-data/_bulk_docs in 0.5 seconds due to {error, req_timedout}
[Fri, 09 Mar 2012 02:58:09 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc retry post http://74..192:5984/stream-data/_bulk_docs in 1.0 seconds due to {error, req_timedout}
[Fri, 09 Mar 2012 02:58:46 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc retry post http://74..192:5984/stream-data/_bulk_docs in 2.0 seconds due to {error, req_timedout}
[Fri, 09 Mar 2012 02:59:24 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc retry post http://74..192:5984/stream-data/_bulk_docs in 4.0 seconds due to {error, req_timedout}
[Fri, 09 Mar 2012 03:00:03 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc retry post http://74..192:5984/stream-data/_bulk_docs in 8.0 seconds due to {error, req_timedout}
[Fri, 09 Mar 2012 03:00:48 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc retry post http://74..192:5984/stream-data/_bulk_docs in 16.0 seconds due to {error, req_timedout}
[Fri, 09 Mar 2012 03:01:39 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc retry post http://74..192:5984/stream-data/_bulk_docs in 32.0 seconds due to {error, req_timedout}
[Fri, 09 Mar 2012 03:02:47 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc retry post http://74..192:5984/stream-data/_bulk_docs in 64.0 seconds due to {error, req_timedout}
[Fri, 09 Mar 2012 03:04:27 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc retry post http://74..192:5984/stream-data/_bulk_docs in 128.0 seconds due to {error, req_timedout}
[Fri, 09 Mar 2012 03:07:10 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc retry post http://74..192:5984/stream-data/_bulk_docs in 256.0 seconds due to {error, req_timedout}
[Fri, 09 Mar 2012 03:12:02 GMT] [error] [<0.28332.0>] [--------] couch_rep_httpc request failed after 10 retries: http://74..192:5984/stream-data/
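As an aside, the retry pattern above is a plain doubling backoff: couch_rep_httpc starts at 0.5 seconds and doubles the delay after each failed attempt, giving up after 10 retries. A quick sketch of that schedule (my own illustration, not BigCouch code):

```python
# Reconstruct the retry schedule seen in the log above: a 0.5 s delay
# that doubles on each failed attempt, for 10 attempts total.
delays = [0.5 * 2 ** n for n in range(10)]

print(delays)       # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 64.0, 128.0, 256.0]
print(sum(delays))  # 511.5 -> over 8.5 minutes spent backing off before giving up
```

That matches the roughly 15 minutes between the first retry (02:57:33) and the final failure (03:12:02), once you add the time each request spent waiting before timing out.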
NodeB's log:
[debug] [<0.3038.0>] 'HEAD' /stream-data/ {1,1} from "69..244"
[info] [<0.3038.0>] 69..244 - - 'HEAD' /stream-data/ 200
[debug] [<0.3039.0>] 'GET' /stream-data/ {1,1} from "69..244"
[info] [<0.3039.0>] 69..244 - - 'GET' /stream-data/ 200
[debug] [<0.3040.0>] 'GET' /stream-data/_local/0640402dc1595119b9ff16234d433df6 {1,1} from "69..244"
[info] [<0.3040.0>] 69..244 - - 'GET' /stream-data/_local/0640402dc1595119b9ff16234d433df6 200
[debug] [<0.3041.0>] 'POST' /stream-data/_missing_revs {1,1} from "69..244"
[info] [<0.3041.0>] 69..244 - - 'POST' /stream-data/_missing_revs 200
[debug] [<0.3042.0>] 'POST' /stream-data/_missing_revs {1,1} from "69..244"
[info] [<0.3042.0>] 69..244 - - 'POST' /stream-data/_missing_revs 200
[debug] [<0.3055.0>] 'POST' /stream-data/_missing_revs {1,1} from "69..244"
[info] [<0.3055.0>] 69..244 - - 'POST' /stream-data/_missing_revs 200
[debug] [<0.3062.0>] 'POST' /stream-data/_missing_revs {1,1} from "69..244"
[info] [<0.3062.0>] 69..244 - - 'POST' /stream-data/_missing_revs 200
[debug] [<0.3076.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[debug] [<0.3077.0>] 'POST' /stream-data/_missing_revs {1,1} from "69..244"
[info] [<0.3076.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[info] [<0.3077.0>] 69..244 - - 'POST' /stream-data/_missing_revs 200
[debug] [<0.3081.0>] 'POST' /stream-data/_missing_revs {1,1} from "69..244"
[debug] [<0.3091.0>] 'POST' /stream-data/_ensure_full_commit {1,1} from "69..244"
[info] [<0.3091.0>] 69..244 - - 'POST' /stream-data/_ensure_full_commit 201
[debug] [<0.3092.0>] 'PUT' /stream-data/_local/0640402dc1595119b9ff16234d433df6 {1,1} from "69..244"
[info] [<0.3092.0>] 69..244 - - 'PUT' /stream-data/_local/0640402dc1595119b9ff16234d433df6 201
[info] [<0.3081.0>] 69..244 - - 'POST' /stream-data/_missing_revs 200
[debug] [<0.3096.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.3096.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[debug] [<0.3308.0>] 'POST' /stream-data/_missing_revs {1,1} from "69..244"
[info] [<0.3308.0>] 69..244 - - 'POST' /stream-data/_missing_revs 200
[debug] [<0.3348.0>] 'POST' /stream-data/_ensure_full_commit {1,1} from "69..244"
[info] [<0.3348.0>] 69..244 - - 'POST' /stream-data/_ensure_full_commit 201
[debug] [<0.3349.0>] 'PUT' /stream-data/_local/0640402dc1595119b9ff16234d433df6 {1,1} from "69..244"
[info] [<0.3349.0>] 69..244 - - 'PUT' /stream-data/_local/0640402dc1595119b9ff16234d433df6 201
[debug] [<0.3351.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[debug] [<0.3675.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.3351.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[debug] [<0.4011.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.3675.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[debug] [<0.4343.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.4011.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[debug] [<0.4700.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.4343.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[debug] [<0.5097.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.4700.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[debug] [<0.5565.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.5097.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[info] [<0.5565.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[debug] [<0.6171.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.6171.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[debug] [<0.7062.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.7062.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[debug] [<0.8530.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.8530.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
[debug] [<0.11145.0>] 'POST' /stream-data/_bulk_docs {1,1} from "69..244"
[info] [<0.11145.0>] 69..244 - - 'POST' /stream-data/_bulk_docs 201
What's weird is that every POST that timed out on the source was received by the destination. Remember, the destination isn't part of a cluster; it's just a standalone, vanilla couchdb instance.
I'm seeing the same issue when enabling continuous replication jobs between two bigcouches. Things seem to work for a period of time, then the process stops logging and stops responding over HTTP. Restarting the process starts the replication jobs again. I can't track down the root cause. Anyone got any leads on this?
@dhdanno: bigcouch is discontinued; its clustering code has been merged into upstream CouchDB.
Thanks for the update. I do realize this, but we're still using it in production, so I thought I'd ask. I believe the issue was the sheer volume of documents being replicated in some of the databases, which caused the process to crash. I ended up doing a manual seed of the data and then enabling replication... I believe this is the way it needs to be done!
Doing a pull replication from ClusterA to ClusterB:

{"source":"http://node1.clusterA:5984/db", "target":"http://node2.clusterB:5984/db"}

When using the _replicator db, the state goes to "triggered". When using _replicate, I get

{http_request_failed,<<"failed to replicate http://node2.clusterB:5984/db/">>}

returned. Logs from node2.clusterB during the problem occurrence are available here: https://raw.github.com/gist/c372d0fbd0f01a2e0fb9/58a96d00c5b58d4d457ece12813cbf013a932b5c/bigcouch.log.out
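For anyone comparing the two approaches: the _replicator version is the same source/target pair saved as a document in the _replicator database (sketch; the "_id" is an arbitrary name of my choosing):

```json
{
  "_id": "pull-db-from-clusterA",
  "source": "http://node1.clusterA:5984/db",
  "target": "http://node2.clusterB:5984/db"
}
```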