cloudant / bigcouch

Putting the 'C' back in CouchDB
http://bigcouch.cloudant.com/
Apache License 2.0

could not load validation funs #56

Open markabey opened 13 years ago

markabey commented 13 years ago

I'm seeing errors when replicating an existing dataset into bigcouch 0.3.1. Bigcouch is throwing errors for the design documents. There doesn't seem to be any problem with the docs on couch 1.0.3, and they do get replicated into bigcouch (I can query them), but these errors appear in the logs:

    could not load validation funs {{badmatch, {function_clause, [{fabric,'-design_docs/1-fun-0-', [{error,timeout},

    Error in process <0.12889.45> on node 'bigcouch@X.X.X.107' with exit value: {function_clause,[{fabric,'-design_docs/1-fun-0-',[{error,{noproc,{gen_server,call,[<0.25484.43>,{pread_iolist,77411342},infinity]},[

more log info here: http://pastebin.com/xgX2YR7G

The list of design docs in the stack trace is huge and doesn't make clear which one is problematic.

Setup: the bigcouch 0.3.1 compiled rpm on CentOS 5.6 64-bit

- curl: 7.20.1
- erlang: R13B04 (erts-5.7.5)
- icu: 4.4.1
- js: 1.7

Running in a 3-node ring on 3 VMs, replicating from couchdb 1.0.3 on a local network.

bdionne commented 13 years ago

Thanks for the report, I'll take a look

rnewson commented 13 years ago

I'm struggling to reproduce this; my validate funs work fine on CentOS 5.6 64-bit using the 0.3.1 rpm.

markabey commented 13 years ago

I think I misread this; it seems I can't put any document into that database. Putting a design doc in shows the error above. Putting in a standard blank doc (just the id) through Futon gives the "could not load validation funs" server error, with this stack trace afterwards: http://pastebin.com/AfkkFCpf

markabey commented 13 years ago

Didn't mean to close, I keep clicking the wrong button... I get the internal server error when trying to delete documents too. Could it be that the database is corrupt? The other dbs in the bigcouch cluster seem to work correctly.

rnewson commented 13 years ago

If the database is small enough, could you make a copy of each shard? I'd like to try it locally. For whatever reason, you always seem to get a timeout. There's a separate bug that prevents this bubbling up as a sensible http error response (though I may, coincidentally, have fixed that this morning).

If it's too big to send, at least take a backup first and then try compacting each shard in turn. I'd be very interested in any errors you get, and in whether the problem persists afterward.
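Shard-by-shard compaction goes through the node-local interface rather than the clustered one. A sketch, assuming BigCouch's default node-local port 5986 and a made-up shard name (the real names come from listing the databases on that port):

```shell
# List the shard databases on the node-local port (5986 by default).
curl http://localhost:5986/_all_dbs

# Compact one shard. The shard name below is hypothetical -- substitute
# one returned above, URL-encoding the slashes as %2F.
curl -X POST -H 'Content-Type: application/json' \
  'http://localhost:5986/shards%2F00000000-1fffffff%2Fmydb/_compact'
```

As with plain CouchDB, _compact returns immediately and runs in the background; watch the shard's .compact file and the logs for errors.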

rnewson commented 13 years ago

On closer inspection this looks like a genuine timeout. I believe your servers are answering the internal requests too slowly and triggering our internal timeouts.

Here's where we start:

refresh_validate_doc_funs(Db) ->
    {ok, DesignDocs} = couch_db:get_design_docs(Db),

This code does not expect to fail because, in CouchDB, it can't fail (the data is local and Db contains an open file descriptor).

In a cluster, of course, it can fail. Here's what get_design_docs(Db) does in BigCouch:

get_design_docs(#db{name = <<"shards/", _/binary>> = ShardName}) ->
    {_, Ref} = spawn_monitor(fun() ->
        exit(fabric:design_docs(mem3:dbname(ShardName)))
    end),
    receive {'DOWN', Ref, _, _, Response} ->
        Response
    end;
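A minimal sketch (my own illustration, not the BigCouch source) of the trick above: exit/1 inside the spawned fun sets that process's exit reason, and the monitor hands the reason back to the caller in the 'DOWN' message, for success and failure alike.

    -module(monitor_call).
    -export([call/1]).

    %% Run Fun in a fresh process; return its result (or crash reason)
    %% as a plain term via the monitor's 'DOWN' message.
    call(Fun) ->
        {_Pid, Ref} = spawn_monitor(fun() -> exit(Fun()) end),
        receive
            {'DOWN', Ref, process, _, Result} -> Result
        end.

So monitor_call:call(fun() -> {ok, docs} end) returns {ok, docs}, and if the fun crashes the caller receives the crash reason as a term rather than an exception.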

After a little more indirection this takes us to fabric_view_all_docs.erl and this bit of code:

try rexi_utils:recv(Workers, #shard.ref, fun handle_message/3,
    State, infinity, 5000) of

While we will wait forever to receive all the answers, we expect a reply from a worker (any worker) within 5 seconds of the previous reply. If we don't get one, we do this:

 {timeout, NewState} ->
    Callback({error, timeout}, NewState#collector.user_acc);
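The receive-with-timeout shape behind that infinity/5000 pair can be sketched as follows (assumed names, not the rexi_utils source): each incoming reply restarts the `after` clock, so the loop can run indefinitely while replies keep flowing, but any single gap longer than PerMsgTimeout milliseconds falls into the timeout branch.

    %% Collect Pending replies; give up if any one reply is more than
    %% PerMsgTimeout ms behind the previous one.
    recv_all(0, _PerMsgTimeout, Acc) ->
        {ok, lists:reverse(Acc)};
    recv_all(Pending, PerMsgTimeout, Acc) ->
        receive
            {reply, Msg} ->
                recv_all(Pending - 1, PerMsgTimeout, [Msg | Acc])
        after PerMsgTimeout ->
            {timeout, Acc}
        end.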

The first argument of that callback ultimately becomes the result of couch_db:get_design_docs(Db), and you'll remember the code asserts:

 {ok, DesignDocs} = couch_db:get_design_docs(Db),

And that's the badmatch: obviously, {ok, DesignDocs} does not match {error, timeout}.
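The failing assertion reproduced in miniature in an Erlang shell:

    1> {ok, DesignDocs} = {error, timeout}.
    ** exception error: no match of right hand side value {error,timeout}

Inside couch_db that exception propagates as the badmatch in the logs above.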

rnewson commented 13 years ago

See https://github.com/cloudant/fabric/commit/b2a85603faf384ea427475c74637b3eb2d785b72 for a quick fix; it allows the short 5-second timeout to be set cleanly from config. I'll talk with the rest of the team about whether this is the right approach.