apache / couchdb

Seamless multi-master syncing database with an intuitive HTTP/JSON API, designed for reliability
https://couchdb.apache.org/
Apache License 2.0

no match of right hand value {error,enospc} #5265

Open SourceR85 opened 1 month ago

SourceR85 commented 1 month ago

Description

I've set up a fresh CouchDB 3.4.1 instance (as a Docker image, built from https://github.com/apache/couchdb-docker/tree/main/3.4.1). Then I started a replication from the production server and saw endless messages of "no match of right hand value {error,enospc}".

Here is a (truncated) copy of the Docker log: couchdb.tar.gz

Your Environment

Additional Context

Docker Engine
 Version:    27.3.1
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.16.2-desktop.1
  compose: Docker Compose (Docker Inc.)
    Version:  v2.29.2-desktop.2
  desktop: Docker Desktop commands (Alpha) (Docker Inc.)
    Version:  v0.0.15

I've talked a bit with Jan on Slack; his first thoughts: https://app.slack.com/client/T49P1AZRT/C49LEE7NW

nickva commented 1 month ago

enospc from no match of right hand value {error,enospc} indicates we're probably running out of disk space [1]

It should be a more friendly message in the log, but at least at first sight that's what's jumping out.

[1] https://www.man7.org/linux/man-pages/man3/errno.3.html
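
Erlang reports file-system errors as lowercase POSIX atoms, so enospc corresponds to the errno ENOSPC ("No space left on device") described on that man page. To illustrate the friendlier log message suggested here, a minimal Python sketch (the helper name friendly_posix_error is hypothetical and not part of CouchDB) that turns such an atom into readable text:

```python
import errno
import os


def friendly_posix_error(atom: str) -> str:
    """Translate an Erlang-style POSIX error atom (e.g. 'enospc') into readable text."""
    name = atom.upper()                # Erlang uses lowercase errno names as atoms
    code = getattr(errno, name, None)  # e.g. errno.ENOSPC
    if not isinstance(code, int):
        return f"unknown error {atom}"
    return f"{name} ({code}): {os.strerror(code)}"


print(friendly_posix_error("enospc"))  # on Linux: ENOSPC (28): No space left on device
```

Going through the platform's errno table keeps the text in sync with the operating system rather than relying on a hand-maintained list.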

SourceR85 commented 1 month ago

enospc from no match of right hand value {error,enospc} indicates we're probably running out of disk space

~That's not a problem...~ I have 799.7 GB of 2 TB free (the DB I'm replicating is 86.1 GB)

nickva commented 1 month ago

Is there any chance the view directory is configured to write to another disk, or that the disks fail to mount and it ends up writing to the root file system? enospc is usually a transparent passthrough error from the FS layer.
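
To check the first hypothesis (the data or view directory silently landing on the root file system because a volume did not mount), a rough Python sketch, assuming Python can run somewhere the directory is visible (inside the container, or against the volume's path). Both helper names are hypothetical. os.path.ismount cannot reliably detect bind mounts on the same file system, so treat the result as a heuristic:

```python
import os


def nearest_mount_point(path: str) -> str:
    """Walk up from `path` until os.path.ismount() reports a mount point."""
    current = os.path.realpath(path)
    while not os.path.ismount(current):
        parent = os.path.dirname(current)
        if parent == current:  # reached "/" without finding a mount point
            break
        current = parent
    return current


def shares_root_filesystem(path: str) -> bool:
    """True if `path` is on the same device as "/" (no separate volume mounted there)."""
    return os.stat(path).st_dev == os.stat("/").st_dev
```

Called with the data directory, a True from shares_root_filesystem inside a container usually means writes are going to the container's own layer rather than to a mounted volume.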

The first instance in the logs seems to come from writing an attachment:

gen,do_call,4,[{file,"gen.erl"},{line,237}]},{gen_server,call,3,[{file,"gen_server.erl"},{line,381}]},
{couch_att,write_streamed_attachment,3,

Is there a way to reconfigure the data directory or point it to another volume? Or to test whether you can write to it manually? Verify that the data directory is indeed pointing to the mounted large volume; misconfigurations sometimes happen, and I've seen writes going to a different directory than the intended one.
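
A manual write test should push real bytes and flush them, because creating an empty file (as touch does) can succeed even when writing content fails. A minimal Python sketch along those lines, assuming Python is available wherever the data directory is visible; probe_data_dir and its 16 MiB default are hypothetical choices. Free space is reported as seen from where the script runs, which on Docker Desktop can differ from what the host shows:

```python
import errno
import os
import shutil
import tempfile


def probe_data_dir(path: str, probe_bytes: int = 16 * 1024 * 1024) -> dict:
    """Report free space for `path` and check that real content, not just an
    empty file, can be written and fsync'ed there. The probe file is removed."""
    usage = shutil.disk_usage(path)
    result = {
        "path": os.path.realpath(path),
        "free_bytes": usage.free,
        "total_bytes": usage.total,
        "write_ok": False,
        "error": None,
    }
    fd, tmp = -1, None
    chunk = b"\0" * (1024 * 1024)  # write in 1 MiB pieces
    try:
        fd, tmp = tempfile.mkstemp(dir=path, prefix=".write-probe-")
        remaining = probe_bytes
        while remaining > 0:
            remaining -= os.write(fd, chunk[:remaining])
        os.fsync(fd)  # also surface errors the OS defers until data is flushed
        result["write_ok"] = True
    except OSError as exc:
        # e.g. "ENOSPC" when the volume is full
        result["error"] = errno.errorcode.get(exc.errno, str(exc))
    finally:
        if fd >= 0:
            os.close(fd)
        if tmp is not None:
            os.remove(tmp)
    return result
```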

SourceR85 commented 1 month ago

As you expected: the Docker volume got stuck... Can't write content into data (just touch file works)

This is my Docker deployment (secrets removed): couchdb.tar.gz. There's nothing fancy in it, as far as I can tell...

nickva commented 1 month ago

Can't write content into data (just touch file works)

That would explain it, I think. Good find. It's sneaky that touch works though.

SourceR85 commented 1 month ago

Out of curiosity, I stopped the container, removed and recreated couchdb-data, and started the replication again: same result...

[notice] 2024-09-30T16:14:19.553744Z nonode@nohost <0.14636.101> -------- Retrying POST request to http://localhost:5984/hzd/_bulk_docs in 4.0 seconds due to error {code,500}
[error] 2024-09-30T16:14:19.574327Z nonode@nohost <0.16657.101> d5dfe20e02 rexi_server: from: nonode@nohost(<0.19120.101>) mfa: fabric_rpc:update_docs/3 exit:{{badmatch,{error,enospc}},[{couch_bt_engine,write_doc_body,2,[{file,"src/couch_bt_engine.erl"},{line,439}]},{couch_db_updater,'-flush_trees/3-fun-0-',6,[{file,"src/couch_db_updater.erl"},{line,384}]},{couch_key_tree,mapfold_simple,4,[{file,"src/couch_key_tree.erl"},{line,464}]},{couch_key_tree,mapfold_simple,4,[{file,"src/couch_key_tree.erl"},{line,473}]},{couch_key_tree,mapfold,3,[{file,"src/couch_key_tree.erl"},{line,457}]},{couch_db_updater,flush_trees,3,[{file,"src/couch_db_updater.erl"},{line,373}]},{couch_db_updater,update_docs_int,4,[{file,"src/couch_db_updater.erl"},{line,718}]},{couch_db_updater,handle_info,2,[{file,"src/couch_db_updater.erl"},{line,183}]}]} [{couch_db,collect_results,3,[{file,"src/couch_db.erl"},{line,1457}]},{couch_db,collect_results_with_metrics,3,[{file,"src/couch_db.erl"},{line,1439}]},{couch_db,write_and_commit,4,[{file,"src/couch_db.erl"},{line,1471}]},{couch_db,update_docs,4,[{file,"src/couch_db.erl"},{line,1333}]},{fabric_rpc,with_db,3,[{file,"src/fabric_rpc.erl"},{line,360}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,141}]}]
[info] 2024-09-30T16:14:19.574423Z nonode@nohost <0.243.0> -------- db shards/e0000000-ffffffff/hzd.1727710380 died with reason {{badmatch,{error,enospc}},[{couch_bt_engine,write_doc_body,2,[{file,"src/couch_bt_engine.erl"},{line,439}]},{couch_db_updater,'-flush_trees/3-fun-0-',6,[{file,"src/couch_db_updater.erl"},{line,384}]},{couch_key_tree,mapfold_simple,4,[{file,"src/couch_key_tree.erl"},{line,464}]},{couch_key_tree,mapfold_simple,4,[{file,"src/couch_key_tree.erl"},{line,473}]},{couch_key_tree,mapfold,3,[{file,"src/couch_key_tree.erl"},{line,457}]},{couch_db_updater,flush_trees,3,[{file,"src/couch_db_updater.erl"},{line,373}]},{couch_db_updater,update_docs_int,4,[{file,"src/couch_db_updater.erl"},{line,718}]},{couch_db_updater,handle_info,2,[{file,"src/couch_db_updater.erl"},{line,183}]}]}
[error] 2024-09-30T16:14:19.574887Z nonode@nohost <0.18010.101> -------- gen_server <0.18010.101> terminated with reason: no match of right hand value {error,enospc} at couch_bt_engine:write_doc_body/2(line:439) <= couch_db_updater:'-flush_trees/3-fun-0-'/6(line:384) <= couch_key_tree:mapfold_simple/4(line:464) <= couch_key_tree:mapfold_simple/4(line:473) <= couch_key_tree:mapfold/3(line:457) <= couch_db_updater:flush_trees/3(line:373) <= couch_db_updater:update_docs_int/4(line:718) <= couch_db_updater:handle_info/2(line:183)
  last msg: redacted
     state: {db,1,<<"shards/e0000000-ffffffff/hzd.1727710380">>,"./data/shards/e0000000-ffffffff/hzd.1727710380.couch",{couch_bt_engine,{st,"./data/shards/e0000000-ffffffff/hzd.1727710380.couch",<0.19406.101>,#Ref<0.3603940510.502005771.203208>,undefined,{db_header,8,30406,0,{9450247660,{29670,687,{size_info,9279630171,9278136634}},12600491},{9450249167,30357,11927090},{9448039553,[],2388},nil,nil,4251,1000,<<"2719778795232e78e860e5e8ab70c794">>,[{nonode@nohost,0}],0,1000,0},false,{btree,<0.19406.101>,{9450247660,{29670,687,{size_info,9279630171,9278136634}},12600491},fun couch_bt_engine:id_tree_split/1,fun couch_bt_engine:id_tree_join/2,undefined,fun couch_bt_engine:id_tree_reduce/2,snappy},{btree,<0.19406.101>,{9450249167,30357,11927090},fun couch_bt_engine:seq_tree_split/1,fun couch_bt_engine:seq_tree_join/2,undefined,fun couch_bt_engine:seq_tree_reduce/2,snappy},{btree,<0.19406.101>,{9448039553,[],2388},fun couch_bt_engine:local_tree_split/1,fun couch_bt_engine:local_tree_join/2,undefined,nil,snappy},snappy,{btree,<0.19406.101>,nil,fun couch_bt_engine:purge_tree_split/1,fun couch_bt_engine:purge_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy},{btree,<0.19406.101>,nil,fun couch_bt_engine:purge_seq_tree_split/1,fun couch_bt_engine:purge_seq_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy}}},<0.18010.101>,nil,30406,<<"1727712856444764">>,{user_ctx,null,[],undefined},[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}],[#Fun<couch_doc.7.91987333>],nil,nil,undefined,[{default_security_object,[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},replicated_changes,{user_ctx,{user_ctx,<<"groot">>,[<<"_admin">>],<<"cookie">>}},{w,"1"},{props,[{partitioned,true},{hash,[couch_partition,hash,[]]}]}],undefined}
    extra: []
[notice] 2024-09-30T16:14:19.574938Z nonode@nohost <0.19120.101> d5dfe20e02 localhost:5984 127.0.0.1 groot POST /hzd/_bulk_docs 500 ok 21
[error] 2024-09-30T16:14:19.575102Z nonode@nohost <0.18010.101> -------- gen_server <0.18010.101> terminated with reason: no match of right hand value {error,enospc} at couch_bt_engine:write_doc_body/2(line:439) <= couch_db_updater:'-flush_trees/3-fun-0-'/6(line:384) <= couch_key_tree:mapfold_simple/4(line:464) <= couch_key_tree:mapfold_simple/4(line:473) <= couch_key_tree:mapfold/3(line:457) <= couch_db_updater:flush_trees/3(line:373) <= couch_db_updater:update_docs_int/4(line:718) <= couch_db_updater:handle_info/2(line:183)
  last msg: redacted
     state: {db,1,<<"shards/e0000000-ffffffff/hzd.1727710380">>,"./data/shards/e0000000-ffffffff/hzd.1727710380.couch",{couch_bt_engine,{st,"./data/shards/e0000000-ffffffff/hzd.1727710380.couch",<0.19406.101>,#Ref<0.3603940510.502005771.203208>,undefined,{db_header,8,30406,0,{9450247660,{29670,687,{size_info,9279630171,9278136634}},12600491},{9450249167,30357,11927090},{9448039553,[],2388},nil,nil,4251,1000,<<"2719778795232e78e860e5e8ab70c794">>,[{nonode@nohost,0}],0,1000,0},false,{btree,<0.19406.101>,{9450247660,{29670,687,{size_info,9279630171,9278136634}},12600491},fun couch_bt_engine:id_tree_split/1,fun couch_bt_engine:id_tree_join/2,undefined,fun couch_bt_engine:id_tree_reduce/2,snappy},{btree,<0.19406.101>,{9450249167,30357,11927090},fun couch_bt_engine:seq_tree_split/1,fun couch_bt_engine:seq_tree_join/2,undefined,fun couch_bt_engine:seq_tree_reduce/2,snappy},{btree,<0.19406.101>,{9448039553,[],2388},fun couch_bt_engine:local_tree_split/1,fun couch_bt_engine:local_tree_join/2,undefined,nil,snappy},snappy,{btree,<0.19406.101>,nil,fun couch_bt_engine:purge_tree_split/1,fun couch_bt_engine:purge_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy},{btree,<0.19406.101>,nil,fun couch_bt_engine:purge_seq_tree_split/1,fun couch_bt_engine:purge_seq_tree_join/2,undefined,fun couch_bt_engine:purge_tree_reduce/2,snappy}}},<0.18010.101>,nil,30406,<<"1727712856444764">>,{user_ctx,null,[],undefined},[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}],[#Fun<couch_doc.7.91987333>],nil,nil,undefined,[{default_security_object,[{<<"members">>,{[{<<"roles">>,[<<"_admin">>]}]}},{<<"admins">>,{[{<<"roles">>,[<<"_admin">>]}]}}]},replicated_changes,{user_ctx,{user_ctx,<<"groot">>,[<<"_admin">>],<<"cookie">>}},{w,"1"},{props,[{partitioned,true},{hash,[couch_partition,hash,[]]}]}],undefined}
    extra: []
[error] 2024-09-30T16:14:19.575128Z nonode@nohost <0.14636.101> -------- Replicator, request POST to "http://localhost:5984/hzd/_bulk_docs" failed due to error {code,500}
[error] 2024-09-30T16:14:19.575198Z nonode@nohost <0.18010.101> -------- CRASH REPORT Process  (<0.18010.101>) with 0 neighbors crashed with reason: no match of right hand value {error,enospc} at couch_bt_engine:write_doc_body/2(line:439) <= couch_db_updater:'-flush_trees/3-fun-0-'/6(line:384) <= couch_key_tree:mapfold_simple/4(line:464) <= couch_key_tree:mapfold_simple/4(line:473)

SourceR85 commented 1 month ago

My fault: I'm using Docker Desktop, where the maximum storage capacity was globally set to 100 GB, and the source (CouchDB 3.3.3) is running in parallel so I can replicate from it... I had assumed I was running Docker without limits.
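
A rough budget check shows why the writes failed despite the 799.7 GB free on the host, assuming both containers share Docker Desktop's 100 GB limit and the replica grows to roughly the size of the 86.1 GB source:

```latex
\underbrace{86.1\ \text{GB}}_{\text{source, CouchDB 3.3.3}}
+ \underbrace{\approx 86.1\ \text{GB}}_{\text{replica, CouchDB 3.4.1}}
\approx 172.2\ \text{GB} > 100\ \text{GB}
```

Images, logs, and any view indexes only add to the left-hand side, so under these assumptions the replication would hit enospc well before finishing.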

So nickva spotted it right in his first comment:

enospc from no match of right hand value {error,enospc} indicates we're probably running out of disk space [1]

It should be a more friendly message in the log, but at least at first sight that's what's jumping out.

[1] https://www.man7.org/linux/man-pages/man3/errno.3.html

Two ideas for improvement came out of my mistake:

  1. A more user-friendly error message than {error,enospc}.
  2. Quit CouchDB on that error (health checks keep passing as long as the endpoints are reachable), or report an unhealthy status from the _up endpoint (507 Insufficient Storage may fit this purpose).

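The second idea could be approximated outside CouchDB. A minimal Python sketch, not CouchDB's _up implementation: the helper names and the 90% threshold are hypothetical, and it simply maps disk usage on the data directory's file system to 200 or 507 Insufficient Storage:

```python
import shutil
from dataclasses import dataclass


@dataclass
class DiskHealth:
    http_status: int  # 200 = healthy, 507 = Insufficient Storage
    used_percent: float


def classify_disk_usage(total: int, free: int, max_used_percent: float = 90.0) -> DiskHealth:
    """Pure decision logic. Everything not free counts as used (including
    blocks reserved for root), so the check trips slightly early."""
    if total <= 0:
        raise ValueError("total must be positive")
    used_percent = 100.0 * (total - free) / total
    status = 507 if used_percent >= max_used_percent else 200
    return DiskHealth(status, used_percent)


def disk_health(data_dir: str, max_used_percent: float = 90.0) -> DiskHealth:
    """Check the file system that holds `data_dir`."""
    usage = shutil.disk_usage(data_dir)
    return classify_disk_usage(usage.total, usage.free, max_used_percent)
```

A container health check (for example a Docker HEALTHCHECK script) could run this next to a request to _up and mark the container unhealthy on 507.
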
nickva commented 1 month ago

No worries at all, thanks for reaching out.

Yeah, agree a more friendly error would be nice in the logs.

And it turns out we do have a disk monitor now in 3.4 (the work of @rnewson)!

Docs: https://docs.couchdb.org/en/stable/config/disk-monitor.html. If you configure it, it will stop indexing when approaching the limit and return a meaningful API error.

See https://github.com/apache/couchdb/pull/4681 for the PR comments and the implementation.