Paging @kocolosk once again; this feels like a problem with a Docker image. CouchDB in the default image should be flushing to disk after every operation; is it possible that Docker doesn't allow that and this is a virtualisation error?
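For context, the flush-after-every-operation behaviour corresponds to CouchDB's delayed_commits setting; a sketch of the relevant default, as I understand the 2.x default.ini:
; CouchDB 2.x default: delayed commits are disabled, i.e. every
; document update is fsynced to disk before the client gets a response.
[couchdb]
delayed_commits = false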
For others, so you don't have to sign up for a JIRA account, here's the text (2600/teletype):
On Debian 8.6 under an OpenVZ container, a fax attachment cannot be attached to an email. The root of the issue is that a call to "file:read_file(ToFile)" returns enoent. Fragment of strace command output:
stat("/tmp/a0b205e547697bcd.pdf", 0x7f5a29a1ed00) = -1 ENOENT (No such file or directory)
I think OTP has a bug that appears only on Debian 8.6 virtualized under OpenVZ on Proxmox 3.4 (two years old). Tested command: "tiff2pdf -o ~s ~s &> /dev/null && echo -n \"success\""
on OTP-18.3.4.5
@sergey-safarov This doesn't directly apply because the operation to open a CouchDB shard file is not an os:cmd operation, but an internal Erlang operation.
You may be right that in your specific configuration there is an issue, but the above information points more toward a very slow disk (being unable to open a file for 4.0 seconds is very, very slow).
Are you able to reproduce this issue when you are not running under Docker and Proxmox? We see sporadic reports of this type of behaviour and don't really have a good way to resolve them. Increasing this timeout as a default could lead to very bad performance for CouchDB in other situations; putting every single timeout into local.ini is not a good solution, either.
Hello @wohali. About the slow disk: the Docker files are located on a RAID10 array built from 6 SATA disks. A short test demonstrates that the partition in use is not slow.
[root@node0 backup]# date; dd if=/dev/zero of=/var/lib/docker/volumes/test.img bs=1G count=10; date
Thu Feb 1 20:47:31 GMT 2018
10+0 records in
10+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 5.62274 s, 1.9 GB/s
Thu Feb 1 20:47:36 GMT 2018
[root@node0 backup]#
I cannot use CouchDB 2.1.1; issue https://github.com/apache/couchdb/issues/1119 is blocking me.
@wohali, how can I measure file-opening operations for CouchDB? Maybe via the logs?
It's not about the speed of the disk, it's about the latency of disk operations from inside of the Docker image.
Your log shows that, on occasion, the Docker environment has very slow disk performance.
CouchDB performance is directly related to the latency of disk I/O operations.
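One way to observe this directly is to time the open calls themselves; a rough sketch, assuming strace is available on the host and the CouchDB VM runs as beam.smp (adjust both for your setup):
# Hypothetical example: attach to the running node and print per-syscall
# timings (-T) for file opens, filtered to shard files.
PID=$(pgrep -o beam.smp)
strace -f -T -e trace=open,openat -p "$PID" 2>&1 | grep '\.couch'
Slow opens then show up as large values in the trailing <...> durations.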
Have you tried using the devicemapper storage driver on real block devices rather than loopback-mounted block devices? See https://github.com/moby/moby/issues/21485 for more information:
Joan (@wohali), the provided link does not use volumes; it uses a disk image via loopback plus the Docker filesystem overlay. Your example is not applicable to this ticket.
My host runs the Fedora 27 operating system + Docker + the overlay2 storage driver. The recommendation for Fedora is:
Linux distribution: Fedora (Docker CE)
Recommended storage drivers: devicemapper; overlay2 (Fedora 26 or later, experimental); overlay (experimental); vfs
When possible, overlay2 is the recommended storage driver. When installing Docker for the first time, overlay2 is used by default.
But I use a volume as the permanent store for the CouchDB data, and according to this page it does not matter which storage driver is used:
A data volume is a directory or file in the Docker host’s filesystem that is mounted directly into a container. Data volumes are not controlled by the storage driver. Reads and writes to data volumes bypass the storage driver and operate at native host speeds.
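To double-check that, a small sketch, assuming the container is named couchdb1 as in the run command later in this thread: confirm the storage driver and that the data path really is a volume rather than overlay storage.
# Hypothetical check: show the active storage driver and the container's mounts.
docker info --format '{{.Driver}}'
docker inspect --format '{{json .Mounts}}' couchdb1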
I have googled for slow disk operations inside Docker containers and found only reports related to Docker for macOS (not Linux).
I have also run the test inside a Docker container:
[root@node0 ~]# docker run --rm -it -v test:/home fedora:27 /bin/sh
sh-4.4# date; dd if=/dev/zero of=/home/test.img bs=1G count=10; date
Fri Feb 2 02:19:51 UTC 2018
10+0 records in
10+0 records out
10737418240 bytes (11 GB, 10 GiB) copied, 5.12558 s, 2.1 GB/s
Fri Feb 2 02:19:56 UTC 2018
sh-4.4#
The results are the same as the test on the host operating system.
Joan (@wohali), my mistake. The link you provided is applicable. I will make an additional test.
Joan (@wohali), first I measured disk I/O with tmpfs mounted as a volume inside the container:
[root@node0 ~]# docker run --rm --read-only -it --mount type=tmpfs,destination=/opt/couchdb/data fedora:27 bash
[root@c0ed791b0e7d /]# rm -f /workspace/* && time dd if=/dev/zero of=/opt/couchdb/data/test_container1.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0.00277375 s, 185 MB/s
real 0m0.004s
user 0m0.003s
sys 0m0.001s
[root@c0ed791b0e7d /]# rm -f /workspace/* && time dd if=/dev/zero of=/opt/couchdb/data/test_container1.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0.00277794 s, 184 MB/s
real 0m0.004s
user 0m0.001s
sys 0m0.003s
[root@c0ed791b0e7d /]# rm -f /workspace/* && time dd if=/dev/zero of=/opt/couchdb/data/test_container1.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0.00283199 s, 181 MB/s
real 0m0.004s
user 0m0.001s
sys 0m0.003s
[root@c0ed791b0e7d /]# rm -f /workspace/* && time dd if=/dev/zero of=/opt/couchdb/data/test_container1.img bs=512 count=1000 oflag=dsync
1000+0 records in
1000+0 records out
512000 bytes (512 kB, 500 KiB) copied, 0.00279983 s, 183 MB/s
real 0m0.004s
user 0m0.001s
sys 0m0.003s
Then I started the CouchDB container with tmpfs mounted as a volume, adding the argument --mount type=tmpfs,destination=/opt/couchdb/data:
docker run -t --rm=true --log-driver=none --network kazoo --name couchdb1 \
--hostname couchdb1.kazoo \
--ip 10.0.9.8 \
--ulimit nofile=999999 \
--mount type=tmpfs,destination=/opt/couchdb/data \
-v /etc/kazoo/couchdb/vm.args.node1:/opt/couchdb/etc/vm.args \
apache/couchdb:2.1.1
vm.args is the same as the default, with the -name option added:
[root@node0 ~]# docker exec -it couchdb1 cat /opt/couchdb/etc/vm.args
# Licensed under the Apache License, Version 2.0 (the "License"); you may not
# use this file except in compliance with the License. You may obtain a copy of
# the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
# License for the specific language governing permissions and limitations under
# the License.
# Ensure that the Erlang VM listens on a known port
-kernel inet_dist_listen_min 9100
-kernel inet_dist_listen_max 9100
# Tell kernel and SASL not to log anything
-kernel error_logger silent
-sasl sasl_error_logger false
# Use kernel poll functionality if supported by emulator
+K true
# Start a pool of asynchronous IO threads
+A 16
# Comment this line out to enable the interactive Erlang shell on startup
+Bd -noinput
-setcookie monster
-name couchdb@couchdb1.kazoo
[root@node0 ~]#
Now the disk layout inside the container:
[root@node0 ~]# docker exec -it couchdb1 df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 268304384 76159032 192145352 29% /
tmpfs 65536 0 65536 0% /dev
tmpfs 65968072 0 65968072 0% /sys/fs/cgroup
/dev/mapper/vgRAID10-root 2086912 157692 1929220 8% /etc/couchdb
/dev/mapper/vgRAID10-docker 268304384 76159032 192145352 29% /etc/hosts
shm 65536 0 65536 0% /dev/shm
tmpfs 65968072 48 65968024 1% /opt/couchdb/data
tmpfs 65968072 0 65968072 0% /proc/scsi
tmpfs 65968072 0 65968072 0% /sys/firmware
Host total memory: the host has 128 GB of memory and uses about 160 MB of swap.
top - 01:41:43 up 17 days, 15:46, 4 users, load average: 1.78, 1.54, 0.87
Tasks: 900 total, 1 running, 283 sleeping, 0 stopped, 486 zombie
%Cpu(s): 0.8 us, 0.8 sy, 0.0 ni, 98.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 13193614+total, 12358560 free, 2934528 used, 11664305+buff/cache
KiB Swap: 31457280+total, 31441177+free, 161036 used. 12743585+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1332 root 20 0 1469936 452596 4084 S 6.3 0.3 587:28.21 beam.smp
14915 root 20 0 1297128 295484 3928 S 5.6 0.2 510:51.97 beam.smp
20715 root 20 0 54392 5120 3812 R 0.7 0.0 0:00.09 top
Then I started replication, and after 5 hours I have the following results. Disk layout inside the CouchDB container; the CouchDB data size is about 10 GB:
[root@node0 ~]# docker exec -it couchdb1 df
Filesystem 1K-blocks Used Available Use% Mounted on
overlay 268304384 76134184 192170200 29% /
tmpfs 65536 0 65536 0% /dev
tmpfs 65968072 0 65968072 0% /sys/fs/cgroup
/dev/mapper/vgRAID10-root 2086912 157692 1929220 8% /etc/couchdb
/dev/mapper/vgRAID10-docker 268304384 76134184 192170200 29% /etc/hosts
shm 65536 0 65536 0% /dev/shm
tmpfs 65968072 10083456 55884616 16% /opt/couchdb/data
tmpfs 65968072 0 65968072 0% /proc/scsi
tmpfs 65968072 0 65968072 0% /sys/firmware
Host memory usage: swap usage has not changed, and all CouchDB data is located in memory.
top - 06:35:51 up 17 days, 20:40, 6 users, load average: 1.03, 0.95, 0.82
Tasks: 938 total, 1 running, 322 sleeping, 0 stopped, 486 zombie
%Cpu(s): 1.4 us, 1.1 sy, 0.0 ni, 97.2 id, 0.1 wa, 0.1 hi, 0.1 si, 0.0 st
KiB Mem : 13193614+total, 87901344 free, 3644528 used, 40390272 buff/cache
KiB Swap: 31457280+total, 31441177+free, 161036 used. 11659586+avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5269 root 20 0 56704 5360 3876 R 11.8 0.0 0:00.03 top
14915 root 20 0 1315680 307456 5852 S 11.8 0.2 550:18.26 beam.smp
1332 root 20 0 1463408 452628 5444 S 5.9 0.3 615:53.69 beam.smp
23036 root 20 0 1859228 63316 31484 S 5.9 0.0 30:43.66 dockerd
After 5 hours of replication to a database located on a memory disk (tmpfs), CouchDB has the following errors in its log.
Could not open shards (2 errors in total):
2ea5274cd52c5bf0-201802/_local/75d3dc6ddfdfcae76e87a7acdfc5ed89 201 ok 42
Feb 02 03:50:57 node0.docker.rcsnet.ru sh[28220]: [notice] 2018-02-02T03:50:57.866252Z couchdb@couchdb1.kazoo <0.357.0> -------- couch_replicator_scheduler: Job {"75d3dc6ddfdfcae76e87a7acdfc5ed89","+create_target"} completed normally
Feb 02 03:50:57 node0.docker.rcsnet.ru sh[28220]: [notice] 2018-02-02T03:50:57.866613Z couchdb@couchdb1.kazoo <0.22131.14> 8a931c0f9d 10.0.9.8:5984 10.0.9.8 undefined POST /_replicate 200 ok 2805
Feb 02 03:50:57 node0.docker.rcsnet.ru sh[28220]: [notice] 2018-02-02T03:50:57.916222Z couchdb@couchdb1.kazoo <0.357.0> -------- couch_replicator_scheduler: Job {"80f52a6352af0019a08ff5bf307e6192","+create_target"} started as <0.10587.13>
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [error] 2018-02-02T03:50:59.599271Z couchdb@couchdb1.kazoo <0.14885.14> 4c046843ba Request to create N=3 DB but only 1 node(s)
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [notice] 2018-02-02T03:50:59.629560Z couchdb@couchdb1.kazoo <0.14885.14> 4c046843ba 10.0.9.8:5984 10.0.9.8 undefined PUT /account%2f94%2f93%2f45d6587d8f195f1620b65b1eb063-201802/ 201 ok 32
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [error] 2018-02-02T03:50:59.655204Z couchdb@couchdb1.kazoo <0.2117.13> -------- Could not open file ./data/shards/60000000-7fffffff/account/94/93/45d6587d8f195f1620b65b1eb063-201802.1517543459.couch: no such file or directory
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [error] 2018-02-02T03:50:59.655216Z couchdb@couchdb1.kazoo <0.2121.13> -------- Could not open file ./data/shards/40000000-5fffffff/account/94/93/45d6587d8f195f1620b65b1eb063-201802.1517543459.couch: no such file or directory
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [info] 2018-02-02T03:50:59.666163Z couchdb@couchdb1.kazoo <0.206.0> -------- open_result error {not_found,no_db_file} for shards/60000000-7fffffff/account/94/93/45d6587d8f195f1620b65b1eb063-201802.1517543459
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [info] 2018-02-02T03:50:59.666210Z couchdb@couchdb1.kazoo <0.206.0> -------- open_result error {not_found,no_db_file} for shards/40000000-5fffffff/account/94/93/45d6587d8f195f1620b65b1eb063-201802.1517543459
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [warning] 2018-02-02T03:50:59.666204Z couchdb@couchdb1.kazoo <0.2139.13> d9f2f7b03a creating missing database: shards/60000000-7fffffff/account/94/93/45d6587d8f195f1620b65b1eb063-201802.1517543459
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [warning] 2018-02-02T03:50:59.666235Z couchdb@couchdb1.kazoo <0.2149.13> d9f2f7b03a creating missing database: shards/40000000-5fffffff/account/94/93/45d6587d8f195f1620b65b1eb063-201802.1517543459
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [error] 2018-02-02T03:50:59.666644Z couchdb@couchdb1.kazoo <0.303.0> -------- mem3_shards tried to create shards/40000000-5fffffff/account/94/93/45d6587d8f195f1620b65b1eb063-201802.1517543459, got file_exists
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [error] 2018-02-02T03:50:59.932281Z couchdb@couchdb1.kazoo <0.16410.7> -------- rexi_server: from: couchdb@couchdb1.kazoo(<0.16407.7>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,308}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,642}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
Feb 02 03:50:59 node0.docker.rcsnet.ru sh[28220]: [error] 2018-02-02T03:50:59.932290Z couchdb@couchdb1.kazoo <0.16409.7> -------- rexi_server: from: couchdb@couchdb1.kazoo(<0.16407.7>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,308}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,642}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
Replicator error: cannot PUT a document (about 20 errors):
Feb 02 04:04:51 node0.docker.rcsnet.ru sh[28220]: [error] 2018-02-02T04:04:51.673222Z couchdb@couchdb1.kazoo <0.8179.40> -------- rexi_server: from: couchdb@couchdb1.kazoo(<0.10464.36>) mfa: fabric_rpc:all_docs/3 exit:timeout [{rexi,init_stream,1,[{file,"src/rexi.erl"},{line,256}]},{rexi,stream2,3,[{file,"src/rexi.erl"},{line,204}]},{fabric_rpc,view_cb,2,[{file,"src/fabric_rpc.erl"},{line,308}]},{couch_mrview,finish_fold,2,[{file,"src/couch_mrview.erl"},{line,642}]},{rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [error] 2018-02-02T04:04:55.012851Z couchdb@couchdb1.kazoo emulator -------- Error in process <0.10651.59> on node 'couchdb@couchdb1.kazoo' with exit value: {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,613}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,664}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},...
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [info] 2018-02-02T04:04:55.013064Z couchdb@couchdb1.kazoo <0.353.0> -------- Replication connection to: "10.0.9.8":5984 died with reason {{nocatch,{mp_parser_died,noproc}},[{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,613}]},{couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,664}]},{couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,617}]},{couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [error] 2018-02-02T04:04:55.014366Z couchdb@couchdb1.kazoo <0.7570.55> 8967ff8316 req_err(4199105376) badmatch : ok
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [<<"chttpd_db:db_doc_req/3 L782">>,<<"chttpd:process_request/1 L295">>,<<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,<<"proc_lib:init_p_do_apply/3 L237">>]
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [notice] 2018-02-02T04:04:55.014881Z couchdb@couchdb1.kazoo <0.7570.55> 8967ff8316 10.0.9.8:5984 10.0.9.8 undefined PUT /account%2f0e%2fe2%2fb3fccd72d86299cf8f66f6caa6bd-201801/201801-b089783f4734b57244ea8b5613f8375a?new_edits=false 500 ok 2
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [error] 2018-02-02T04:04:55.015761Z couchdb@couchdb1.kazoo <0.5748.60> -------- Replicator, request PUT to "http://10.0.9.8:5984/account%2f0e%2fe2%2fb3fccd72d86299cf8f66f6caa6bd-201801/201801-b089783f4734b57244ea8b5613f8375a?new_edits=false" failed due to error {error,
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: {'EXIT',
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: {{{nocatch,{mp_parser_died,noproc}},
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [{couch_att,'-foldl/4-fun-0-',3,
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [{file,"src/couch_att.erl"},{line,613}]},
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: {couch_att,fold_streamed_data,4,
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [{file,"src/couch_att.erl"},{line,664}]},
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: {couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,617}]},
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: {couch_httpd_multipart,atts_to_mp,4,
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [{file,"src/couch_httpd_multipart.erl"},{line,208}]}]},
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: {gen_server,call,
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: [<0.13476.55>,
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: {send_req,
Feb 02 04:04:55 node0.docker.rcsnet.ru sh[28220]: {{url,
Regarding the replication error, the important detail is the server IP 10.0.9.8. This IP is assigned to the destination CouchDB server.
Joan (@wohali), I think the issue is not related to disk I/O speed inside the container. The test was made on a RAM disk (tmpfs) with a speed of about 180 MB/s. That value is far greater than on a real hard disk.
OK, then I do not know. Someone else will have to help you.
Hello Joan (@wohali), the ticket is marked as "waiting for user". Could you write what needs to be done from my side?
We’ll have to wait for #1245 to land before we can debug this further, unless anyone here is willing to run this commit in the environment described above.
Can you retry this with 2.2.0 when it comes out? Closing this out in the meantime.
Is the working theory still that this is Docker-related? I'm unaware of any virtualization going on with Docker on Linux (it's kernel namespacing, and we've verified by hand in C that normal file operations such as open(2), write(2), read(2), and fsync(2) work as expected in-container and perform at native speeds).
We're experiencing this issue and cannot come up with a theory as to why a database-level RPC failure would be related to Docker's namespacing/isolation.
Is there any more information available here, or could you perhaps share any working theories on what specifically might be causing this and what subsystem(s) (e.g. at the CouchDB level, at the Erlang level, at the system level) you currently suspect?
Thanks in advance.
@browndav If you could try the same scenario outside of Docker to rule out it being Docker-specific, that would be great.
There are continuing issues around Docker and CouchDB that are ill-defined, especially when it comes to AUFS (see thread above) so we prefer to rule that out first.
If you're unable to rule this out with 2.2.0 outside of Docker, I'm afraid it may be some time before we have any solid answer here.
@browndav Looked at your linked issue and see it's not exactly the same as the OP.
FYI, rexi timeouts are a more general error indicating that you are overloading your cluster. I recommend increasing the number of nodes, CPU cores, and RAM as well.
The instance and cluster are essentially idle (load average 0.01) when we're encountering these errors, and they're occurring seconds after startup. Further, even if the server were loaded, it doesn't seem like a database system should fail purely because of increased I/O latency or availability of scheduler timeslices – in my past experience with database systems, overloaded machines would continue to attempt the work they were asked to perform rather than aborting or crashing.
There are a lot of threads above: would you mind linking to any of the specific evidence/discussion/theories that would implicate AUFS as a potential cause? In the mean time we'll work to reproduce the issue in a variety of environments and on a variety of filesystems, and can definitely help rule out (or in) filesystem-specific issues from these tests.
We're extremely interested in getting to the bottom of this issue, as our production infrastructure's reliability hinges on the stability of CouchDB.
FWIW, we're seeing the identical issue with our CouchDB datastore placed on bind mounts and non-AUFS Docker volumes. Unless the issue has to do with the CouchDB binaries and/or Erlang itself being on AUFS, I'd be skeptical of the filesystem-specific theory.
@browndav as stated, rexi timeouts are a more general error indicating that you are overloading your cluster. I recommend increasing the number of nodes, CPU cores, and RAM as well.
Sorry, this is something I don't have any more time for in open-source CouchDB support. If you need a paid support offering for CouchDB, those are available from various third parties.
Again, our server is idle. If there are metrics you require to confirm that, we'd be happy to supply them, but our server is not "overloaded", nor should a database system render itself unavailable due to hypothetical load.
It's also unclear why you're attempting to divert a community bug report to commercial support. If this project has relationships with commercial support companies, please disclose them here so we know what we're dealing with.
Thanks.
@browndav No relationships - just that I don't have more time to assist you.
Someone else will have to assist at this point.
@browndav did you try increasing the fabric timeout?
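(A hedged sketch of what such an override might look like in local.ini; the [fabric] request_timeout key and the millisecond unit are assumptions based on later CouchDB documentation, so verify against the configuration reference for your version:)
; Hypothetical override: raise the fabric RPC timeout (in milliseconds).
[fabric]
request_timeout = 120000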
@browndav To make it clear, I tend to be the first line of issue triaging for CouchDB. I don't have more time to dig into your issue, sorry. It could be a while before someone else looks at this. So, if you need an answer quickly, guaranteed, going with a paid offering from one of the available vendors would get you the support you need. Otherwise you'll have to wait until someone comes along to assist.
Sorry I couldn't be of more assistance here.
2.2.0 is out now and has additional error logging for this issue. I'm happy to look at fresh logs.
We went back to first principles and attempted to set up CouchDB 2.2.0 manually, per the non-wizard instructions at http://docs.couchdb.org/en/latest/install/setup.html. We now have a different error. The machine's load average is zero; it is idle.
The CouchDB manual suggests that we create the _users database once the system comes up. Unfortunately, CouchDB begins looking for the _users database immediately, before we even have time to set up the required system databases per the instructions. The database system appears unusable from that point on. We cannot get a 200 OK on the HTTP PUT for _users at any point in the process, but will obviously repeat tests across all platforms we've tried today.
[info] 2018-08-15T15:19:35.688656Z couchdb@127.0.0.1 <0.11.0> -------- Application couch_log started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:35.693970Z couchdb@127.0.0.1 <0.11.0> -------- Application folsom started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:35.765819Z couchdb@127.0.0.1 <0.11.0> -------- Application couch_stats started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:35.765982Z couchdb@127.0.0.1 <0.11.0> -------- Application khash started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:35.777264Z couchdb@127.0.0.1 <0.11.0> -------- Application couch_event started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:35.777414Z couchdb@127.0.0.1 <0.11.0> -------- Application hyper started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:35.786869Z couchdb@127.0.0.1 <0.11.0> -------- Application ibrowse started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:35.795983Z couchdb@127.0.0.1 <0.11.0> -------- Application ioq started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:35.796157Z couchdb@127.0.0.1 <0.11.0> -------- Application mochiweb started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:35.809087Z couchdb@127.0.0.1 <0.211.0> -------- Apache CouchDB 2.2.0 is starting.
[info] 2018-08-15T15:19:35.809156Z couchdb@127.0.0.1 <0.212.0> -------- Starting couch_sup
[notice] 2018-08-15T15:19:35.822077Z couchdb@127.0.0.1 <0.99.0> -------- config: [features] pluggable-storage-engines set to true for reason nil
[notice] 2018-08-15T15:19:35.871635Z couchdb@127.0.0.1 <0.99.0> -------- config: [couchdb] uuid set to e9dd318e72cd3eff60a52da3b2e3df1b for reason nil
[info] 2018-08-15T15:19:35.929686Z couchdb@127.0.0.1 <0.217.0> -------- open_result error {not_found,no_db_file} for _users
[info] 2018-08-15T15:19:35.985482Z couchdb@127.0.0.1 <0.211.0> -------- Apache CouchDB has started. Time to relax.
[info] 2018-08-15T15:19:35.985561Z couchdb@127.0.0.1 <0.211.0> -------- Apache CouchDB has started on http://127.0.0.1:5986/
[info] 2018-08-15T15:19:35.985846Z couchdb@127.0.0.1 <0.11.0> -------- Application couch started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:35.986067Z couchdb@127.0.0.1 <0.11.0> -------- Application ets_lru started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.001712Z couchdb@127.0.0.1 <0.11.0> -------- Application rexi started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.013619Z couchdb@127.0.0.1 <0.217.0> -------- open_result error {not_found,no_db_file} for _nodes
[warning] 2018-08-15T15:19:36.013665Z couchdb@127.0.0.1 <0.287.0> -------- creating missing database: _nodes
[info] 2018-08-15T15:19:36.041166Z couchdb@127.0.0.1 <0.217.0> -------- open_result error {not_found,no_db_file} for _dbs
[warning] 2018-08-15T15:19:36.041202Z couchdb@127.0.0.1 <0.301.0> -------- creating missing database: _dbs
[warning] 2018-08-15T15:19:36.041242Z couchdb@127.0.0.1 <0.300.0> -------- creating missing database: _dbs
[info] 2018-08-15T15:19:36.047811Z couchdb@127.0.0.1 <0.11.0> -------- Application mem3 started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.048139Z couchdb@127.0.0.1 <0.11.0> -------- Application fabric started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.068172Z couchdb@127.0.0.1 <0.11.0> -------- Application chttpd started on node 'couchdb@127.0.0.1'
[notice] 2018-08-15T15:19:36.079238Z couchdb@127.0.0.1 <0.338.0> -------- chttpd_auth_cache changes listener died database_does_not_exist at mem3_shards:load_shards_from_db/6(line:395) <= mem3_shards:load_shards_from_disk/1(line:370) <= mem3_shards:load_shards_from_disk/2(line:399) <= mem3_shards:for_docid/3(line:86) <= fabric_doc_open:go/3(line:38) <= chttpd_auth_cache:ensure_auth_ddoc_exists/2(line:187) <= chttpd_auth_cache:listen_for_changes/1(line:134)
[error] 2018-08-15T15:19:36.079511Z couchdb@127.0.0.1 emulator -------- Error in process <0.339.0> on node 'couchdb@127.0.0.1' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,"_users",[{file,"src/mem3_shards.erl"},{line,395}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,370}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,399}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,86}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,38}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,187}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,134}]}]}
[info] 2018-08-15T15:19:36.084896Z couchdb@127.0.0.1 <0.11.0> -------- Application couch_index started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.085009Z couchdb@127.0.0.1 <0.11.0> -------- Application couch_mrview started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.085056Z couchdb@127.0.0.1 <0.11.0> -------- Application couch_plugins started on node 'couchdb@127.0.0.1'
[notice] 2018-08-15T15:19:36.114719Z couchdb@127.0.0.1 <0.99.0> -------- config: [features] scheduler set to true for reason nil
[info] 2018-08-15T15:19:36.134469Z couchdb@127.0.0.1 <0.217.0> -------- open_result error {not_found,no_db_file} for _replicator
[notice] 2018-08-15T15:19:36.138139Z couchdb@127.0.0.1 <0.355.0> -------- creating replicator ddoc <<"_replicator">>
[info] 2018-08-15T15:19:36.153545Z couchdb@127.0.0.1 <0.11.0> -------- Application couch_replicator started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.162640Z couchdb@127.0.0.1 <0.11.0> -------- Application couch_peruser started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.175302Z couchdb@127.0.0.1 <0.11.0> -------- Application ddoc_cache started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.197073Z couchdb@127.0.0.1 <0.11.0> -------- Application global_changes started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.197097Z couchdb@127.0.0.1 <0.11.0> -------- Application jiffy started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.203955Z couchdb@127.0.0.1 <0.11.0> -------- Application mango started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.212173Z couchdb@127.0.0.1 <0.11.0> -------- Application setup started on node 'couchdb@127.0.0.1'
[info] 2018-08-15T15:19:36.212221Z couchdb@127.0.0.1 <0.11.0> -------- Application snappy started on node 'couchdb@127.0.0.1'
[notice] 2018-08-15T15:19:41.082305Z couchdb@127.0.0.1 <0.338.0> -------- chttpd_auth_cache changes listener died database_does_not_exist at mem3_shards:load_shards_from_db/6(line:395) <= mem3_shards:load_shards_from_disk/1(line:370) <= mem3_shards:load_shards_from_disk/2(line:399) <= mem3_shards:for_docid/3(line:86) <= fabric_doc_open:go/3(line:38) <= chttpd_auth_cache:ensure_auth_ddoc_exists/2(line:187) <= chttpd_auth_cache:listen_for_changes/1(line:134)
[error] 2018-08-15T15:19:41.082706Z couchdb@127.0.0.1 emulator -------- Error in process <0.477.0> on node 'couchdb@127.0.0.1' with exit value:
{database_does_not_exist,[{mem3_shards,load_shards_from_db,"_users",[{file,"src/mem3_shards.erl"},{line,395}]},{mem3_shards,load_shards_from_disk,1,[{file,"src/mem3_shards.erl"},{line,370}]},{mem3_shards,load_shards_from_disk,2,[{file,"src/mem3_shards.erl"},{line,399}]},{mem3_shards,for_docid,3,[{file,"src/mem3_shards.erl"},{line,86}]},{fabric_doc_open,go,3,[{file,"src/fabric_doc_open.erl"},{line,38}]},{chttpd_auth_cache,ensure_auth_ddoc_exists,2,[{file,"src/chttpd_auth_cache.erl"},{line,187}]},{chttpd_auth_cache,listen_for_changes,1,[{file,"src/chttpd_auth_cache.erl"},{line,134}]}]}
The error messages you see in that logfile are normal and are not the database "crashing." Proceed to join the nodes and create the _users database once you've joined the nodes together.
If you create the _users database prior to joining the nodes together, the database will not be distributed across those additional nodes and you will be placing all of the _users database onto a single node.
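(For completeness, creating each system database is a single PUT; a minimal sketch assuming an admin user "admin" with password "password" on 127.0.0.1:5984:)
# Run once the cluster is joined (credentials here are placeholders).
curl -X PUT http://admin:password@127.0.0.1:5984/_users
curl -X PUT http://admin:password@127.0.0.1:5984/_replicator
curl -X PUT http://admin:password@127.0.0.1:5984/_global_changes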
Thanks for your help. Given that this isn't fabric-related anymore and we've managed to mitigate our failed PUT request issue (delay inserted; we may have been racing with some component during startup; who knows), I'm happy to drop this. If we see fabric-related issues under 2.2.0+ we'll open an issue.
Thanks again.
@browndav Sorry for the frustration in resolving this issue on your behalf.
(In a single node cluster, just create the databases, no need to join nodes together.)
Expected Behavior
CouchDB must open shards without any error.
Current Behavior
When replicating a database from another server to a standalone CouchDB, from time to time I see errors like this:
Possible Solution
Not sure, but this may be a thread sync issue. A similar error is described here for another Erlang program: link 1, link2.
Steps to Reproduce (for bugs)
Context
I want to migrate data from another CouchDB server.
Your Environment