apache / couchdb-helm

Apache CouchDB Helm Chart
https://couchdb.apache.org/
Apache License 2.0

Coordinator node regularly restarts in 3 node cluster #90

Open cg1972 opened 2 years ago

cg1972 commented 2 years ago

Describe the bug: We used the Helm chart to install a 3-node CouchDB cluster. One of the nodes in the cluster (the coordinator node) restarts on a regular basis, usually once a day.

The couchdb pod reports the error "Container couchdb failed liveness probe, will be restarted".

The couchdb logs indicate the following errors: `[notice] 2022-06-23T06:23:43.733589Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30690.22> b938e4d3fa 192.168.230.108:5984 10.1.2.179 undefined GET /_up 200 ok 29613 [notice] 2022-06-23T06:23:45.153723Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30707.22> 7bc4d55e3e 192.168.230.108:5984 10.1.2.179 undefined GET /_up 200 ok 14872 [notice] 2022-06-23T06:23:45.154154Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30697.22> d48973481d 192.168.230.108:5984 10.1.2.179 undefined GET /_up 200 ok 13 [error] 2022-06-23T06:23:45.273373Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30706.22> 45a001ddb1 req_err(2751202856) timeout : The request could not be processed in a reasonable amount of time. [<<"gen_server:call/2 L238">>,<<"chttpd_misc:handle_up_req/1 L274">>,<<"chttpd:handle_req_after_auth/2 L327">>,<<"chttpd:process_request/1 L310">>,<<"chttpd:handle_request_int/1 L249">>,<<"mochiweb_http:headers/6 L150">>,<<"proc_lib:init_p_do_apply/3 L226">>] [error] 2022-06-23T06:23:45.274458Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30691.22> c975e456eb req_err(2751202856) timeout : The request could not be processed in a reasonable amount of time. 
[<<"gen_server:call/2 L238">>,<<"chttpd_misc:handle_up_req/1 L274">>,<<"chttpd:handle_req_after_auth/2 L327">>,<<"chttpd:process_request/1 L310">>,<<"chttpd:handle_request_int/1 L249">>,<<"mochiweb_http:headers/6 L150">>,<<"proc_lib:init_p_do_apply/3 L226">>] [notice] 2022-06-23T06:23:45.274005Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30706.22> 45a001ddb1 192.168.230.108:5984 10.1.2.179 undefined GET /_up 500 ok 15095 [notice] 2022-06-23T06:23:45.274958Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30691.22> c975e456eb 192.168.230.108:5984 10.1.2.179 undefined GET /_up 500 ok 34630 [error] 2022-06-23T06:23:45.915833Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.23685.4> -------- gen_server couch_prometheus_server terminated with reason: {timeout,{gen_server,call,[couch_stats_aggregator,fetch]}} at gen_server:call/2(line:238) <= couch_stats_aggregator:fetch/0(line:44) <= couch_prometheus_server:get_couchdb_stats/0(line:94) <= couch_prometheus_server:refresh_metrics/0(line:87) <= couch_prometheus_server:handle_info/2(line:74) <= gen_server:try_dispatch/4(line:689) <= gen_server:handle_msg/6(line:765) <= proc_lib:init_p_do_apply/3(line:226) last msg: redacted state: {st,<<"# TYPE couchdb_couch_log_requests_total counter\ncouchdb_couch_log_requests_total{level=\"alert\"} 0\ncouchdb_couch_log_requests_total{level=\"critical\"} 0\ncouchdb_couch_log_requests_total{level=\"debug\"} 0\ncouchdb_couch_log_requests_total{level=\"emergency\"} 0\ncouchdb_couch_log_requests_total{level=\"error\"} 0\ncouchdb_couch_log_requests_total{level=\"info\"} 7\ncouchdb_couch_log_requests_total{level=\"notice\"} 18573\ncouchdb_couch_log_requests_total{level=\"warning\"} 0\n# TYPE couchdb_couch_replicator_changes_manager_deaths_total counter\ncouchdb_couch_replicator_changes_manager_deaths_total 0\n# TYPE couchdb_couch_replicator_changes_queue_deaths_total 
counter\ncouchdb_couch_replicator_changes_queue_deaths_total 0\n# TYPE couchdb_couch_replicator_changes_read_failures_total counter\ncouchdb_couch_replicator_changes_read_failures_total 0\n# TYPE couchdb_couch_replicator_changes_reader_deaths_total counter\ncouchdb_couch_replicator_changes_reader_deaths_total 0\n# TYPE couchdb_couch_replicator_checkpoints_failure_total counter\ncouchdb_couch_replicator_checkpoints_failure_total 0\n# TYPE couchdb_couch_replicator_checkpoints_total counter\ncouchdb_couch_replicator_checkpoints_total 0\n# TYPE couchdb_couch_replicator_cluster_is_stable gauge\ncouchdb_couch_replicator_cluster_is_stable 1\n# TYPE couchdb_couch_replicator_connection_acquires_total counter\ncouchdb_couch_replicator_connection_acquires_total 0\n# TYPE couchdb_couch_replicator_connection_closes_total counter\ncouchdb_couch_replicator_connection_closes_total 0\n# TYPE couchdb_couch_replicator_connection_creates_total counter\ncouchdb_couch_replicator_connection_creates_total 0\n# TYPE couchdb_couch_replicator_connection_owner_crashes_total counter\ncouchdb_couch_replicator_connection_owner_crashes_total 0\n# TYPE couchdb_couch_replicator_connection_releases_total counter\ncouchdb_couch_replicator_connection_releases_total 0\n# TYPE couchdb_couch_replicator_connection_worker_crashes_total counter\ncouchdb_couch_replicator_connection_worker_crashes_total 0\n# TYPE couchdb_couch_replicator_db_scans_total counter\ncouchdb_couch_replicator_db_scans_total 1\n# TYPE couchdb_couch_replicator_docs_completed_state_updates_total counter\ncouchdb_couch_replicator_docs_completed_state_updates_total 0\n# TYPE couchdb_couch_replicator_docs_db_changes_total counter\ncouchdb_couch_replicator_docs_db_changes_total 0\n# TYPE couchdb_couch_replicator_docs_dbs_created_total counter\ncouchdb_couch_replicator_docs_dbs_created_total 0\n# TYPE couchdb_couch_replicator_docs_dbs_deleted_total counter\ncouchdb_couch_replicator_docs_dbs_deleted_total 0\n# TYPE 
couchdb_couch_replicator_docs_dbs_found_total counter\ncouchdb_couch_replicator_docs_dbs_found_total 2\n# TYPE couchdb_couch_replicator_docs_failed_state_updates_total counter\ncouchdb_couch_replicator_docs_failed_state_updates_total 0\n# TYPE couchdb_couch_replicator_failed_starts_total counter\ncouchdb_couch_replicator_failed_starts_total 0\n# TYPE couchdb_couch_replicator_jobs_adds_total counter\ncouchdb_couch_replicator_jobs_adds_total 0\n# TYPE couchdb_couch_replicator_jobs_crashed gauge\ncouchdb_couch_replicator_jobs_crashed 0\n# TYPE couchdb_couch_replicator_jobs_crashes_total counter\ncouchdb_couch_replicator_jobs_crashes_total 0\n# TYPE couchdb_couch_replicator_jobs_duplicate_adds_total counter\ncouchdb_couch_replicator_jobs_duplicate_adds_total 0\n# TYPE couchdb_couch_replicator_jobs_pending gauge\ncouchdb_couch_replicator_jobs_pending 0\n# TYPE couchdb_couch_replicator_jobs_removes_total counter\ncouchdb_couch_replicator_jobs_removes_total 0\n# TYPE couchdb_couch_replicator_jobs_running gauge\ncouchdb_couch_replicator_jobs_running 0\n# TYPE couchdb_couch_replicator_jobs_starts_total counter\ncouchdb_couch_replicator_jobs_starts_total 0\n# TYPE couchdb_couch_replicator_jobs_stops_total counter\ncouchdb_couch_replicator_jobs_stops_total 0\n# TYPE couchdb_couch_replicator_jobs_total gauge\ncouchdb_couch_replicator_jobs_total 0\n# TYPE couchdb_couch_replicator_requests_total counter\ncouchdb_couch_replicator_requests_total 0\n# TYPE couchdb_couch_replicator_responses_failure_total counter\ncouchdb_couch_replicator_responses_failure_total 0\n# TYPE couchdb_couch_replicator_responses_total counter\ncouchdb_couch_replicator_responses_total 0\n# TYPE couchdb_couch_replicator_stream_responses_failure_total counter\ncouchdb_couch_replicator_stream_responses_failure_total 0\n# TYPE couchdb_couch_replicator_stream_responses_total counter\ncouchdb_couch_replicator_stream_responses_total 0\n# TYPE couchdb_couch_replicator_worker_deaths_total 
counter\ncouchdb_couch_replicator_worker_deaths_total 0\n# TYPE couchdb_couch_replicator_workers_started_total counter\ncouchdb_couch_replicator_workers_started_total 0\n# TYPE couchdb_auth_cache_requests_total counter\ncouchdb_auth_cache_requests_total 0\n# TYPE couchdb_auth_cache_misses_total counter\ncouchdb_auth_cache_misses_total 0\n# TYPE couchdb_collect_results_time_seconds summary\ncouchdb_collect_results_time_seconds{quantile=\"0.5\"} 0.0\ncouchdb_collect_results_time_seconds{quantile=\"0.75\"} 0.0\ncouchdb_collect_results_time_seconds{quantile=\"0.9\"} 0.0\ncouchdb_collect_results_time_seconds{quantile=\"0.95\"} 0.0\ncouchdb_collect_results_time_seconds{quantile=\"0.99\"} 0.0\ncouchdb_collect_results_time_seconds{quantile=\"0.999\"} 0.0\ncouchdb_collect_results_time_seconds_sum 0.0\ncouchdb_collect_results_time_seconds_count 0\n# TYPE couchdb_couch_server_lru_skip_total counter\ncouchdb_couch_server_lru_skip_total 0\n# TYPE couchdb_database_purges_total counter\ncouchdb_database_purges_total 0\n# TYPE couchdb_database_reads_total counter\ncouchdb_database_reads_total 24\n# TYPE couchdb_database_writes_total counter\ncouchdb_database_writes_total 0\n# TYPE couchdb_db_open_time_seconds summary\ncouchdb_db_open_time_seconds{quantile=\"0.5\"} 0.0\ncouchdb_db_open_time_seconds{quantile=\"0.75\"} 0.0\ncouchdb_db_open_time_seconds{quantile=\"0.9\"} 0.0\ncouchdb_db_open_time_seconds{quantile=\"0.95\"} 0.0\ncouchdb_db_open_time_seconds{quantile=\"0.99\"} 0.0\ncouchdb_db_open_time_seconds{quantile=\"0.999\"} 0.0\ncouchdb_db_open_time_seconds_sum 0.0\ncouchdb_db_open_time_seconds_count 0\n# TYPE couchdb_dbinfo_seconds summary\ncouchdb_dbinfo_seconds{quantile=\"0.5\"} 0.0\ncouchdb_dbinfo_seconds{quantile=\"0.75\"} 0.0\ncouchdb_dbinfo_seconds{quantile=\"0.9\"} 0.0\ncouchdb_dbinfo_seconds{quantile=\"0.95\"} 0.0\ncouchdb_dbinfo_seconds{quantile=\"0.99\"} 0.0\ncouchdb_dbinfo_seconds{quantile=\"0.999\"} 0.0\ncouchdb_dbinfo_seconds_sum 0.0\ncouchdb_dbinfo_seconds_count 
0\n# TYPE couchdb_document_inserts_total counter\ncouchdb_document_inserts_total 7\n# TYPE couchdb_document_purges_failure_total counter\ncouchdb_document_purges_failure_total 0\n# TYPE couchdb_document_purges_success_total counter\ncouchdb_document_purges_success_total 0\n# TYPE couchdb_document_purges_total_total counter\ncouchdb_document_purges_total_total 0\n# TYPE couchdb_document_writes_total counter\ncouchdb_document_writes_total 14\n# TYPE couchdb_httpd_aborted_requests_total counter\ncouchdb_httpd_aborted_requests_total 0\n# TYPE couchdb_httpd_all_docs_timeouts_total counter\ncouchdb_httpd_all_docs_timeouts_total 0\n# TYPE couchdb_httpd_bulk_docs_seconds summary\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.5\"} 0.0\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.75\"} 0.0\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.9\"} 0.0\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.95\"} 0.0\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.99\"} 0.0\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.999\"} 0.0\ncouchdb_httpd_bulk_docs_seconds_sum 0.0\ncouchdb_httpd_bulk_docs_seconds_count 0\n# TYPE couchdb_httpd_bulk_requests_total counter\ncouchdb_httpd_bulk_requests_total 0\n# TYPE couchdb_httpd_clients_requesting_changes_total counter\ncouchdb_httpd_clients_requesting_changes_total 0\n...">>,...} extra: [] [error] 2022-06-23T06:23:45.938589Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.23685.4> -------- gen_server couch_prometheus_server terminated with reason: {timeout,{gen_server,call,[couch_stats_aggregator,fetch]}} at gen_server:call/2(line:238) <= couch_stats_aggregator:fetch/0(line:44) <= couch_prometheus_server:get_couchdb_stats/0(line:94) <= couch_prometheus_server:refresh_metrics/0(line:87) <= couch_prometheus_server:handle_info/2(line:74) <= gen_server:try_dispatch/4(line:689) <= gen_server:handle_msg/6(line:765) <= proc_lib:init_p_do_apply/3(line:226) last msg: redacted state: {st,<<"# TYPE couchdb_couch_log_requests_total 
counter\ncouchdb_couch_log_requests_total{level=\"alert\"} 0\ncouchdb_couch_log_requests_total{level=\"critical\"} 0\ncouchdb_couch_log_requests_total{level=\"debug\"} 0\ncouchdb_couch_log_requests_total{level=\"emergency\"} 0\ncouchdb_couch_log_requests_total{level=\"error\"} 0\ncouchdb_couch_log_requests_total{level=\"info\"} 7\ncouchdb_couch_log_requests_total{level=\"notice\"} 18573\ncouchdb_couch_log_requests_total{level=\"warning\"} 0\n# TYPE couchdb_couch_replicator_changes_manager_deaths_total counter\ncouchdb_couch_replicator_changes_manager_deaths_total 0\n# TYPE couchdb_couch_replicator_changes_queue_deaths_total counter\ncouchdb_couch_replicator_changes_queue_deaths_total 0\n# TYPE couchdb_couch_replicator_changes_read_failures_total counter\ncouchdb_couch_replicator_changes_read_failures_total 0\n# TYPE couchdb_couch_replicator_changes_reader_deaths_total counter\ncouchdb_couch_replicator_changes_reader_deaths_total 0\n# TYPE couchdb_couch_replicator_checkpoints_failure_total counter\ncouchdb_couch_replicator_checkpoints_failure_total 0\n# TYPE couchdb_couch_replicator_checkpoints_total counter\ncouchdb_couch_replicator_checkpoints_total 0\n# TYPE couchdb_couch_replicator_cluster_is_stable gauge\ncouchdb_couch_replicator_cluster_is_stable 1\n# TYPE couchdb_couch_replicator_connection_acquires_total counter\ncouchdb_couch_replicator_connection_acquires_total 0\n# TYPE couchdb_couch_replicator_connection_closes_total counter\ncouchdb_couch_replicator_connection_closes_total 0\n# TYPE couchdb_couch_replicator_connection_creates_total counter\ncouchdb_couch_replicator_connection_creates_total 0\n# TYPE couchdb_couch_replicator_connection_owner_crashes_total counter\ncouchdb_couch_replicator_connection_owner_crashes_total 0\n# TYPE couchdb_couch_replicator_connection_releases_total counter\ncouchdb_couch_replicator_connection_releases_total 0\n# TYPE couchdb_couch_replicator_connection_worker_crashes_total 
counter\ncouchdb_couch_replicator_connection_worker_crashes_total 0\n# TYPE couchdb_couch_replicator_db_scans_total counter\ncouchdb_couch_replicator_db_scans_total 1\n# TYPE couchdb_couch_replicator_docs_completed_state_updates_total counter\ncouchdb_couch_replicator_docs_completed_state_updates_total 0\n# TYPE couchdb_couch_replicator_docs_db_changes_total counter\ncouchdb_couch_replicator_docs_db_changes_total 0\n# TYPE couchdb_couch_replicator_docs_dbs_created_total counter\ncouchdb_couch_replicator_docs_dbs_created_total 0\n# TYPE couchdb_couch_replicator_docs_dbs_deleted_total counter\ncouchdb_couch_replicator_docs_dbs_deleted_total 0\n# TYPE couchdb_couch_replicator_docs_dbs_found_total counter\ncouchdb_couch_replicator_docs_dbs_found_total 2\n# TYPE couchdb_couch_replicator_docs_failed_state_updates_total counter\ncouchdb_couch_replicator_docs_failed_state_updates_total 0\n# TYPE couchdb_couch_replicator_failed_starts_total counter\ncouchdb_couch_replicator_failed_starts_total 0\n# TYPE couchdb_couch_replicator_jobs_adds_total counter\ncouchdb_couch_replicator_jobs_adds_total 0\n# TYPE couchdb_couch_replicator_jobs_crashed gauge\ncouchdb_couch_replicator_jobs_crashed 0\n# TYPE couchdb_couch_replicator_jobs_crashes_total counter\ncouchdb_couch_replicator_jobs_crashes_total 0\n# TYPE couchdb_couch_replicator_jobs_duplicate_adds_total counter\ncouchdb_couch_replicator_jobs_duplicate_adds_total 0\n# TYPE couchdb_couch_replicator_jobs_pending gauge\ncouchdb_couch_replicator_jobs_pending 0\n# TYPE couchdb_couch_replicator_jobs_removes_total counter\ncouchdb_couch_replicator_jobs_removes_total 0\n# TYPE couchdb_couch_replicator_jobs_running gauge\ncouchdb_couch_replicator_jobs_running 0\n# TYPE couchdb_couch_replicator_jobs_starts_total counter\ncouchdb_couch_replicator_jobs_starts_total 0\n# TYPE couchdb_couch_replicator_jobs_stops_total counter\ncouchdb_couch_replicator_jobs_stops_total 0\n# TYPE couchdb_couch_replicator_jobs_total 
gauge\ncouchdb_couch_replicator_jobs_total 0\n# TYPE couchdb_couch_replicator_requests_total counter\ncouchdb_couch_replicator_requests_total 0\n# TYPE couchdb_couch_replicator_responses_failure_total counter\ncouchdb_couch_replicator_responses_failure_total 0\n# TYPE couchdb_couch_replicator_responses_total counter\ncouchdb_couch_replicator_responses_total 0\n# TYPE couchdb_couch_replicator_stream_responses_failure_total counter\ncouchdb_couch_replicator_stream_responses_failure_total 0\n# TYPE couchdb_couch_replicator_stream_responses_total counter\ncouchdb_couch_replicator_stream_responses_total 0\n# TYPE couchdb_couch_replicator_worker_deaths_total counter\ncouchdb_couch_replicator_worker_deaths_total 0\n# TYPE couchdb_couch_replicator_workers_started_total counter\ncouchdb_couch_replicator_workers_started_total 0\n# TYPE couchdb_auth_cache_requests_total counter\ncouchdb_auth_cache_requests_total 0\n# TYPE couchdb_auth_cache_misses_total counter\ncouchdb_auth_cache_misses_total 0\n# TYPE couchdb_collect_results_time_seconds summary\ncouchdb_collect_results_time_seconds{quantile=\"0.5\"} 0.0\ncouchdb_collect_results_time_seconds{quantile=\"0.75\"} 0.0\ncouchdb_collect_results_time_seconds{quantile=\"0.9\"} 0.0\ncouchdb_collect_results_time_seconds{quantile=\"0.95\"} 0.0\ncouchdb_collect_results_time_seconds{quantile=\"0.99\"} 0.0\ncouchdb_collect_results_time_seconds{quantile=\"0.999\"} 0.0\ncouchdb_collect_results_time_seconds_sum 0.0\ncouchdb_collect_results_time_seconds_count 0\n# TYPE couchdb_couch_server_lru_skip_total counter\ncouchdb_couch_server_lru_skip_total 0\n# TYPE couchdb_database_purges_total counter\ncouchdb_database_purges_total 0\n# TYPE couchdb_database_reads_total counter\ncouchdb_database_reads_total 24\n# TYPE couchdb_database_writes_total counter\ncouchdb_database_writes_total 0\n# TYPE couchdb_db_open_time_seconds summary\ncouchdb_db_open_time_seconds{quantile=\"0.5\"} 0.0\ncouchdb_db_open_time_seconds{quantile=\"0.75\"} 
0.0\ncouchdb_db_open_time_seconds{quantile=\"0.9\"} 0.0\ncouchdb_db_open_time_seconds{quantile=\"0.95\"} 0.0\ncouchdb_db_open_time_seconds{quantile=\"0.99\"} 0.0\ncouchdb_db_open_time_seconds{quantile=\"0.999\"} 0.0\ncouchdb_db_open_time_seconds_sum 0.0\ncouchdb_db_open_time_seconds_count 0\n# TYPE couchdb_dbinfo_seconds summary\ncouchdb_dbinfo_seconds{quantile=\"0.5\"} 0.0\ncouchdb_dbinfo_seconds{quantile=\"0.75\"} 0.0\ncouchdb_dbinfo_seconds{quantile=\"0.9\"} 0.0\ncouchdb_dbinfo_seconds{quantile=\"0.95\"} 0.0\ncouchdb_dbinfo_seconds{quantile=\"0.99\"} 0.0\ncouchdb_dbinfo_seconds{quantile=\"0.999\"} 0.0\ncouchdb_dbinfo_seconds_sum 0.0\ncouchdb_dbinfo_seconds_count 0\n# TYPE couchdb_document_inserts_total counter\ncouchdb_document_inserts_total 7\n# TYPE couchdb_document_purges_failure_total counter\ncouchdb_document_purges_failure_total 0\n# TYPE couchdb_document_purges_success_total counter\ncouchdb_document_purges_success_total 0\n# TYPE couchdb_document_purges_total_total counter\ncouchdb_document_purges_total_total 0\n# TYPE couchdb_document_writes_total counter\ncouchdb_document_writes_total 14\n# TYPE couchdb_httpd_aborted_requests_total counter\ncouchdb_httpd_aborted_requests_total 0\n# TYPE couchdb_httpd_all_docs_timeouts_total counter\ncouchdb_httpd_all_docs_timeouts_total 0\n# TYPE couchdb_httpd_bulk_docs_seconds summary\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.5\"} 0.0\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.75\"} 0.0\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.9\"} 0.0\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.95\"} 0.0\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.99\"} 0.0\ncouchdb_httpd_bulk_docs_seconds{quantile=\"0.999\"} 0.0\ncouchdb_httpd_bulk_docs_seconds_sum 0.0\ncouchdb_httpd_bulk_docs_seconds_count 0\n# TYPE couchdb_httpd_bulk_requests_total counter\ncouchdb_httpd_bulk_requests_total 0\n# TYPE couchdb_httpd_clients_requesting_changes_total counter\ncouchdb_httpd_clients_requesting_changes_total 0\n...">>,...} extra: [] 
[error] 2022-06-23T06:23:45.953809Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.23685.4> -------- CRASH REPORT Process couch_prometheus_server (<0.23685.4>) with 0 neighbors exited with reason: {timeout,{gen_server,call,[couch_stats_aggregator,fetch]}} at gen_server:call/2(line:238) <= couch_stats_aggregator:fetch/0(line:44) <= couch_prometheus_server:get_couchdb_stats/0(line:94) <= couch_prometheus_server:refresh_metrics/0(line:87) <= couch_prometheus_server:handle_info/2(line:74) <= gen_server:try_dispatch/4(line:689) <= gen_server:handle_msg/6(line:765) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {couch_prometheus_server,init,['Argument1']}, ancestors: [couch_prometheus_sup,<0.251.0>], message_queue_len: 1, links: [<0.252.0>], dictionary: [], trap_exit: false, status: running, heap_size: 46422, stack_size: 28, reductions: 5547311068 [error] 2022-06-23T06:23:45.954202Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.23685.4> -------- CRASH REPORT Process couch_prometheus_server (<0.23685.4>) with 0 neighbors exited with reason: {timeout,{gen_server,call,[couch_stats_aggregator,fetch]}} at gen_server:call/2(line:238) <= couch_stats_aggregator:fetch/0(line:44) <= couch_prometheus_server:get_couchdb_stats/0(line:94) <= couch_prometheus_server:refresh_metrics/0(line:87) <= couch_prometheus_server:handle_info/2(line:74) <= gen_server:try_dispatch/4(line:689) <= gen_server:handle_msg/6(line:765) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {couch_prometheus_server,init,['Argument1']}, ancestors: [couch_prometheus_sup,<0.251.0>], message_queue_len: 1, links: [<0.252.0>], dictionary: [], trap_exit: false, status: running, heap_size: 46422, stack_size: 28, reductions: 5547311068 [error] 2022-06-23T06:23:46.044832Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.252.0> -------- Supervisor couch_prometheus_sup had child couch_prometheus_server started with 
couch_prometheus_server:start_link() at <0.23685.4> exit with reason {timeout,{gen_server,call,[couch_stats_aggregator,fetch]}} in context child_terminated [error] 2022-06-23T06:23:46.044957Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.252.0> -------- Supervisor couch_prometheus_sup had child couch_prometheus_server started with couch_prometheus_server:start_link() at <0.23685.4> exit with reason {timeout,{gen_server,call,[couch_stats_aggregator,fetch]}} in context child_terminated [notice] 2022-06-23T06:23:46.364711Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30722.22> 4688407aa4 192.168.230.108:5984 10.1.2.179 undefined GET /_up 200 ok 49 [notice] 2022-06-23T06:24:06.671559Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30883.22> 54b4dc40cb 192.168.230.108:5984 10.1.2.179 undefined GET /_up 200 ok 150 [notice] 2022-06-23T06:24:09.260972Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30884.22> e34d093bcd 192.168.230.108:5984 10.1.2.179 undefined GET /_up 200 ok 1660 [info] 2022-06-23T06:24:09.602627Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.40.0> -------- SIGTERM received - shutting down

[info] 2022-06-23T06:24:09.602724Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.40.0> -------- SIGTERM received - shutting down

[notice] 2022-06-23T06:24:14.600448Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.30916.22> c6896d1f00 192.168.230.108:5984 10.1.2.179 undefined GET /_up 200 ok 56 [error] 2022-06-23T06:24:18.961753Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.811.0> -------- gen_server <0.811.0> terminated with reason: killed last msg: redacted state: {state,#Ref<0.3717146181.405405699.170850>,couch_replicator_doc_processor,nil,<<"_replicator">>,#Ref<0.3717146181.405274627.170851>,nil,[],true} extra: [] [error] 2022-06-23T06:24:18.962005Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.811.0> -------- gen_server <0.811.0> terminated with reason: killed last msg: redacted state: {state,#Ref<0.3717146181.405405699.170850>,couch_replicator_doc_processor,nil,<<"_replicator">>,#Ref<0.3717146181.405274627.170851>,nil,[],true} extra: [] [error] 2022-06-23T06:24:18.962362Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.811.0> -------- CRASH REPORT Process (<0.811.0>) with 0 neighbors exited with reason: killed at gen_server:decode_msg/9(line:475) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {couch_multidb_changes,init,['Argument1']}, ancestors: [<0.692.0>,couch_replicator_sup,<0.668.0>], message_queue_len: 0, links: [], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 28, reductions: 181192 [error] 2022-06-23T06:24:19.005925Z couchdb@couchdb-couchdb-0.couchdb-couchdb.couchdb.svc.cluster.local <0.811.0> -------- CRASH REPORT Process (<0.811.0>) with 0 neighbors exited with reason: killed at gen_server:decode_msg/9(line:475) <= proc_lib:init_p_do_apply/3(line:226); initial_call: {couch_multidb_changes,init,['Argument1']}, ancestors: [<0.692.0>,couch_replicator_sup,<0.668.0>], message_queue_len: 0, links: [], dictionary: [], trap_exit: true, status: running, heap_size: 1598, stack_size: 28, reductions: 181192`

Version of Helm and Kubernetes: Helm Version: 3.4.0 Kubernetes Version: 1.18.3

What happened: The coordinator pod would routinely restart due to the error shown above in the logs

What you expected to happen: All pods should remain running without restarting

How to reproduce it (as minimally and precisely as possible): The issue occurs randomly and restarts with the error shown in the logs.

Anything else we need to know: We have the Helm chart deployed in both a testing and a production Kubernetes environment, and both environments show the same behaviour. The db only has a small amount of data in it and the pods do not have any CPU or memory restrictions. The pods are configured with 16Gb local-path PVs. Average memory usage is 56Mb and CPU usage is below 0.03.

colearendt commented 2 years ago

FWIW liveness probes can be tricky and are much discussed in the community (e.g. here and here), but the one in this chart is on by default.

You can disable it by setting livenessProbe.enabled = false: https://github.com/apache/couchdb-helm/blob/78eff8c0fc3d8524f1a5c0c27880eaf2df98a2f4/couchdb/values.yaml#L191-L197

I know little about the underlying implementation here, but it's possible that the default should be changed and livenessProbe should be turned off 🤷 Or its configuration should be improved so flapping like this is less likely. At a minimum, failureThreshold should probably be increased to 10 or so (as the latter article recommends).
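To put the failureThreshold suggestion in rough numbers, here is a minimal sketch of how long a node can stay unresponsive before the kubelet restarts it. It assumes the Kubernetes default probe settings (periodSeconds 10, timeoutSeconds 1, failureThreshold 3), which may differ from what this chart actually sets, and it assumes every probe sent during a stall fails at the timeout. The function name `restart_window` is made up for illustration.

```python
def restart_window(failure_threshold=3, period_seconds=10, timeout_seconds=1):
    """Return (shortest, longest) stall length in seconds.

    A stall shorter than `shortest` can never produce `failure_threshold`
    consecutive probe failures; a stall of at least `longest` always does.
    Assumes timeout_seconds < period_seconds and that every probe sent
    during the stall fails by timing out.
    """
    # Best case for a restart: the stall begins just before a probe fires,
    # so the Nth failure lands (N - 1) periods plus one timeout later.
    shortest = (failure_threshold - 1) * period_seconds + timeout_seconds
    # Worst case: the stall begins just after a probe succeeded, so a full
    # extra period passes before the first failing probe is even sent.
    longest = failure_threshold * period_seconds + timeout_seconds
    return shortest, longest


if __name__ == "__main__":
    # Longest /_up response time in the posted log (34630 ms), in seconds.
    stall_seconds = 34.63
    for threshold in (3, 10):
        shortest, longest = restart_window(failure_threshold=threshold)
        print(f"failureThreshold={threshold}: restart possible after "
              f"{shortest}s, certain after {longest}s "
              f"(observed stall {stall_seconds}s)")
```

Under these assumptions a threshold of 3 tolerates a stall of somewhere between 21 and 31 seconds, so a stall as long as the slowest logged request (about 35 seconds) would trigger a restart, while a threshold of 10 raises the window to 91 to 101 seconds.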

cg1972 commented 2 years ago

From the logs it looks like the readiness probe is failing because the check is taking much longer than usual. The logs show response times of 29613 and 14872 ms, where a normal response would be < 10 ms. Something appears to be happening in CouchDB that slows the responses down. While this is occurring there are no obvious CPU or memory usage spikes either.

Is there any additional debugging that can be enabled to try and capture what couchdb might be doing when it crashes? Given that this deployed service is hardly used at the moment it is concerning that this type of thing would occur when it is under no load.
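One way to gather more data without touching CouchDB's own logging is to poll `/_up` from outside the pod and record only the slow or failed responses, so the timestamps can be lined up with other events in the cluster. The following is a minimal sketch, assuming the node is reachable at `http://localhost:5984` (for example through `kubectl port-forward`); the URL, the 1000 ms threshold, and the function names are illustrative choices, not anything provided by the chart.

```python
import time
import urllib.error
import urllib.request


def probe_once(url, timeout_s=60.0):
    """Issue one GET and return (status, elapsed_ms).

    status is the HTTP status code, or None if the request timed out or
    the connection failed.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            status = resp.status
    except urllib.error.HTTPError as exc:
        # Non-2xx responses (such as the 500s in the log) arrive here.
        status = exc.code
    except OSError:
        # Covers URLError, socket timeouts and refused connections.
        status = None
    elapsed_ms = (time.monotonic() - start) * 1000.0
    return status, elapsed_ms


def is_slow(status, elapsed_ms, slow_ms=1000.0):
    """True if the response failed or took longer than slow_ms."""
    return status != 200 or elapsed_ms > slow_ms


def watch(url, interval_s=1.0, slow_ms=1000.0):
    """Poll url forever, printing a line for each slow or failed response."""
    while True:
        status, elapsed_ms = probe_once(url)
        if is_slow(status, elapsed_ms, slow_ms):
            stamp = time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime())
            print(f"{stamp} status={status} elapsed_ms={elapsed_ms:.0f}",
                  flush=True)
        time.sleep(interval_s)
```

Calling `watch("http://localhost:5984/_up")` runs until interrupted. Because the long client timeout lets a slow request finish instead of cutting it off, the recorded durations show how long each stall really lasted rather than just the fact that a probe timed out.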

colearendt commented 2 years ago

Whoops - apologies for never responding here. I must admit that I am not super familiar with the internals of couchdb - I would definitely recommend turning off livenessProbe though. If the readinessProbe fails, that is fine - traffic may not get routed to the pod briefly, but the pod stays alive. Moreover, you can increase failureThreshold, periodSeconds, or timeoutSeconds to make it check less often / allow for more variance.

livenessProbe should only be used when "the best thing to do for the service is to kill it." I highly doubt that that is the case if it responds a smidge slowly on occasion (during a sync or some such).