Orange-OpenSource / galera-operator

Galera Operator automates tasks for managing a Galera cluster in Kubernetes
Apache License 2.0
34 stars 18 forks source link

Cluster crashing #15

Closed maxshilov closed 4 years ago

maxshilov commented 4 years ago

When I try to run sysbench tests against cluster created with example-manifests MySQL is crashing

Console with Sysbench output:

sysbench --db-debug=on --db-driver=mysql --mysql-user=root --mysql_password=test --mysql-db=test_db --mysql-host=10.109.107.2 ./sysbench/src/lua/oltp_write_only.lua run
sysbench 1.1.0-bbee5d5 (using bundled LuaJIT 2.1.0-beta3)

Running the test with following options:
Number of threads: 1
Initializing random number generator from current time

Initializing worker threads...

Threads started!

FATAL: mysql_drv_query() returned error 2013 (Lost connection to MySQL server during query) for query 'BEGIN'
FATAL: `thread_run' function failed: /usr/local/share/sysbench/oltp_common.lua:412: SQL error, errno = 2013, state = 'HY000': Lost connection to MySQL server during query

logs gathered with kubectl logs -l galera.v1beta2.sql.databases/galera-namespace=galera -f -c galera -n galera:

2020-10-20 19:03:49 3 [Note] WSREP:     id: 4a7546ae-1303-11eb-b229-7a289af995ab name: gal-g8jhx
2020-10-20 19:03:49 3 [Note] WSREP:     id: 6475a4d4-1303-11eb-907d-dfe54094ec8f name: gal-kjkxd
2020-10-20 19:03:49 3 [Note] WSREP: =================================================
2020-10-20 19:03:49 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2020-10-20 19:03:50 0 [Note] WSREP: Member 0.0 (gal-zsfp6) requested state transfer from '*any*'. Selected 1.0 (gal-g8jhx)(SYNCED) as donor.
2020-10-20 19:03:51 0 [Note] WSREP: (6475a4d4, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-10-20 19:04:02 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 30)
2020-10-20 19:04:02 1 [Note] WSREP: Server gal-g8jhx synced with group
2020-10-20 19:04:02 1 [Note] WSREP: Server status change joined -> synced
2020-10-20 19:04:02 1 [Note] WSREP: Synchronized with group, ready for connections
2020-10-20 19:04:02 1 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
WSREP_SST: [INFO] Total time on donor: 0 seconds (20201020 19:04:02.883)
WSREP_SST: [INFO] Cleaning up temporary directories (20201020 19:04:02.887)
2020-10-20 19:04:04 0 [Note] WSREP: async IST sender served
2020-10-20 19:04:04 0 [Note] WSREP: 0.0 (gal-zsfp6): State transfer from 1.0 (gal-g8jhx) complete.
2020-10-20 19:04:04 0 [Note] WSREP: Member 0.0 (gal-zsfp6) synced with group.
2020-10-20 19:04:02 0 [Note] WSREP: 1.0 (gal-g8jhx): State transfer to 0.0 (gal-zsfp6) complete.
2020-10-20 19:04:02 0 [Note] WSREP: Member 1.0 (gal-g8jhx) synced with group.
2020-10-20 19:04:04 0 [Note] WSREP: 0.0 (gal-zsfp6): State transfer from 1.0 (gal-g8jhx) complete.
2020-10-20 19:04:04 0 [Note] WSREP: Member 0.0 (gal-zsfp6) synced with group.
2020-10-20 19:04:04 0 [Note] WSREP: Receiving IST...100.0% (5/5 events) complete.
2020-10-20 19:04:04 3 [Note] WSREP: IST received: 364f2641-1303-11eb-b68f-c32d75c48920:30
2020-10-20 19:04:04 0 [Note] WSREP: 0.0 (gal-zsfp6): State transfer from 1.0 (gal-g8jhx) complete.
2020-10-20 19:04:04 0 [Note] WSREP: Shifting JOINER -> JOINED (TO: 30)
2020-10-20 19:04:04 0 [Note] WSREP: Member 0.0 (gal-zsfp6) synced with group.
2020-10-20 19:04:04 0 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 30)
2020-10-20 19:04:04 3 [Note] WSREP: Server gal-zsfp6 synced with group
2020-10-20 19:04:04 3 [Note] WSREP: Server status change joined -> synced
2020-10-20 19:04:04 3 [Note] WSREP: Synchronized with group, ready for connections
2020-10-20 19:04:04 3 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

mysqld: /home/buildbot/buildbot/build/mariadb-10.4.2/wsrep-lib/src/client_state.cpp:108: int wsrep::client_state::before_command(): Assertion `transaction_.active() == false || (transaction_.state() == wsrep::transaction::s_executing || transaction_.state() == wsrep::transaction::s_aborted || (transaction_.state() == wsrep::transaction::s_must_abort && server_state_.rollback_mode() == wsrep::server_state::rm_async))' failed.
201020 19:15:05 [ERROR] mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see https://mariadb.com/kb/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

Server version: 10.4.2-MariaDB-1:10.4.2+maria~bionic
key_buffer_size=134217728
read_buffer_size=131072
max_used_connections=1
max_threads=402
thread_count=10
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1015316 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fe24c00ae78
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fe268305dd8 thread_stack 0x49000
mysqld(my_print_stacktrace+0x2e)[0x55af6850161e]
mysqld(handle_fatal_signal+0x5a5)[0x55af67f8a015]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x12890)[0x7fe2899db890]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7fe288cebe97]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7fe288ced801]
/lib/x86_64-linux-gnu/libc.so.6(+0x3039a)[0x7fe288cdd39a]
/lib/x86_64-linux-gnu/libc.so.6(+0x30412)[0x7fe288cdd412]
mysqld(_ZN5wsrep12client_state14before_commandEv+0x503)[0x55af6856ed63]
mysqld(_Z10do_commandP3THD+0x1da)[0x55af67d99ada]
mysqld(_Z24do_handle_one_connectionP7CONNECT+0x2c2)[0x55af67e6a2a2]
mysqld(handle_one_connection+0x3d)[0x55af67e6a36d]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76db)[0x7fe2899d06db]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f)[0x7fe288dce88f]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0x0):
Connection ID (thread ID): 557
Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersection=off,engine_condition_pushdown=off,index_condition_pushdown=on,derived_merge=on,derived_with_keys=on,firstmatch=on,loosescan=on,materialization=on,in_to_exists=on,semijoin=on,partial_match_rowid_merge=on,partial_match_table_scan=on,subquery_cache=on,mrr=off,mrr_cost_based=off,mrr_sort_keys=off,outer_join_with_cache=on,semijoin_with_cache=on,join_cache_incremental=on,join_cache_hashed=on,join_cache_bka=on,optimize_join_buffer_size=off,table_elimination=on,extended_keys=on,exists_to_in=on,orderby_uses_equalities=on,condition_pushdown_for_derived=on,split_materialized=on,condition_pushdown_for_subquery=on

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
Fatal signal 11 while backtracing
2020-10-20 19:15:05 0 [Note] WSREP: (6475a4d4, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.17.0.5:4567
2020-10-20 19:15:05 0 [Note] WSREP: (3c45d8de, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.17.0.5:4567
2020-10-20 19:15:06 0 [Note] WSREP: (3c45d8de, 'tcp://0.0.0.0:4567') reconnecting to 4a7546ae (tcp://172.17.0.5:4567), attempt 0
2020-10-20 19:15:07 0 [Note] WSREP: (6475a4d4, 'tcp://0.0.0.0:4567') reconnecting to 4a7546ae (tcp://172.17.0.5:4567), attempt 0
2020-10-20 19:15:10 0 [Note] WSREP: evs::proto(3c45d8de, OPERATIONAL, view_id(REG,3c45d8de,5)) suspecting node: 4a7546ae
2020-10-20 19:15:10 0 [Note] WSREP: evs::proto(3c45d8de, OPERATIONAL, view_id(REG,3c45d8de,5)) suspected node without join message, declaring inactive
2020-10-20 19:15:11 0 [Note] WSREP: (3c45d8de, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.17.0.5:4567 timed out, no messages seen in PT3S
2020-10-20 19:15:11 0 [Note] WSREP: evs::proto(6475a4d4, GATHER, view_id(REG,3c45d8de,5)) suspecting node: 4a7546ae
2020-10-20 19:15:11 0 [Note] WSREP: evs::proto(6475a4d4, GATHER, view_id(REG,3c45d8de,5)) suspected node without join message, declaring inactive
2020-10-20 19:15:11 0 [Note] WSREP: (6475a4d4, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://172.17.0.5:4567 timed out, no messages seen in PT3S
2020-10-20 19:15:11 0 [Note] WSREP: declaring 3c45d8de at tcp://172.17.0.3:4567 stable
2020-10-20 19:15:11 0 [Note] WSREP: declaring 6475a4d4 at tcp://172.17.0.6:4567 stable
2020-10-20 19:15:11 0 [Note] WSREP: Node 3c45d8de state prim
2020-10-20 19:15:11 0 [Note] WSREP: Node 3c45d8de state prim
2020-10-20 19:15:11 0 [Note] WSREP: view(view_id(PRIM,3c45d8de,6) memb {
        3c45d8de,0
        6475a4d4,0
} joined {
} left {
} partitioned {
        4a7546ae,0
})
2020-10-20 19:15:11 0 [Note] WSREP: save pc into disk
2020-10-20 19:15:11 0 [Note] WSREP: view(view_id(PRIM,3c45d8de,6) memb {
        3c45d8de,0
        6475a4d4,0
} joined {
} left {
} partitioned {
        4a7546ae,0
})
2020-10-20 19:15:11 0 [Note] WSREP: save pc into disk
2020-10-20 19:15:11 0 [Note] WSREP: forgetting 4a7546ae (tcp://172.17.0.5:4567)
2020-10-20 19:15:11 0 [Note] WSREP: (3c45d8de, 'tcp://0.0.0.0:4567') turning message relay requesting off
2020-10-20 19:15:11 0 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 2
2020-10-20 19:15:11 0 [Note] WSREP: STATE_EXCHANGE: sent state UUID: 9370b8ea-1308-11eb-8fc3-7a2570d6c9f7
2020-10-20 19:15:11 0 [Note] WSREP: STATE EXCHANGE: sent state msg: 9370b8ea-1308-11eb-8fc3-7a2570d6c9f7
2020-10-20 19:15:11 0 [Note] WSREP: STATE EXCHANGE: got state msg: 9370b8ea-1308-11eb-8fc3-7a2570d6c9f7 from 0 (gal-zsfp6)
2020-10-20 19:15:11 0 [Note] WSREP: STATE EXCHANGE: got state msg: 9370b8ea-1308-11eb-8fc3-7a2570d6c9f7 from 1 (gal-kjkxd)
2020-10-20 19:15:11 0 [Note] WSREP: Quorum results:
        version    = 5,
        component  = PRIMARY,
        conf_id    = 5,
        members    = 2/2 (joined/total),
        act_id     = 31,
        last_appl. = 0,
        protocols  = 1/10/4 (gcs/repl/appl),
        vote policy= 0,
        group UUID = 364f2641-1303-11eb-b68f-c32d75c48920
2020-10-20 19:15:11 0 [Note] WSREP: Writing down CC checksum: d311c5ac d07ea68d 0b0c95ae e22db27a at offset 184
2020-10-20 19:15:11 0 [Note] WSREP: Flow-control interval: [23, 23]
2020-10-20 19:15:11 0 [Note] WSREP: Trying to continue unpaused monitor
2020-10-20 19:15:11 3 [Note] WSREP: ####### processing CC 32, local, ordered
2020-10-20 19:15:11 3 [Note] WSREP: ####### drain monitors upto 31
2020-10-20 19:15:11 3 [Note] WSREP: REPL Protocols: 10 (5, 3)
2020-10-20 19:15:11 3 [Note] WSREP: ####### My UUID: 3c45d8de-1303-11eb-b4c2-d27bc643c6ac
2020-10-20 19:15:11 3 [Note] WSREP: ####### ST not required
2020-10-20 19:15:11 3 [Note] WSREP: Skipping cert index reset`
maxshilov commented 4 years ago

looks like I'm hitting this bug - https://github.com/codership/wsrep-lib/issues/135

maxshilov commented 4 years ago

corresponding MariaDB bug - https://jira.mariadb.org/browse/MDEV-18571

Issue fixed by changing cluster image to 10.4.12