codership / glb

Galera Load Balancer - a simple TCP connection proxy and load-balancing library
GNU General Public License v2.0

Connection_timeout for mysql watchdog #3

Open TomiMikola opened 10 years ago

TomiMikola commented 10 years ago

mysql.sh (line 53 in release 1.0.1) defines no 'connect_timeout', which defaults to 0 in the mysql client. With the default value the watchdog does not properly detect a backend server crash. Setting 'connect_timeout' to a reasonably short value gives the desired effect of dropping the crashed backend from the pool.

One way to set the 'connect_timeout' parameter is to use the OTHER_OPTIONS variable (in glbd.cfg):

OTHER_OPTIONS="-w exec:'mysql.sh --connect_timeout=1 -uglbpinger -pingerpwd'"

This should be noted in the comments of the files/glbd.cfg file. An alternative approach would be to include the connect_timeout parameter at line 53 of files/mysql.sh, using a variable to set the timeout value, as sketched below.
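A minimal sketch of that alternative, assuming the script's mysql invocation can take extra options; this is not the shipped mysql.sh, and the CONNECT_TIMEOUT variable name and 1-second default are illustrative only:

# sketch only: timeout-handling fragment, not the full shipped mysql.sh
# default connect_timeout to 1 second unless the caller overrides it
CONNECT_TIMEOUT=${CONNECT_TIMEOUT:-1}
# wherever the script invokes the mysql client, pass the option along
# with whatever options the watchdog already supplies (e.g. -uglbpinger)
mysql --connect_timeout="$CONNECT_TIMEOUT" "$@"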

ayurchen commented 10 years ago

Tomi, thanks for bringing this up. Could you clarify how the default timeout of 0 prevents detecting a server crash? I never had issues with that. Perhaps you mean false detection of server unavailability when the client can't connect immediately?

TomiMikola commented 10 years ago

This was the status two weeks ago:

[root@glb1 ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
      10.1.4.87:3306  :    2.000   1.004    N/A   -266
      10.1.4.88:3306  :    4.000   1.004    N/A   -235
      10.1.4.86:3306  :    1.000   1.011    N/A    -95
------------------------------------------------------
Destinations: 3, total connections: -596 of 10000 max

One of the servers (10.1.4.88) had crashed: all connections were dropping silently (I couldn't even ssh to it). Unfortunately I'm unable to reproduce the same state now.

What I was able to test was restricting connections from the glb server to the backends.

No firewall restrictions:

[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
     10.1.4.218:3306  :   10.000   0.000    N/A      0
     10.1.4.217:3306  :   10.000   0.000    N/A      0
     10.1.4.214:3306  :   10.000   0.000    N/A      0
------------------------------------------------------
Destinations: 3, total connections: 0 of 10000 max

Access from glb-stage to 10.1.4.218 dropped, with no connect_timeout defined for the watchdog:

[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
     10.1.4.218:3306  :    0.000    -nan    N/A      0
     10.1.4.217:3306  :   10.000   0.000    N/A      0
     10.1.4.214:3306  :   10.000   0.000    N/A      0
------------------------------------------------------
Destinations: 3, total connections: 0 of 10000 max

Firewall-restricted access from glb-stage to 10.1.4.218, with connect_timeout=1 defined for the watchdog:

[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
     10.1.4.214:3306  :   10.000   0.000    N/A      0
     10.1.4.217:3306  :   10.000   0.000    N/A      0
------------------------------------------------------
Destinations: 2, total connections: 0 of 10000 max

So it seems the watchdog works in both cases, although to me the latter behaviour (removing the destination entirely) seems more accurate when the backend is totally unreachable.
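For reference, a minimal way to simulate the restricted-access case from the glb host, assuming iptables is available; the address and port are the ones from the outputs above:

# silently drop outgoing MySQL traffic to one backend; connection attempts
# then hang until the client-side timeout, which connect_timeout bounds
iptables -A OUTPUT -d 10.1.4.218 -p tcp --dport 3306 -j DROP
# remove the rule afterwards to restore access
iptables -D OUTPUT -d 10.1.4.218 -p tcp --dport 3306 -j DROP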

ayurchen commented 10 years ago

Tomi, thanks, this makes the issue much clearer now. It looks like a bug in the watchdog backend; I'll see if it can be fixed there.

The negative connection count though looks far more disturbing... It would be good if there were a way to reproduce it.

TomiMikola commented 10 years ago

We hit this negative connection issue again today. The status looked like this:

[root@glb1 ~]# service glbd status
Router:
------------------------------------------------------
        Address       :   weight   usage    map  conns
      10.1.4.87:3306  :  100.000   1.004    N/A   -275
      10.1.4.88:3306  :   50.000   1.002    N/A   -464
      10.1.4.86:3306  :    1.000   1.002    N/A   -488
------------------------------------------------------
Destinations: 3, total connections: -1227 of 10000 max

All the backends were functioning normally, with nothing in the logs. Some sort of leak in the watchdog or in glbd itself?

TomiMikola commented 10 years ago

Any ideas to help debug this?

Edit 2014-01-03 22:06: Looking at the application log entries, I see a lot of "SQLSTATE[08004] [1040] Too many connections" errors a few minutes before the crash, and then dozens of deadlocks with the message "SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction". The max_connections variable was set to 500 on each of the three backends.
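For reference, the per-backend limit can be inspected and, if needed, raised at runtime like this; the value 1000 is purely illustrative, and persisting it would also require a my.cnf change:

# check the current limit on a backend
mysql -e "SHOW VARIABLES LIKE 'max_connections'"
# raise it at runtime (persist via [mysqld] max_connections in my.cnf)
mysql -e "SET GLOBAL max_connections = 1000"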

ayurchen commented 10 years ago

not right away :(
