Open — TomiMikola opened this issue 10 years ago
Tomi, thanks for bringing this up. Could you clarify how a default timeout of 0 prevents detecting a server crash? I never had issues with that. Perhaps you mean false detection of server unavailability when the client can't connect immediately?
This was the status two weeks ago:
[root@glb1 ~]# service glbd status
Router:
------------------------------------------------------
Address : weight usage map conns
10.1.4.87:3306 : 2.000 1.004 N/A -266
10.1.4.88:3306 : 4.000 1.004 N/A -235
10.1.4.86:3306 : 1.000 1.011 N/A -95
------------------------------------------------------
Destinations: 3, total connections: -596 of 10000 max
One of the servers (10.1.4.88) had crashed and all connections were dropping silently (I couldn't even ssh to it). Unfortunately I'm unable to reproduce the same state now.
What I was able to test was restricting connections from the glb server to the backends.
No firewall restrictions:
[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
Address : weight usage map conns
10.1.4.218:3306 : 10.000 0.000 N/A 0
10.1.4.217:3306 : 10.000 0.000 N/A 0
10.1.4.214:3306 : 10.000 0.000 N/A 0
------------------------------------------------------
Destinations: 3, total connections: 0 of 10000 max
Dropping access from glb-stage to 10.1.4.218, with no connect_timeout defined for the watchdog:
[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
Address : weight usage map conns
10.1.4.218:3306 : 0.000 -nan N/A 0
10.1.4.217:3306 : 10.000 0.000 N/A 0
10.1.4.214:3306 : 10.000 0.000 N/A 0
------------------------------------------------------
Destinations: 3, total connections: 0 of 10000 max
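For reference, the "dropping access" restriction above can be reproduced with a firewall rule along these lines (a sketch assuming iptables on the balancer host; the address is taken from the test above). A silent DROP means no TCP RST comes back, so a client with no connect timeout just hangs:

```shell
# Silently drop outbound packets from the balancer to one backend's MySQL port,
# simulating a totally unreachable server (connects hang instead of failing fast).
iptables -A OUTPUT -d 10.1.4.218 -p tcp --dport 3306 -j DROP
```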
Firewall-restricted access from glb-stage to 10.1.4.218, with connect_timeout=1 defined for the watchdog:
[root@glb-stage ~]# service glbd status
Router:
------------------------------------------------------
Address : weight usage map conns
10.1.4.214:3306 : 10.000 0.000 N/A 0
10.1.4.217:3306 : 10.000 0.000 N/A 0
------------------------------------------------------
Destinations: 2, total connections: 0 of 10000 max
So it seems the watchdog works fine, although to me the latter behaviour (dropping the backend from the pool) seems more accurate when the backend is totally unreachable.
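The difference between the two cases comes down to the client's connect timeout. A small Python sketch of the same idea (not glb code; host, port, and function name are illustrative):

```python
import socket

def backend_alive(host, port, timeout=1.0):
    """Return True if a TCP connect succeeds within `timeout` seconds.

    With no timeout (the analogue of connect_timeout=0) a connect to a
    silently firewalled host can block for the OS-level TCP retry timeout,
    often more than a minute on Linux, so a naive health check appears to
    hang rather than report the backend as down.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# A port that is almost certainly closed locally fails fast (refused), so
# the check returns False immediately instead of hanging.
print(backend_alive("127.0.0.1", 1, timeout=1.0))
```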
Tomi, thanks, this makes the issue much clearer now. Looks like a bug in watchdog backend, I'll see if it can be fixed there.
The negative connection count though looks far more disturbing... It would be good if there were a way to reproduce it.
We hit this negative connection issue again today. The status looked like this:
[root@glb1 ~]# service glbd status
Router:
------------------------------------------------------
Address : weight usage map conns
10.1.4.87:3306 : 100.000 1.004 N/A -275
10.1.4.88:3306 : 50.000 1.002 N/A -464
10.1.4.86:3306 : 1.000 1.002 N/A -488
------------------------------------------------------
Destinations: 3, total connections: -1227 of 10000 max
All the backends were functioning normally, with nothing in the logs. Some sort of leakage in the watchdog or in glbd itself?
Any ideas to help debug this?
Edit 2014-01-03 22:06: Looking at the application log entries I see a lot of "SQLSTATE[08004] [1040] Too many connections" errors a few minutes before the crash, and then dozens of deadlocks with the message "SQLSTATE[HY000]: General error: 1205 Lock wait timeout exceeded; try restarting transaction". The max_connections variable was set to 500 on each of the three backends.
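One purely hypothetical way a connection counter can go negative (not confirmed to be glbd's actual bug) is a teardown path that decrements for connections the backend refused, e.g. with "Too many connections", even though they were never counted as opened. A minimal sketch of that accounting error:

```python
# Hypothetical illustration: if a refused connection is never counted as
# opened, but its teardown still decrements the counter unconditionally,
# the total drifts negative under load.
conns = 0
for accepted in [True, False, False]:  # two connections refused by the backend
    if accepted:
        conns += 1   # only accepted connections are counted as open
    conns -= 1       # buggy teardown: decrements for refused ones too
print(conns)  # -2
```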
not right away :(
On Fri, Jan 3, 2014 at 7:29 PM, TomiMikola notifications@github.com wrote:
anything ideas to help debugging with this?
— Reply to this email directly or view it on GitHub: https://github.com/codership/glb/issues/3#issuecomment-31538240
mysql.sh (line 53 in release 1.0.1) defines no 'connect_timeout', which defaults to 0 in the mysql client. With the default value the watchdog does not properly detect a backend server crash. Setting 'connect_timeout' to some reasonably short value gives the desired effect of dropping the backend from the pool.
One way to set the 'connect_timeout' parameter is to use the OTHER_OPTIONS variable (in glbd.cfg):
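A sketch of what that could look like (the exact watchdog spec syntax and the credentials shown are assumptions, not taken from this thread; check your glbd.cfg and mysql.sh for the actual format):

```shell
# glbd.cfg (sketch): pass a short connect timeout through to the mysql
# watchdog script via OTHER_OPTIONS; option spelling is an assumption.
OTHER_OPTIONS="-w exec:'mysql.sh -utest -ptestpass --connect_timeout=1'"
```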
This should be noted in the comments of the files/glbd.cfg file. An alternative approach would be to include the connect_timeout parameter in files/mysql.sh line 53, using a variable to set the timeout value.