Every few days, one of my servers issues a kernel log message:
TCP: request_sock_TCP: Possible SYN flooding on port 11211. Sending cookies. Check SNMP counters.
Processing usually continues after this, but sometimes the entire server becomes unresponsive for minutes.
In this environment we have 10 Go programs using gomemcache hitting one memcached server, and each Go program has 60 goroutines that will call through this library. So I expected a maximum of 600 connections at a time.
I have seen the SYN flooding message at the default memcached connection backlog of 1024, and also after I raised it to 4096.
From inspection of logs, packet traces, etc., I have formed the impression that some glitch in processing or the network causes timeout errors (at the default of 100ms), which then cause gomemcache to dial new connections. If each of the 60 goroutines redials after every 100ms timeout, that is 600 new connections dialed each second, per process.
If the dial attempts are not being discarded on the other end of the wire, then I think it can quickly go over the backlog limit.
I wondered if gomemcache should have a rate-limiter on dial()? I would prefer gomemcache to fail quickly rather than raising the timeout to slow it down.
Any other insight would be valued.
The only related issue I could see here is #108; interestingly, we are both running the same system.