Altonymous opened this issue 11 years ago
I will take a look at this.
I can't confirm your exact error locally, but try opening session.go in rethinkgo and changing the maxIdleConnections to 100.
(You'll have to make sure go is using this new value somehow, maybe put a fmt.Println in the Connect() function to make sure you are changing the file used by your program)
Let me know if this fixes the problem, if so, it's probably a rethinkdb bug and the workaround is to raise that variable value. I may add a SetMaxIdleConnections() function.
We'll also prioritize https://github.com/rethinkdb/rethinkdb/issues/369 on our end.
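The setter mentioned above might look something like the sketch below. This is purely illustrative: only the maxIdleConnections name matches session.go; the surrounding Session type is a stand-in, not the actual rethinkgo code.

```go
package main

import "fmt"

// Illustrative sketch only: Session here is a stand-in, not the real
// rethinkgo type; only the maxIdleConnections name matches session.go.
type Session struct {
	maxIdleConnections int
}

// SetMaxIdleConnections would let callers raise the pool's idle cap
// at runtime instead of editing session.go and rebuilding the driver.
func (s *Session) SetMaxIdleConnections(n int) {
	if n < 1 {
		n = 1 // keep at least one idle connection
	}
	s.maxIdleConnections = n
}

func main() {
	s := &Session{maxIdleConnections: 5} // starting value chosen for the example
	s.SetMaxIdleConnections(100)
	fmt.Println(s.maxIdleConnections) // 100
}
```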
Okay, so I set the value as you suggested, rebuilt, and ran this from 4 different servers...
siege -v "http://192.168.122.31:3000/machines POST" -c 100 -r 50 -b
The problem still exists, but it takes a bit longer to happen; it still occurs within the run, though. Once it starts happening, it doesn't stop until the service is restarted.
Hmm, this is one of those rare cases when I really can't tell if it's a RethinkDB bug or a Go configuration issue. We'll try to bump rethinkdb/rethinkdb#369 to see if it fixes the problem. @Altonymous -- in general, we promise that this will get resolved if it's a rethinkdb bug. It's a matter of time until issues like this are ironed out.
4 load generation servers, all hitting a single web server at the same time... Here are the commands and the graphs...
$ siege -v "http://192.168.122.31:3000/machines POST" -c 10 -r 500 -b Graph: http://d.pr/i/TdJk (No Errors)
$ siege -v "http://192.168.122.31:3000/machines POST" -c 50 -r 100 -b Graph: http://d.pr/i/uJIR (Errors Started At: ~4:36)
Error:
2013/05/01 16:35:59 Handler crashed with error write tcp 192.168.122.71:28015: broken pipe
2013/05/01 16:35:59 /usr/local/go/src/pkg/runtime/panic.c 229
2013/05/01 16:35:59 /home/jenkins/web_go_rethinkdb.go 93
2013/05/01 16:35:59 /home/jenkins/web_go_rethinkdb.go 46
2013/05/01 16:35:59 /usr/local/go/src/pkg/reflect/value.go 474
2013/05/01 16:35:59 /usr/local/go/src/pkg/reflect/value.go 345
2013/05/01 16:35:59 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 196
2013/05/01 16:35:59 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 305
2013/05/01 16:35:59 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 87
2013/05/01 16:35:59 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 82
2013/05/01 16:35:59 /usr/local/go/src/pkg/net/http/server.go 1416
2013/05/01 16:35:59 /usr/local/go/src/pkg/net/http/server.go 1517
2013/05/01 16:35:59 /usr/local/go/src/pkg/net/http/server.go 1096
2013/05/01 16:35:59 /usr/local/go/src/pkg/runtime/proc.c 1223
Hope this is helping.
Thanks for all the help tracking down this bug!
So my theory is that once the go driver exhausts its pool of connections, it then attempts to connect too quickly for rethinkdb to handle. If that's the case, then you should receive ~0 errors when your concurrency limit is the same as maxIdleConnections. If you set maxIdleConnections > your number of concurrent requests, does the problem still occur?
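One way to test this theory from the application side is to cap in-flight database work with a counting semaphore sized to the pool, so the driver never has to dial a burst of new connections once the pool is exhausted. A sketch (the constant here is illustrative, not the driver's variable):

```go
package main

import (
	"fmt"
	"sync"
)

// Cap in-flight database work at the pool size so the driver never
// dials new connections in a burst after the pool is exhausted.
// maxIdleConnections is an illustrative constant for this sketch.
const maxIdleConnections = 10

var sem = make(chan struct{}, maxIdleConnections)

// withConnection runs work while holding one of the pool's slots.
func withConnection(work func()) {
	sem <- struct{}{}        // blocks while all slots are busy
	defer func() { <-sem }() // frees the slot when work returns
	work()
}

func main() {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		count int
	)
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			withConnection(func() {
				mu.Lock()
				count++
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	fmt.Println(count) // 100: every request completes, at most 10 at a time
}
```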
I set maxIdleConnections to 20480 just to give it something insane... and it's still happening.
I should also note, I've moved to Ubuntu 12.04 server (x64) to help track this down.
Just to make sure I'm doing this right here's what I'm doing...
I go into $GOPATH/src/.../session.go and change the file.
Then I run $ go build web_go_rethinkdb.go, copy the resulting binary to the web server, and run it.
Does that sound right?
Okay, because I moved to new servers, I remembered something I did on my Mac that I didn't do on the Ubuntu servers: I didn't raise the "open files" limit. On the web server I did, but not on the rethinkdb servers. So here's some more information... (This is with maxIdleConnections still set at 20480.)
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257264
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 10240
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 257264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257264
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 257264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257264
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 10240
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 257264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ siege -v "http://192.168.122.31:3000/machines POST" -c 50 -r 100 -b No errors...
$ siege -v "http://192.168.122.31:3000/machines POST" -c 100 -r 100 -b No errors... (few connection timeouts, but no errors)
$ siege -v "http://192.168.122.31:3000/machines POST" -c 150 -r 100 -b And the errors begin again...
Again this is a single web server, with 4 load generating machines all running this command at the same time. So in essence the server is getting hit with -c 600.
Just to toss some more information into the pile...
If I only use 1 of the load generating servers, I start getting errors at -c 250.
$ siege -v "http://192.168.122.31:3000/machines POST" -c 225 -r 100 -b No errors...
$ siege -v "http://192.168.122.31:3000/machines POST" -c 250 -r 100 -b And the errors begin again...
Note: I try to narrow it down to within -c 25 requests; I'm not trying to get any more fine-grained than that right now.
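For what it's worth, the open-file limits above can also be checked from inside the Go process rather than via the shell, which removes any doubt about which ulimit actually applies to the running binary. A sketch using the Linux-style syscall package:

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit reports the process's soft and hard RLIMIT_NOFILE
// values (Linux; the Rlimit field types differ on some platforms).
func openFileLimit() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	return rl.Cur, rl.Max, nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", soft, hard)
	// syscall.Setrlimit can raise the soft limit up to the hard limit
	// at startup, instead of relying on the shell's ulimit settings.
}
```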
Yeah, you're doing it correctly. Another thing to try might be to limit the rate of connections to the server; I should make a branch that does this so you can try it out.
I'm not sure how to get a specific branch, but if you give me the go command I'll give it a shot.
I've done some testing locally, and decided that the best thing to try next would be to give you a custom version of RethinkDB with issue rethinkdb/rethinkdb#369 fixed, to see if your problem goes away. I don't have a copy of Ubuntu, but the fix just involves changing "const int RDB_LISTEN_BACKLOG = 5;" to "const int RDB_LISTEN_BACKLOG = 128;". I've asked one of the RethinkDB guys to provide a binary, so maybe that's a possibility.
For testing purposes it might be a good idea to set maxIdleConnections to the maximum number of concurrent connections you use in your tests.
Lobbying for a point release right now.
Looks like we'll be doing a point-release tomorrow containing the higher TCP accept backlog in the server, v1.4.5.
FYI, 1.4.5 has been released (http://rethinkdb.com/docs/install/). Let's see if it fixes the problem.
Okay here's what I'm seeing... and how everything is currently configured.
On the server compiling the code...
var maxIdleConnections int = 20480
On one of the rethinkdb servers in the cluster (they should all be the same, there are 3):
[rethinkdb #2261 -- limits]
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 15875 15875 processes
Max open files 64000 64000 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 15875 15875 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Load Generation Server(s) Running:
siege -v "http://192.168.122.31:4000/machines/ POST" -c 250 -r 10 -b
Error I'm seeing in the web server logs:
2013/05/06 14:23:15 Handler crashed with error dial tcp 192.168.122.71:28015: too many open files
2013/05/06 14:23:15 /usr/local/go/src/pkg/runtime/panic.c 229
2013/05/06 14:23:15 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 70
2013/05/06 14:23:15 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 23
2013/05/06 14:23:15 /var/lib/jenkins/jobs/api.6fusion.com/workspace/main.go 31
2013/05/06 14:23:15 /usr/local/go/src/pkg/reflect/value.go 474
2013/05/06 14:23:15 /usr/local/go/src/pkg/reflect/value.go 345
2013/05/06 14:23:15 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 196
2013/05/06 14:23:15 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 305
2013/05/06 14:23:15 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 87
2013/05/06 14:23:15 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 82
2013/05/06 14:23:15 /usr/local/go/src/pkg/net/http/server.go 1416
2013/05/06 14:23:15 /usr/local/go/src/pkg/net/http/server.go 1517
2013/05/06 14:23:15 /usr/local/go/src/pkg/net/http/server.go 1096
2013/05/06 14:23:15 /usr/local/go/src/pkg/runtime/proc.c 1223
Code where the error is occurring...
53 func (r *RethinkDb) insertReadings(readings []Reading) {
54 // Setup session
55 session := r.getSession()
56
57 // Setup table
58 table := rethinkgo.Table("readings")
59
60 var readingsMap []rethinkgo.Map
61 for i := 0; i < len(readings); i++ {
62 readingsMap = append(readingsMap, rethinkgo.Map{"name": readings[i].Name})
63 }
64
65 // insert the reading
66 var response rethinkgo.WriteResponse
67 err := table.Insert(readingsMap).Run(session).One(&response)
68 if err != nil {
69 fmt.Println("[ERROR] insertReadings: ", err)
70 panic(err)
71 }
72 }
Code that creates the connection:
func (r *RethinkDb) getSession() *rethinkgo.Session {
if r.DatabaseSession == nil {
url := fmt.Sprintf("%s:%s", globalConfiguration.Database.Host, globalConfiguration.Database.Port)
fmt.Println("Connecting to: " + url)
var err error
r.DatabaseSession, err = rethinkgo.Connect(url, globalConfiguration.Database.Name)
if err != nil {
fmt.Println("[ERROR] getSession: ", err)
panic(err)
}
}
return r.DatabaseSession
}
Error I'm seeing in the logs when I just restart rethinkdb (this is expected, included just for completeness):
2013/05/06 14:40:42 Handler crashed with error write tcp 192.168.122.71:28015: broken pipe
2013/05/06 14:40:42 /usr/local/go/src/pkg/runtime/panic.c 229
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 70
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 23
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/main.go 31
2013/05/06 14:40:42 /usr/local/go/src/pkg/reflect/value.go 474
2013/05/06 14:40:42 /usr/local/go/src/pkg/reflect/value.go 345
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 196
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 305
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 87
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 82
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1416
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1517
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1096
2013/05/06 14:40:42 /usr/local/go/src/pkg/runtime/proc.c 1223
2013/05/06 14:40:42 /usr/local/go/src/pkg/runtime/panic.c 229
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 70
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 23
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/main.go 31
2013/05/06 14:40:42 /usr/local/go/src/pkg/reflect/value.go 474
2013/05/06 14:40:42 /usr/local/go/src/pkg/reflect/value.go 345
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 196
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 305
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 87
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 82
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1416
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1517
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1096
2013/05/06 14:40:42 /usr/local/go/src/pkg/runtime/proc.c 1223
Hmm, it looks like the error you're getting this time is different: too many open files. This doesn't surprise me -- RethinkDB may create multiple open file descriptors per connection for internal communication. Your maxIdleConnections is set to 20480, and max open files is set to 64000, which is only ~3x larger. Could you limit the total connections created to ~5,000 and see what happens? (Or alternatively, increase the max number of open files.)
Connect() is a bit of a misnomer, it creates a session, which has a connection pool. The actual connection is created by serverConnect(). The pool behavior is similar to the go sql package.
If this problem persists for you, the solution may be for me to remove the connection pool, so that you can do your own connection pooling.
I added connection pooling of my own. I need to rerun the tests to see if it has any effect. You should be able to leave your connection pooling in place as long as mine is smaller than yours, correct? Right now I've set mine to 150 connections, but I haven't run a thorough test yet.
If you are calling .Connect() once for each connection that you want and only sending one request at a time over the provided session, then yeah, it should be the same thing. Your getSession() function, though, seems to just reuse a single session.
I didn't implement connection pooling until after I ran the tests. I didn't want to introduce anything new that might affect the test results.
Here's my repository.
You should be able to see the connection pooling implementation there.
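The repository itself isn't quoted here, but a minimal channel-based pool along the lines described might look like the sketch below (Conn is a hypothetical stand-in for whatever the driver returns, and the dial function is injected to keep the sketch driver-agnostic):

```go
package main

import "fmt"

// Conn is a stand-in for a driver connection.
type Conn struct{ id int }

// Pool hands out connections through a buffered channel, so the total
// number of connections can never exceed the channel's capacity.
type Pool struct {
	conns chan *Conn
}

// NewPool pre-dials n connections using the injected dial function.
func NewPool(n int, dial func(int) (*Conn, error)) (*Pool, error) {
	p := &Pool{conns: make(chan *Conn, n)}
	for i := 0; i < n; i++ {
		c, err := dial(i)
		if err != nil {
			return nil, err
		}
		p.conns <- c
	}
	return p, nil
}

// Acquire blocks until a connection is free.
func (p *Pool) Acquire() *Conn { return <-p.conns }

// Release returns a connection to the pool.
func (p *Pool) Release(c *Conn) { p.conns <- c }

func main() {
	pool, err := NewPool(3, func(i int) (*Conn, error) {
		return &Conn{id: i}, nil
	})
	if err != nil {
		panic(err)
	}
	c := pool.Acquire()
	fmt.Println(len(pool.conns)) // 2 while one connection is checked out
	pool.Release(c)
	fmt.Println(len(pool.conns)) // 3 after release
}
```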
Were you able to fix the "too many open files" error?
I'm going to put a SetMaxIdleConnections function in and mark this issue as closed since there's been no update. If the connection pooling causes any problems in the future, I may just remove it entirely.
Sounds like it's still an issue. I'll investigate the connection pooling to see if that could be related.
Take a look at my connection pooling before you go down that route. With my connection pooling implementation there couldn't have been that many open connections. I limited it to ~3k as you requested.
Okay, I can't reproduce it locally inside a VM. It looks from your stacktrace like you are getting a "broken pipe" error on this line: defer dbPool.ReleaseConnection(session)
Is that the case?
Yeah that's what it looks like to me too. I branched this code before I ripped out all the rethinkdb stuff, so I haven't tested it since then. I no longer have the rethinkdb server setup either.
I've removed connection pooling for now. One of the rethinkdb guys is supposed to run some network tests to see if he can reproduce this error.
Just to add some more data here: I'm getting a similar issue. My Go test application hits about 1,600 req/s and RethinkDB hits 2.2k reads/s before the Go app stops and reports the error below. I've replicated this several (15+) times, and the req/s and reads/s figures are typically within +/- 5% before it crashes.
there was an error establishing a connection to the database:
dial tcp 127.0.0.1:28015: can't assign requested address
Note that during this testing, the RethinkDB web interface disconnects, but it comes back up after about 5-10s.
I'm doing my own connection pooling as well. I have a feeling some of these issues might be related to OS X and its relatively low defaults for sockets/TCP connections, however, so I'm in the process of ruling that out and will update this thread once I've done so.
Update: I'm seeing similar issues on Ubuntu 12.04.2 server. Higher levels of concurrent connections (i.e. wrk -t 8 -c 200) see the issue come in faster.
It definitely "feels" like a socket/timeout issue (running the tests closer together causes the connection failure faster) but I don't have any hard data to support that yet.
That error message seems suspiciously like it's out of ports. Make sure the number of file descriptors is sufficient, say 2x the number of connections, and that you have enough ephemeral ports:
https://cs.uwaterloo.ca/~brecht/servers/openfiles.html http://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux
@christopherhesse Looks like you're right (and I'd suspected that). Some tuning in my Ubuntu VM (more ephemeral ports & longer socket queues) seems to have resolved this, and similarly so under OS X (particularly ephemeral ports).
Thanks for the links, too: I'd done some searching but hadn't found anything as good as those two.
I'm not sure if this is an issue with my code or an underlying problem.
I hopped on irc and jumped into the #rethinkdb channel. They recommended I start with giving you a shout.
Here's a gist of what I've got going on.
https://gist.github.com/Altonymous/f53f284ef1cecae7d435