Altonymous opened this issue 11 years ago
I will take a look at this.
I can't confirm your exact error locally, but try opening session.go in rethinkgo and changing the maxIdleConnections to 100.
(You'll have to make sure go is using this new value somehow, maybe put a fmt.Println in the Connect() function to make sure you are changing the file used by your program)
Let me know if this fixes the problem, if so, it's probably a rethinkdb bug and the workaround is to raise that variable value. I may add a SetMaxIdleConnections() function.
We'll also prioritize https://github.com/rethinkdb/rethinkdb/issues/369 on our end.
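The setter mentioned above might look something like the sketch below. This is purely illustrative: only the maxIdleConnections name matches session.go; the surrounding Session type is a stand-in, not the actual rethinkgo code.

```go
package main

import "fmt"

// Illustrative sketch only: Session here is a stand-in, not the real
// rethinkgo type; only the maxIdleConnections name matches session.go.
type Session struct {
	maxIdleConnections int
}

// SetMaxIdleConnections would let callers raise the pool's idle cap
// at runtime instead of editing session.go and rebuilding the driver.
func (s *Session) SetMaxIdleConnections(n int) {
	if n < 1 {
		n = 1 // keep at least one idle connection
	}
	s.maxIdleConnections = n
}

func main() {
	s := &Session{maxIdleConnections: 5} // starting value chosen for the example
	s.SetMaxIdleConnections(100)
	fmt.Println(s.maxIdleConnections) // 100
}
```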
Okay, so I set the value as you suggested, rebuilt, and ran this from 4 different servers...
siege -v "http://192.168.122.31:3000/machines POST" -c 100 -r 50 -b
The problem still exists, but it takes a bit longer to happen; it still occurs within the run, though. Once it starts happening, it doesn't stop until the service is restarted.
Hmm, this is one of those rare cases when I really can't tell if it's a RethinkDB bug or a Go configuration issue. We'll try to bump rethinkdb/rethinkdb#369 to see if it fixes the problem. @Altonymous -- in general, we promise that this will get resolved if it's a rethinkdb bug. It's a matter of time until issues like this are ironed out.
4 load generation servers, all hitting a single web server at the same time... Here are the commands and the graphs...
$ siege -v "http://192.168.122.31:3000/machines POST" -c 10 -r 500 -b Graph: http://d.pr/i/TdJk (No Errors)
$ siege -v "http://192.168.122.31:3000/machines POST" -c 50 -r 100 -b Graph: http://d.pr/i/uJIR (Errors Started At: ~4:36)
Error:
2013/05/01 16:35:59 Handler crashed with error write tcp 192.168.122.71:28015: broken pipe
2013/05/01 16:35:59 /usr/local/go/src/pkg/runtime/panic.c 229
2013/05/01 16:35:59 /home/jenkins/web_go_rethinkdb.go 93
2013/05/01 16:35:59 /home/jenkins/web_go_rethinkdb.go 46
2013/05/01 16:35:59 /usr/local/go/src/pkg/reflect/value.go 474
2013/05/01 16:35:59 /usr/local/go/src/pkg/reflect/value.go 345
2013/05/01 16:35:59 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 196
2013/05/01 16:35:59 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 305
2013/05/01 16:35:59 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 87
2013/05/01 16:35:59 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 82
2013/05/01 16:35:59 /usr/local/go/src/pkg/net/http/server.go 1416
2013/05/01 16:35:59 /usr/local/go/src/pkg/net/http/server.go 1517
2013/05/01 16:35:59 /usr/local/go/src/pkg/net/http/server.go 1096
2013/05/01 16:35:59 /usr/local/go/src/pkg/runtime/proc.c 1223
Hope this is helping.
Thanks for all the help tracking down this bug!
So my theory is that once the go driver exhausts its pool of connections, it then attempts to connect too quickly for rethinkdb to handle. If that's the case, then you should receive ~0 errors when your concurrency limit is the same as maxIdleConnections. If you set maxIdleConnections > your number of concurrent requests, does the problem still occur?
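One way to test this theory from the application side is to cap in-flight database work with a counting semaphore sized to the pool, so the driver never has to dial a burst of new connections once the pool is exhausted. A sketch (the constant here is illustrative, not the driver's variable):

```go
package main

import (
	"fmt"
	"sync"
)

// Cap in-flight database work at the pool size so the driver never
// dials new connections in a burst after the pool is exhausted.
// maxIdleConnections is an illustrative constant for this sketch.
const maxIdleConnections = 10

var sem = make(chan struct{}, maxIdleConnections)

// withConnection runs work while holding one of the pool's slots.
func withConnection(work func()) {
	sem <- struct{}{}        // blocks while all slots are busy
	defer func() { <-sem }() // frees the slot when work returns
	work()
}

func main() {
	var (
		wg    sync.WaitGroup
		mu    sync.Mutex
		count int
	)
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			withConnection(func() {
				mu.Lock()
				count++
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	fmt.Println(count) // 100: every request completes, at most 10 at a time
}
```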
I set maxIdleConnections to 20480 just to give it something insane... and it's still happening.
I should also note, I've moved to Ubuntu 12.04 server (x64) to help track this down.
Just to make sure I'm doing this right here's what I'm doing...
I go into $GOPATH/src/.../session.go and change the file.
Then I run $ go build web_go_rethinkdb.go, copy the resulting binary to the web server, and run it.
Does that sound right?
Okay, because I moved to new servers, I remembered something I did on my Mac that I didn't do on the Ubuntu servers: I didn't raise the "open files" limit. On the web server I did, but not on the rethinkdb servers. So here's some more information... (This is with maxIdleConnections still set at 20480.)
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257264
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 10240
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 257264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257264
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 257264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 257264
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) unlimited
open files (-n) 10240
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 257264
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited
$ siege -v "http://192.168.122.31:3000/machines POST" -c 50 -r 100 -b No errors...
$ siege -v "http://192.168.122.31:3000/machines POST" -c 100 -r 100 -b No errors... (few connection timeouts, but no errors)
$ siege -v "http://192.168.122.31:3000/machines POST" -c 150 -r 100 -b And the errors begin again...
Again this is a single web server, with 4 load generating machines all running this command at the same time. So in essence the server is getting hit with -c 600.
Just to toss some more information into the pile...
If I only use 1 of the load generating servers, I start getting errors at -c 250.
$ siege -v "http://192.168.122.31:3000/machines POST" -c 225 -r 100 -b No errors...
$ siege -v "http://192.168.122.31:3000/machines POST" -c 250 -r 100 -b And the errors begin again...
Note: I try to narrow it down to within -c 25 requests; I'm not trying to get any more fine-grained than that right now.
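For what it's worth, the open-file limits above can also be checked from inside the Go process rather than via the shell, which removes any doubt about which ulimit actually applies to the running binary. A sketch using the Linux-style syscall package:

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit reports the process's soft and hard RLIMIT_NOFILE
// values (Linux; the Rlimit field types differ on some platforms).
func openFileLimit() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	return rl.Cur, rl.Max, nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		panic(err)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", soft, hard)
	// syscall.Setrlimit can raise the soft limit up to the hard limit
	// at startup, instead of relying on the shell's ulimit settings.
}
```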
Yeah, you're doing it correctly. Another thing to try might be to limit the rate of connections to the server; I should make a branch that does this so you can try it out.
I'm not sure how to get a specific branch, but if you give me the go command I'll give it a shot.
I've done some testing locally, and decided that the best thing to try next would be to give you a custom version of RethinkDB with issue rethinkdb/rethinkdb#369 fixed, to see if your problem goes away. I don't have a copy of Ubuntu, but the fix just involves changing "const int RDB_LISTEN_BACKLOG = 5;" to "const int RDB_LISTEN_BACKLOG = 128;". I've asked one of the RethinkDB guys to provide a binary, so maybe that's a possibility.
For testing purposes it might be a good idea to set maxIdleConnections to the maximum number of concurrent connections you use in your tests.
Lobbying for a point release right now.
Looks like we'll be doing a point-release tomorrow containing the higher TCP accept backlog in the server, v1.4.5.
FYI, 1.4.5 has been released (http://rethinkdb.com/docs/install/). Let's see if it fixes the problem.
Okay here's what I'm seeing... and how everything is currently configured.
On the server compiling the code...
var maxIdleConnections int = 20480
On one of the rethinkdb servers in the cluster (they should all be the same, there are 3):
[rethinkdb #2261 -- limits]
Limit Soft Limit Hard Limit Units
Max cpu time unlimited unlimited seconds
Max file size unlimited unlimited bytes
Max data size unlimited unlimited bytes
Max stack size 8388608 unlimited bytes
Max core file size 0 unlimited bytes
Max resident set unlimited unlimited bytes
Max processes 15875 15875 processes
Max open files 64000 64000 files
Max locked memory 65536 65536 bytes
Max address space unlimited unlimited bytes
Max file locks unlimited unlimited locks
Max pending signals 15875 15875 signals
Max msgqueue size 819200 819200 bytes
Max nice priority 0 0
Max realtime priority 0 0
Max realtime timeout unlimited unlimited us
Load Generation Server(s) Running:
siege -v "http://192.168.122.31:4000/machines/ POST" -c 250 -r 10 -b
Error I'm seeing in the web server logs:
2013/05/06 14:23:15 Handler crashed with error dial tcp 192.168.122.71:28015: too many open files
2013/05/06 14:23:15 /usr/local/go/src/pkg/runtime/panic.c 229
2013/05/06 14:23:15 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 70
2013/05/06 14:23:15 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 23
2013/05/06 14:23:15 /var/lib/jenkins/jobs/api.6fusion.com/workspace/main.go 31
2013/05/06 14:23:15 /usr/local/go/src/pkg/reflect/value.go 474
2013/05/06 14:23:15 /usr/local/go/src/pkg/reflect/value.go 345
2013/05/06 14:23:15 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 196
2013/05/06 14:23:15 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 305
2013/05/06 14:23:15 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 87
2013/05/06 14:23:15 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 82
2013/05/06 14:23:15 /usr/local/go/src/pkg/net/http/server.go 1416
2013/05/06 14:23:15 /usr/local/go/src/pkg/net/http/server.go 1517
2013/05/06 14:23:15 /usr/local/go/src/pkg/net/http/server.go 1096
2013/05/06 14:23:15 /usr/local/go/src/pkg/runtime/proc.c 1223
Code where the error is occurring...
53 func (r *RethinkDb) insertReadings(readings []Reading) {
54 // Setup session
55 session := r.getSession()
56
57 // Setup table
58 table := rethinkgo.Table("readings")
59
60 var readingsMap []rethinkgo.Map
61 for i := 0; i < len(readings); i++ {
62 readingsMap = append(readingsMap, rethinkgo.Map{"name": readings[i].Name})
63 }
64
65 // insert the reading
66 var response rethinkgo.WriteResponse
67 err := table.Insert(readingsMap).Run(session).One(&response)
68 if err != nil {
69 fmt.Println("[ERROR] insertReadings: ", err)
70 panic(err)
71 }
72 }
Code that creates the connection:
func (r *RethinkDb) getSession() *rethinkgo.Session {
if r.DatabaseSession == nil {
url := fmt.Sprintf("%s:%s", globalConfiguration.Database.Host, globalConfiguration.Database.Port)
fmt.Println("Connecting to: " + url)
var err error
r.DatabaseSession, err = rethinkgo.Connect(url, globalConfiguration.Database.Name)
if err != nil {
fmt.Println("[ERROR] getSession: ", err)
panic(err)
}
}
return r.DatabaseSession
}
Error I'm seeing in the logs when I just restart rethinkdb (this is expected, included just for completeness):
2013/05/06 14:40:42 Handler crashed with error write tcp 192.168.122.71:28015: broken pipe
2013/05/06 14:40:42 /usr/local/go/src/pkg/runtime/panic.c 229
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 70
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 23
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/main.go 31
2013/05/06 14:40:42 /usr/local/go/src/pkg/reflect/value.go 474
2013/05/06 14:40:42 /usr/local/go/src/pkg/reflect/value.go 345
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 196
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 305
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 87
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 82
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1416
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1517
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1096
2013/05/06 14:40:42 /usr/local/go/src/pkg/runtime/proc.c 1223
2013/05/06 14:40:42 /usr/local/go/src/pkg/runtime/panic.c 229
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 70
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/rethinkdb.go 23
2013/05/06 14:40:42 /var/lib/jenkins/jobs/api.6fusion.com/workspace/main.go 31
2013/05/06 14:40:42 /usr/local/go/src/pkg/reflect/value.go 474
2013/05/06 14:40:42 /usr/local/go/src/pkg/reflect/value.go 345
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 196
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 305
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 87
2013/05/06 14:40:42 /home/jenkins/.mygo/src/github.com/hoisie/web/server.go 82
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1416
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1517
2013/05/06 14:40:42 /usr/local/go/src/pkg/net/http/server.go 1096
2013/05/06 14:40:42 /usr/local/go/src/pkg/runtime/proc.c 1223
Hmm, it looks like the error you're getting this time is different: too many open files. This doesn't surprise me -- RethinkDB may create multiple open file descriptors per connection for internal communication. Your maxIdleConnections is set to 20480, and max open files is set to 64000, which is only ~3x larger. Could you limit the total connections created to ~5,000 and see what happens? (Or alternatively, increase the max number of open files.)
Connect() is a bit of a misnomer, it creates a session, which has a connection pool. The actual connection is created by serverConnect(). The pool behavior is similar to the go sql package.
If this problem persists for you, the solution may be for me to remove the connection pool, so that you can do your own connection pooling.
I added connection pooling of my own. I need to rerun the tests to see if it has any effect. You should be able to leave your connection pooling in place as long as mine is smaller than yours, correct? Right now I've set mine to 150 connections, but I haven't run a thorough test yet.
If you are calling .Connect() once for each connection that you want and only sending one request at a time over the provided session, then yeah, it should be the same thing. Your getSession() function, though, seems to just reuse a single session.
I didn't implement connection pooling until after I ran the tests. I didn't want to introduce anything new that might affect the test results.
Here's my repository.
You should be able to see the connection pooling implementation there.
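The repository itself isn't quoted here, but a minimal channel-based pool along the lines described might look like the sketch below (Conn is a hypothetical stand-in for whatever the driver returns, and the dial function is injected to keep the sketch driver-agnostic):

```go
package main

import "fmt"

// Conn is a stand-in for a driver connection.
type Conn struct{ id int }

// Pool hands out connections through a buffered channel, so the total
// number of connections can never exceed the channel's capacity.
type Pool struct {
	conns chan *Conn
}

// NewPool pre-dials n connections using the injected dial function.
func NewPool(n int, dial func(int) (*Conn, error)) (*Pool, error) {
	p := &Pool{conns: make(chan *Conn, n)}
	for i := 0; i < n; i++ {
		c, err := dial(i)
		if err != nil {
			return nil, err
		}
		p.conns <- c
	}
	return p, nil
}

// Acquire blocks until a connection is free.
func (p *Pool) Acquire() *Conn { return <-p.conns }

// Release returns a connection to the pool.
func (p *Pool) Release(c *Conn) { p.conns <- c }

func main() {
	pool, err := NewPool(3, func(i int) (*Conn, error) {
		return &Conn{id: i}, nil
	})
	if err != nil {
		panic(err)
	}
	c := pool.Acquire()
	fmt.Println(len(pool.conns)) // 2 while one connection is checked out
	pool.Release(c)
	fmt.Println(len(pool.conns)) // 3 after release
}
```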
Were you able to fix the "too many open files" error?
I'm going to put a SetMaxIdleConnections function in and mark this issue as closed since there's been no update. If the connection pooling causes any problems in the future, I may just remove it entirely.
Sounds like it's still an issue. I'll investigate the connection pooling to see if that could be related.
Take a look at my connection pooling before you go down that route. With my connection pooling implementation there couldn't have been that many open connections. I limited it to ~3k as you requested.
Okay, I can't reproduce it locally inside a VM. It looks from your stacktrace like you are getting a "broken pipe" error on this line: defer dbPool.ReleaseConnection(session)
Is that the case?
Yeah that's what it looks like to me too. I branched this code before I ripped out all the rethinkdb stuff, so I haven't tested it since then. I no longer have the rethinkdb server setup either.
I've removed connection pooling for now. One of the rethinkdb guys is supposed to run some network tests to see if he can reproduce this error.
Just to add some more data here: I'm getting a similar issue. My Go test application hits about 1,600 req/s and RethinkDB hits 2.2k reads/s before the Go app stops and reports the error below. I've replicated this several (15+) times, and the req/s and reads/s figures are typically within +/- 5% before it crashes.
there was an error establishing a connection to the database:
dial tcp 127.0.0.1:28015: can't assign requested address
Note that during this testing, the RethinkDB web interface disconnects, but it comes back up after about 5-10s.
I'm doing my own connection pooling as well. I have a feeling some of these issues might be related to OS X and its relatively low defaults for sockets/TCP connections, however, so I'm in the process of ruling that out and will update this thread once I've done so.
Update: I'm seeing similar issues on Ubuntu 12.04.2 server. Higher levels of concurrent connections (i.e. wrk -t 8 -c 200) see the issue come in faster.
It definitely "feels" like a socket/timeout issue (running the tests closer together causes the connection failure faster) but I don't have any hard data to support that yet.
That error message seems suspiciously like it's out of ports. Make sure the number of file descriptors is sufficient, say 2x the number of connections, and that you have enough ephemeral ports:
https://cs.uwaterloo.ca/~brecht/servers/openfiles.html http://stackoverflow.com/questions/410616/increasing-the-maximum-number-of-tcp-ip-connections-in-linux
@christopherhesse Looks like you're right (and I'd suspected that). Some tuning in my Ubuntu VM (more ephemeral ports & longer socket queues) seems to have resolved this, and similarly so under OS X (particularly ephemeral ports).
Thanks for the links, too: I'd done some searching but hadn't found anything as good as those two.
I'm not sure if this is an issue with my code or an underlying problem.
I hopped on irc and jumped into the #rethinkdb channel. They recommended I start with giving you a shout.
Here's a gist of what I've got going on.
https://gist.github.com/Altonymous/f53f284ef1cecae7d435