centrifugal / centrifugo

Scalable real-time messaging server in a language-agnostic way. Self-hosted alternative to Pubnub, Pusher, Ably. Set up once and forever.
https://centrifugal.dev
Apache License 2.0
8.38k stars 594 forks source link

http: panic serving [::1]:57351: runtime error: invalid memory address or nil pointer dereference #74

Closed x1unix closed 8 years ago

x1unix commented 8 years ago

OS: Centos 7

Centrufugo version: 1.4.0

After request to centrifugo, service crashes with error:

http: panic serving [::1]:57351: runtime error: invalid memory address or nil pointer dereference
goroutine 18 [running]:
net/http.(*conn).serve.func1(0x18af8240, 0xf764a130, 0x18aec1d8)
    /Users/fz/projects/go15/src/net/http/server.go:1287 +0xa2
sync/atomic.AddUint64(0x18a8c454, 0x1, 0x0, 0x806559a, 0x18aeabd0)
    /Users/fz/projects/go15/src/sync/atomic/asm_386.s:112 +0xc
github.com/centrifugal/centrifugo/libcentrifugo.(*metricCounter).Add(0x18a8c454, 0x1, 0x0, 0x9, 0x18aeabc0)
    /private/var/www/different/go/gopath/src/github.com/centrifugal/centrifugo/libcentrifugo/metrics.go:129 +0x89
github.com/centrifugal/centrifugo/libcentrifugo.(*metricCounter).Inc(0x18a8c454, 0x18aeabb0, 0xf778a25c)
    /private/var/www/different/go/gopath/src/github.com/centrifugal/centrifugo/libcentrifugo/metrics.go:118 +0x31
github.com/centrifugal/centrifugo/libcentrifugo.(*Application).APIHandler(0x18a14900, 0xf764a218, 0x18b56100, 0x18afa310)
    /private/var/www/different/go/gopath/src/github.com/centrifugal/centrifugo/libcentrifugo/handlers.go:318 +0x43
github.com/centrifugal/centrifugo/libcentrifugo.(*Application).APIHandler-fm(0xf764a218, 0x18b56100, 0x18afa310)
    /private/var/www/different/go/gopath/src/github.com/centrifugal/centrifugo/libcentrifugo/handlers.go:103 +0x38
net/http.HandlerFunc.ServeHTTP(0x18b171c8, 0xf764a218, 0x18b56100, 0x18afa310)
    /Users/fz/projects/go15/src/net/http/server.go:1422 +0x34
github.com/centrifugal/centrifugo/libcentrifugo.(*Application).WrapShutdown.func1(0xf764a218, 0x18b56100, 0x18afa310)
    /private/var/www/different/go/gopath/src/github.com/centrifugal/centrifugo/libcentrifugo/handlers.go:470 +0x9d
net/http.HandlerFunc.ServeHTTP(0x18b0b190, 0xf764a218, 0x18b56100, 0x18afa310)
    /Users/fz/projects/go15/src/net/http/server.go:1422 +0x34
github.com/centrifugal/centrifugo/libcentrifugo.(*Application).Logged.func1(0xf764a218, 0x18b56100, 0x18afa310)
    /private/var/www/different/go/gopath/src/github.com/centrifugal/centrifugo/libcentrifugo/handlers.go:486 +0x105
net/http.HandlerFunc.ServeHTTP(0x18b0b1a0, 0xf764a218, 0x18b56100, 0x18afa310)
    /Users/fz/projects/go15/src/net/http/server.go:1422 +0x34
net/http.(*ServeMux).ServeHTTP(0x18b184a0, 0xf764a218, 0x18b56100, 0x18afa310)
    /Users/fz/projects/go15/src/net/http/server.go:1699 +0x133
net/http.serverHandler.ServeHTTP(0x18b58540, 0xf764a218, 0x18b56100, 0x18afa310)
    /Users/fz/projects/go15/src/net/http/server.go:1862 +0x156
net/http.(*conn).serve(0x18af8240)
    /Users/fz/projects/go15/src/net/http/server.go:1361 +0xc05
created by net/http.(*Server).Serve
    /Users/fz/projects/go15/src/net/http/server.go:1910 +0x343
^C[I]: 2016/02/25 12:50:06 Signal received: interrupt
[I]: 2016/02/25 12:50:06 Shutting down

Config.json:

{
"secret": "secret",
      "publish": false,
      "watch": true,
      "presence": true,
      "join_leave": false,
  "insecure_api" : true,
  "history_size": 10,
  "history_lifetime": 240,
  "log_level": "debug"
}
banks commented 8 years ago

Thanks for the report. Has this happened regularly for you or just once?

Can you reproduce reliably? Does it happen right after startup or at some "random" time after it's be working fine for a while?

Looks like something racey in the metrics updating which we added recently. If it's a common occurrence and affecting your service availability, you can downgrade to 1.3.3 - the only major change was in metrics so unless you are relying on new metrics gathering it won't even be noticeable.

@FZambia Not sure when I'll have time to look at this as I have a few other priorities. I'll admit I didn't run race detector though.

FZambia commented 8 years ago

@odin3 thanks a lot for the report! Please, give us answers for questions @banks asked.

@banks it was long time ago I last ran Centrifugo under race detector. Looks like there are some problems: https://gist.github.com/FZambia/261b1b15d7f059035f4f

FZambia commented 8 years ago

One more: https://gist.github.com/FZambia/88bc50efec3728ea7f85 seems very similar to panic above

FZambia commented 8 years ago

@banks Paul, please look at pr above - not sure I did it in optimal way but races gone away. Maybe you see a better way?

FZambia commented 8 years ago

@odin3 I'll make new release with fix of race condition that most probably caused that panic later today. But just to be more sure we still need some answers on questions above from you - because if this happens reliably and you can easily reproduce this - our theory ruins:)

banks commented 8 years ago

Seems good for now. Having seen this I think my metrics design was probably not the cleanest - I was optimising more for not having to hard code every metrics field in many places and have many different stir ts all holding different counts for same named metric. You're PR seems to please race detector and the traces seem to match the issue here so it's a reasonable assumption you fixed this. We can review cleansing up the metros further elsewhere.

x1unix commented 8 years ago

@banks it happens after every API request to centrifugo. I use Centrifugo PHP api. I've downgraded to 1.3.3 and that's was a single solution at the moment.

FZambia commented 8 years ago

@odin3 seems this caused because you are using 32-bit architecture. Unfortunately I can't proceed with this today because it's too late in Moscow, but this issue is the highest priority now - so will continue tomorrow.

FZambia commented 8 years ago

@odin3 we were able to reliably reproduce that panic on 32-bit Centos. The problem was not in race conditions we assumed initially but in bug on some architectures (see this link). Now it should be fixed, just made new release (v1.4.1) that must resolve this issue. Please, try it out.

x1unix commented 8 years ago

@FZambia I'm using 64-bit architecture (x86_64) on CentOS: CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz OS: CentOS Linux release 7.2.1511 (Core) (x86_64)

x1unix commented 8 years ago

@FZambia looks like v1.4.1 is working without crash, I'll test it now.

banks commented 8 years ago

@odin3 Your original panic shows asm_386.s which shows you were using the i386 (i.e. 32 bit) binary. So even if your actual hardware is 64 bit AND your OS version is 64 bit (you can run 32 bit linux distros on modern processors and some do for example on small VPS to save memory on pointers), if you were using the i386 binary by mistake it still would have crashed due to this bug.

That's because the assembly code in sync/atomic explicitly causes that crash for unaligned access on i386 builds.

But in general, if you are running 64 bit OS on 64 bit machine you should be downloading the amd64 builds - they will perform better as all the low-level stuff like mutexes and atomics, can take advantage of full 64 bit instructions, as well as the compiler being able to use all the full width of registers etc.

banks commented 8 years ago

And thanks for the report - we wouldn't have found that for some time without it even if it was only accidental that you used wrong binary :)!

x1unix commented 8 years ago

@banks Thanks :)