Closed: x1unix closed this issue 8 years ago
Thanks for the report. Has this happened regularly for you or just once?
Can you reproduce reliably? Does it happen right after startup, or at some "random" time after it's been working fine for a while?
Looks like something racy in the metrics updating code we added recently. If it's a common occurrence and affecting your service availability, you can downgrade to 1.3.3 - the only major change was in metrics, so unless you rely on the new metrics gathering it won't even be noticeable.
@FZambia Not sure when I'll have time to look at this as I have a few other priorities. I'll admit I didn't run the race detector, though.
@odin3 thanks a lot for the report! Please answer the questions @banks asked.
@banks it's been a long time since I last ran Centrifugo under the race detector. Looks like there are some problems: https://gist.github.com/FZambia/261b1b15d7f059035f4f
One more: https://gist.github.com/FZambia/88bc50efec3728ea7f85 - seems very similar to the panic above.
@banks Paul, please look at the PR above - not sure I did it in the optimal way, but the races are gone. Maybe you see a better way?
@odin3 I'll make a new release later today with a fix for the race condition that most probably caused that panic. But just to be sure, we still need answers to the questions above from you - because if this happens reliably and you can easily reproduce it, our theory falls apart :)
Seems good for now. Having seen this, I think my metrics design was probably not the cleanest - I was optimising more for not having to hard-code every metrics field in many places and have many different structs all holding different counts for the same named metric. Your PR seems to please the race detector, and the traces seem to match the issue here, so it's a reasonable assumption you fixed this. We can review cleaning up the metrics further elsewhere.
@banks it happens after every API request to Centrifugo. I use the Centrifugo PHP API. I've downgraded to 1.3.3, and that was the only solution at the moment.
@odin3 it seems this is caused by your use of a 32-bit architecture. Unfortunately I can't proceed with this today because it's too late in Moscow, but this issue is the highest priority now, so I'll continue tomorrow.
@odin3 we were able to reliably reproduce that panic on 32-bit CentOS. The problem was not the race conditions we assumed initially, but a bug on some architectures (see this link). It should be fixed now: I've just made a new release (v1.4.1) that should resolve this issue. Please try it out.
@FZambia I'm using a 64-bit architecture (x86_64) on CentOS:
CPU: Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
OS: CentOS Linux release 7.2.1511 (Core) (x86_64)
@FZambia looks like v1.4.1 is working without crashes; I'll keep testing it now.
@odin3 Your original panic shows asm_386.s, which means you were using the i386 (i.e. 32-bit) binary. So even if your actual hardware is 64-bit AND your OS version is 64-bit (you can run 32-bit Linux distros on modern processors, and some do, for example on small VPSes to save memory on pointers), using the i386 binary by mistake would still have crashed due to this bug.
That's because the assembly code in sync/atomic
explicitly causes that crash for unaligned access on i386 builds.
But in general, if you are running a 64-bit OS on a 64-bit machine, you should download the amd64 builds - they will perform better, since all the low-level machinery like mutexes and atomics can take advantage of full 64-bit instructions, and the compiler can use the full width of the registers, etc.
And thanks for the report - we wouldn't have found this for some time without it, even if it was only accidental that you used the wrong binary :)!
@banks Thanks :)
OS: CentOS 7
Centrifugo version: 1.4.0
After a request to Centrifugo, the service crashes with an error:
Config.json: