JustinTulloss / zeromq.node

Node.js bindings to the zeromq library
MIT License

Crash when creating 118 sockets #376

Closed briansorahan closed 9 years ago

briansorahan commented 9 years ago

I'm using zmq 2.8.0 installed with npm and seeing an uninformative C++ crash.

I've created a gist with example code and posted the output in a comment.

I am unable to reproduce with

ronkorving commented 9 years ago

It would be very helpful if you could narrow it down a bit more. Is it the node version, the libzmq version, the OS?

briansorahan commented 9 years ago

FWIW, here is a backtrace from lldb (crash.js is the program in OP's gist):

Brians-MacBook-Air:trystero-zeromq brian$ lldb node crash.js 
Current executable set to 'node' (x86_64).
(lldb) b V8::Dispose
Breakpoint 1: where = node`v8::V8::Dispose(), address = 0x00000001001300d0
(lldb) r
Process 72362 launched: '/Users/brian/.nvm/v0.10.33/bin/node' (x86_64)
MAX_SOCKETS=1023
creating sockets
0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,libc++abi.dylib: terminating with uncaught exception of type std::runtime_error
Process 72362 stopped
* thread #1: tid = 0x3f38c1, 0x00007fff8ac4b866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
    frame #0: 0x00007fff8ac4b866 libsystem_kernel.dylib`__pthread_kill + 10
libsystem_kernel.dylib`__pthread_kill + 10:
-> 0x7fff8ac4b866:  jae    0x7fff8ac4b870            ; __pthread_kill + 20
   0x7fff8ac4b868:  movq   %rax, %rdi
   0x7fff8ac4b86b:  jmpq   0x7fff8ac48175            ; cerror_nocancel
   0x7fff8ac4b870:  ret    
(lldb) bt
* thread #1: tid = 0x3f38c1, 0x00007fff8ac4b866 libsystem_kernel.dylib`__pthread_kill + 10, queue = 'com.apple.main-thread', stop reason = signal SIGABRT
  * frame #0: 0x00007fff8ac4b866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff92e9f35c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x00007fff97de3b1a libsystem_c.dylib`abort + 125
    frame #3: 0x00007fff92a66f31 libc++abi.dylib`abort_message + 257
    frame #4: 0x00007fff92a8c952 libc++abi.dylib`default_terminate_handler() + 264
    frame #5: 0x00007fff923ff322 libobjc.A.dylib`_objc_terminate() + 124
    frame #6: 0x00007fff92a8a1d1 libc++abi.dylib`std::__terminate(void (*)()) + 8
    frame #7: 0x00007fff92a89c5b libc++abi.dylib`__cxa_throw + 124
    frame #8: 0x0000000103149d52 zmq.node`Socket(this=0x0000000100b28c40, context=<unavailable>, type=<unavailable>) + 364 at binding.cc:487
    frame #9: 0x00000001031467e1 zmq.node`zmq::Socket::New(args=0x00007fff5fbfdc28) + 391 at binding.cc:352
    frame #10: 0x00000001001567fc node`v8::internal::Builtin_HandleApiCallConstruct(v8::internal::(anonymous namespace)::BuiltinArguments<(v8::internal::BuiltinExtraArguments)1>, v8::internal::Isolate*) + 588

kkoopa commented 9 years ago

Thanks, that is useful. It seems to be crashing here because an exception is thrown. That happens because zmq_getsockopt returns -1 for some reason (http://api.zeromq.org/master:zmq-getsockopt), but I can't tell which of EINVAL, ETERM, EFAULT, or EINTR was the cause, even though that should be part of the exception message. I don't know why this happens either.

kkoopa commented 9 years ago

My initial guess is that OS X sucks here: you are hitting the maximum number of open file handles. It seems to default to 256, which is very low. Try increasing it and see if that helps.

briansorahan commented 9 years ago

It does seem to be an OS X problem, but

Brians-MacBook-Air:zeromq.node brian$ ulimit -n
256

And this program can get the file descriptor of 122 zmq sockets, but for every socket after that it reports "Socket operation on non-socket", despite the fact that ZMQ_MAX_SOCKETS reports 1024.

After doing ulimit -n 1024 the above program starts misbehaving at the 506th call to zmq_getsockopt.

This seems to just be an annoying issue with Mac, and I just found in the zmq tuning guide that they recommend doing ulimit -n 1200, however

Brians-MacBook-Air:zeromq.node brian$ sudo ulimit -n 1200
Password:
Brians-MacBook-Air:zeromq.node brian$ ulimit -n
1024

I'll close, and possibly discuss upstream.

kkoopa commented 9 years ago

Those two links I posted claim to show the necessary steps. Just doing ulimit -n won't cut it.

On OS X, the open file limits are governed by launchd and sysctl values.

launchd: Processes are started by launchd, which imposes resource constraints on any process it launches. These limits can be retrieved and set using the launchctl command (the default soft and hard values are 256 and unlimited, respectively). For OS X 10.7 and later, even though the default hard limit is "unlimited", you can't set the hard or soft limit to "unlimited" yourself.

sysctl: Operating system open files limits are set with sysctl. These limits can also impact running processes, so the launchd and sysctl open file limits should be set to the same values.
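
For illustration, a process can also inspect and raise its own soft limit at startup with the POSIX getrlimit/setrlimit calls, which is roughly what ulimit -n does for a shell. The following is a minimal sketch, not something proposed in this thread: the helper names current_fd_limit and raise_fd_limit are hypothetical, and the soft limit still cannot exceed whatever hard limit launchd and sysctl allow.

```cpp
#include <sys/resource.h>
#include <algorithm>
#include <cstdio>

// Hypothetical helper: returns the current soft limit on open file
// descriptors, or 0 if getrlimit fails.
rlim_t current_fd_limit() {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return 0;
    return rl.rlim_cur;
}

// Hypothetical helper: tries to raise the soft limit to `wanted`, capped at
// the hard limit. It never lowers the limit. Returns the resulting soft limit.
rlim_t raise_fd_limit(rlim_t wanted) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) != 0) return 0;
    rlim_t target = std::min(wanted, rl.rlim_max);
    if (target > rl.rlim_cur) {
        rl.rlim_cur = target;
        if (setrlimit(RLIMIT_NOFILE, &rl) != 0) {
            // The OS refused the new value; report it and keep the old limit.
            std::perror("setrlimit(RLIMIT_NOFILE)");
        }
    }
    return current_fd_limit();
}
```

Calling raise_fd_limit(1200) before creating any sockets would request the value mentioned in the tuning guide; comparing the return value with the requested one shows whether the request was granted.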

briansorahan commented 9 years ago

I think it might be a good idea to check that zmq_socket returned non-NULL here. I modified the 2nd gist and discovered that zmq_socket returns NULL when I hit the fd limit, and zmq_strerror(zmq_errno()) returns Too many open files. Any interest in a PR with this change?

reqshark commented 9 years ago

+1 that check sounds very reasonable

reqshark commented 9 years ago

wouldn't it be a negative integer returned or would it be non-NULL?

briansorahan commented 9 years ago

zmq 4.x says

The zmq_socket() function shall return an opaque handle to the newly created socket if successful. Otherwise, it shall return NULL and set errno to one of the values defined below.

reqshark commented 9 years ago

ya that sounds correct, since the socket is a void star you can do that

briansorahan commented 9 years ago

I'm curious why throw std::runtime_error(ErrorMessage()) was not showing Socket operation on non-socket in my terminal after the zmq_getsockopt(...ZMQ_FD) call using a NULL socket.

I would guess zmq_errno() is not returning the expected value?

briansorahan commented 9 years ago

zmq_errno returns 24, which is what it returns in my gist as well. Maybe the fact that I didn't get the strerror in my terminal is a Mac issue as well, since

#include <stdexcept>
int main() {
    throw new std::runtime_error("foo");
}

outputs

libc++abi.dylib: terminating with uncaught exception of type std::runtime_error*
Abort trap: 6

briansorahan commented 9 years ago

I'll send a PR here very soon, but do you care if I just fprintf and exit instead of throwing when zmq_socket returns NULL? I just ran the simple C++ program above on an Ubuntu 14.04 VM with libstdc++-4.8 installed through apt and it still doesn't put the string I pass to std::runtime_error in my terminal.
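
One detail about the test program: throw new std::runtime_error("foo") throws a pointer, which is why the reported type ends in *. What the runtime prints for an uncaught exception is up to the implementation, so the portable way to see the message is to throw by value and catch it explicitly. A minimal sketch, with hypothetical helper names fail_with and describe_failure:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a failing call. Note the throw by value,
// not `throw new`.
void fail_with(const std::string& message) {
    throw std::runtime_error(message);
}

// Hypothetical helper: runs the failing call and returns the text recovered
// through what(). Catching by const reference works for any std::exception.
std::string describe_failure(const std::string& message) {
    try {
        fail_with(message);
    } catch (const std::exception& e) {
        return e.what();
    }
    return "";
}
```

In a real program the catch handler could pass e.what() to fprintf(stderr, ...) before exiting, which gives the same terminal output on either platform.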

reqshark commented 9 years ago

I'd like to see this run across the test suite, so I just sent a PR to fix that.. let's wait and see what the others think

reqshark commented 9 years ago

oops, I should have added my comment on your PR; let's reference it here. https://github.com/JustinTulloss/zeromq.node/pull/377