chuckremes / ffi-rzmq

FFI bindings for ZeroMQ so the library can be used under JRuby and other FFI-compliant ruby runtimes
241 stars 62 forks source link

Memory access offset=0? #17

Closed gaiottino closed 13 years ago

gaiottino commented 13 years ago

Hi,

Trying to run some zmq code on an Ubuntu Jaunty server running on Amazon EC2. I've got zeromq-2.0.10 installed along with ruby 1.9.2p0. After a while, 2-3 minutes I get

/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-rzmq-0.6.0/lib/ffi-rzmq/poll_items.rb:26:in +': Memory access offset=0 size=1 is out of bounds (IndexError) from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-rzmq-0.6.0/lib/ffi-rzmq/poll_items.rb:26:inget' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-rzmq-0.6.0/lib/ffi-rzmq/poll.rb:65:in register' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-rzmq-0.6.0/lib/ffi-rzmq/poll.rb:113:inregister_writable' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/zmqmachine-0.3.2/lib/zm/reactor.rb:281:in register_writable' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/dripdrop-0.5.0/lib/dripdrop/handlers/zeromq.rb:84:insend_message' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/dripdrop-0.5.0/lib/dripdrop/handlers/zeromq.rb:153:in send_message' from tail_apache_zmq.rb:19:inblock in receive_data' from tail_apache_zmq.rb:17:in each' from tail_apache_zmq.rb:17:inreceive_data' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/eventmachine-tail-0.5.20101204110840/lib/em/filetail.rb:256:in read' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/eventmachine-tail-0.5.20101204110840/lib/em/filetail.rb:238:inblock in schedule_next_read' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/eventmachine-0.12.10/lib/eventmachine.rb:256:in call' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/eventmachine-0.12.10/lib/eventmachine.rb:256:inrun_machine' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/eventmachine-0.12.10/lib/eventmachine.rb:256:in run' from /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/dripdrop-0.5.0/lib/dripdrop/node.rb:33:inblock in start'

Haven't been able to find any information on how to fix this.

Regards,

gaiottino commented 13 years ago

Just now I got another segmentation fault I hadn't seen earlier. I've only seen it once whereas the one above makes a regular appearance after 2-3 mins.

/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-rzmq-0.6.0/lib/ffi-rzmq/poll.rb:34: [BUG] Segmentation fault
ruby 1.9.2p0 (2010-08-18 revision 29036) [i686-linux]

-- control frame ----------
c:0007 p:---- s:0023 b:0023 l:000022 d:000022 CFUNC  :zmq_poll
c:0006 p:0071 s:0017 b:0017 l:000016 d:000016 METHOD /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-rzmq-0.6.0/lib/ffi-rzmq/poll.rb:34
c:0005 p:0016 s:0012 b:0012 l:000011 d:000011 METHOD /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/zmqmachine-0.3.2/lib/zm/reactor.rb:382
c:0004 p:0031 s:0009 b:0009 l:000008 d:000008 METHOD /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/zmqmachine-0.3.2/lib/zm/reactor.rb:351
c:0003 p:0035 s:0006 b:0006 l:000bfc d:000005 BLOCK  /home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/zmqmachine-0.3.2/lib/zm/reactor.rb:84
c:0002 p:---- s:0004 b:0004 l:000003 d:000003 FINISH
c:0001 p:---- s:0002 b:0002 l:000001 d:000001 TOP   
---------------------------
-- Ruby level backtrace information ----------------------------------------
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/zmqmachine-0.3.2/lib/zm/reactor.rb:84:in `block in run'
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/zmqmachine-0.3.2/lib/zm/reactor.rb:351:in `run_once'
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/zmqmachine-0.3.2/lib/zm/reactor.rb:382:in `poll'
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-rzmq-0.6.0/lib/ffi-rzmq/poll.rb:34:in `poll'
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-rzmq-0.6.0/lib/ffi-rzmq/poll.rb:34:in `zmq_poll'

-- C level backtrace information -------------------------------------------
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9(rb_vm_bugreport+0x79) [0xe62ae9]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9 [0xd2c4a1]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9(rb_bug+0x3a) [0xd2c55a]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9 [0xdea82d]
[0x8e9440]
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-1.0.1/lib/ffi_c.so(ffi_call_SYSV+0x17) [0x3d1643]
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-1.0.1/lib/ffi_c.so(ffi_call+0xd0) [0x3d1262]
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-1.0.1/lib/ffi_c.so [0x3bd7c7]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9(rb_thread_blocking_region+0xeb) [0xe6a08b]
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-1.0.1/lib/ffi_c.so(rbffi_CallFunction+0x159) [0x3be2a9]
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-1.0.1/lib/ffi_c.so [0x3cb8d8]
/home/daniel/.rvm/gems/ruby-1.9.2-p0@apache_tailer/gems/ffi-1.0.1/lib/ffi_c.so [0x3d1771]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9 [0xe4b5a0]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9 [0xe5bfe2]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9 [0xe513ac]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9 [0xe5879c]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9(rb_vm_invoke_proc+0x2e1) [0xe5ad01]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9 [0xe69b14]
/home/daniel/.rvm/rubies/ruby-1.9.2-p0/lib/libruby.so.1.9 [0xe69c49]
/lib/tls/i686/nosegneg/libpthread.so.0 [0x660545]
/lib/tls/i686/nosegneg/libc.so.6(clone+0x5e) [0x2f2a8e]

[NOTE]
You may have encountered a bug in the Ruby interpreter or extension libraries.
Bug reports are welcome.
For details: http://www.ruby-lang.org/bugreport.html

zsh: abort      bundle exec ruby tail_apache_zmq.rb
chuckremes commented 13 years ago

Thanks for the report. Here are a few things you should know plus one or two things you can try.

  1. MRI 1.9.anything has some threading flaws that may make working with 0mq somewhat frustrating. This is detailed in the ffi-rzmq README. If possible, try running your code with the latest JRuby and see if it crashes.
  2. I have made a few fixes to the ffi-rzmq gem but I have not yet bumped the version number and made a new release. Try cloning the github repository and building the gem ("ruby -S gem build ffi-rzmq.gemspec") and installing it from there ("ruby -S gem install ffi-rzmq-0.6.0.gem").
  3. If you have code that reliably crashes 0mq, please send it to me if you can so that I can reproduce it. That's the best way to get a fix.
andrewvc commented 13 years ago

By the way daniel, if you have any issues running under jruby, you can give it a shot under rubinius.

gaiottino commented 13 years ago

Hi. Thanks for your replies. I'll see if I can get you some examples of failing code but for now I've decided not to use zmq for the code I was working on as I found it doesn't perform well enough. When sending 50 messages per second it spent too much time in the zmq code causing the cpu to rise to about 20-30% constantly whereas sending the same messages directly over a eventmachine client-server solution hardly showed up on the cpu.

chuckremes commented 13 years ago

Definitely let us know if you have a reproducible example.

Also, I'm intrigued by your comment about performance. If you have a sample client & server that illustrates a perf problem, I'd love to see it. If there are opportunities to improve the performance of the ruby parts of this library, I definitely want to work on that.

In my projects I can send thousands of messages per second with very little CPU being consumed so I'm curious about your code and why it is running so hot for 50 msgs/s (which is not a lot at all).

gaiottino commented 13 years ago

I've done some testing now and pulled out some results using perftools. I can email you the PDFs. I initially posted the error here since the error occurs in ffi-rzmq.

But I'm seeing some strange things today when trying to create a small script that has the same issues. Right now I'm collecting data on an idle script. What I mean by idle is that I've only created a pub-connection. It immediately uses 30% of my cpu even though it's not sending any messages at all.

I'm starting to think that the real problem is not ffi-rzmq but DripDrop (or actually zmq_machine) to use zmq with eventmachine. It's spending most of it's time in ZMQMachine Reactor#poll. I can email you the PDF.

I did however see the same error as above in my simplified exampe. Should I email the files to you or paste the code here?

chuckremes commented 13 years ago

I recommend pasting the code into a gist (gist.github.com) and posting the link in this issue. We'll take a look at it and see what is going on.

As for the PDF, email it to me directly (cremes@mac.com) and I'll examine it.

Thanks for the follow-up.

andrewvc commented 13 years ago

I'm familiar with the CPU usage issue. I should make a note about it. Basically, the reactor does that if there are no listening sockets. If you create a sub socket and have it listen on any port (i recommend inprocess://nothing)it should fix it, you don't even need to connect anything to it. Chuck, do you get this with zmqmachine when you use it, or am I doing something wrong?

At any rate we are working on better em support, so this isn't a permanent problem.

andrewvc commented 13 years ago

My apologies, I didn't mean to close the issue, its been reopened.

gaiottino commented 13 years ago

@andrewvc: But the CPU remains at 30% even when I start the client and start sending messages.

I've simplified the code I was running and added it to this gist: https://gist.github.com/743522

In the example I'm tailing apache logs and just sending each row through zmq. Another strange thing I noticed was that in my simplified version I'm actually sending 400-500 requests per second. I've removed the code that determines if a request is interesting or not. The code I was running when I created the issue discarded about 90% of the requests and only send 40-50 messages per second through zmq. Even though the example has a much higher throughput the CPU still remains steady at 30% so I don't think it's the actual message sending that's heavy.

But it doesn't explain the regular crashes. I'm unable to test with JRuby since it's not installed on our servers. I'd have to setup something to test it locally but don't really have the time atm.

gaiottino commented 13 years ago

When running the simplified example the error occurs alot quicker. I've seen it as quickly as a few seconds after I start the server and client. I'm assuming that's because of the higher throughput.

andrewvc commented 13 years ago

For now try adding the nothing sub socket to the server. That should at least help with the CPU usage. I'll try running it myself later today, once I have a real computer (on my phone at the moment)

chuckremes commented 13 years ago

Thanks for the PDFs. They highlight an issue that I uncovered in ZMQMachine about a week ago (and have fixed). Using a SortedSet to contain the timers is extremely inefficient. I have since replaced that code. It will be part of the next ZMQMachine release which is scheduled for next Wednesday.

Also, please note that in the first PDF most of the processing time is consumed by BERT. It's funny, but I recently ran a set of benchmarks comparing the json, bert and msgpack gems with each other. BERT has some serious performance issues.

Andrew, you might want to consider replacing BERT with a more CPU-efficient packer like msgpack. I can send you my benchmarks (source & results) if you think that would be helpful.

In the meantime, I will try running the code in your gist and see if I get a crash. Thanks for reducing it!

chuckremes commented 13 years ago

I have figured out the high CPU usage. There is a flaw in the ffi-rzmq gem for socket polling. When polling, I pass in a poll interval which is the maximum amount of time it should block while waiting for a network event.

When there are no sockets defined, it returns from poll immediately regardless of the "block" interval that was defined. This essentially puts the reactor loop into a busy loop (imagine "while true; end") and it just burns up the CPU.

I'll fix that for the next ffi-rzmq release.

The other CPU issue is in ZMQMachine. By default, each reactor loop will block for 100 microseconds while waiting for a socket event. If the socket isn't doing anything, this 100 microsecond "wait" results in a busy loop just like described above (another "while true").

I chose 100 usec so that timer events and code set to run via next_tick would not get starved for time. I'll rethink how this should work in the case where there are no timer events and the socket is quiet.

BTW, my earlier comments about BERT still stand. That lib is not helping the CPU situation.

andrewvc commented 13 years ago

Chuck, checking out msgpack it looks like a perfect drop in replacement, in light of these perf problems, I'll work on replacing it, it should be pretty simple to do in dripdrop.

chuckremes commented 13 years ago

Okay, just pushed a few fixes for zmqmachine to the repository. Feel free to check out master and install it. Let me know if your problems persist.

https://github.com/chuckremes/zmqmachine

BTW, I was never able to reproduce a consistent crash. The crashes I saw were pretty normal when running 0mq with MRI. MRI has some threading issues.

I know you don't have JRuby in production, but perhaps you could test with it to see if it exhibits the same problem? It's pretty easy to put it under load by cat'ing a large file to a test logfile.

I'll leave this issue open until I hear back from you.

gaiottino commented 13 years ago

I'm using ffi-rzmq and zmqmachine from master and still seeing high CPU load. 60% on my laptop. I'll perform some tests with MRI and then with JRuby and send you the PDFs again.

chuckremes commented 13 years ago

Okay, one more tweak. Looks like dripdrop exposes a use-case where there are sockets open but not registered for any events. That case was polling instantly and causing the CPU to run away. I now check for that and sleep a bit.

I ran your test programs and saw that when no data was being published both programs settled in at around 1.5% CPU on my box.

When I slammed it with a 2GB file, the server spiked to 100% and the client to 90% to handle the workload.

I hope this fix solves it for you. Let me know!

gaiottino commented 13 years ago

The CPU issue seems to be resolved! It does not register when idle:ing. Great job!

I managed to send through 140k rows 3 times (size of the logfile I'm testing with) without seeing the error I reported above. I don't know if that means my initial bug report is fixed or if the performance fixes just made it less frequent.

I'll continue running tests tomorrow and will see how the code performs on the production servers under a longer period.

Looking forward to trying out the updated DripDrop without BERT since it's still spending a lot of time there as you've mentioned. Msgpack looks very nice.

Btw, the gemspec file for zmqmachine is wrong. Had to edit it before I could build the gem since it lists some deleted? files.

I was going to try using JRuby as well but can't install DripDrop as it tried to install thin which fails because JRuby doesn't support mkmf. I've only tried JRuby once a couple of years ago and remember not using it because of this issue but thought they had support for it now?

chuckremes commented 13 years ago

Glad to hear your issue has cleared up with these changes. I'm going to close this ticket for now. Please re-open a new one if you encounter a new problem (or the old one flares up again).

Regarding JRuby, they are adding in support for building and running C extensions. It is not yet part of a release. If you want to play with it, I recommend joining the #jruby channel on freenode.net and asking for guidance. I believe they plan to ship C extension support in JRuby 1.6 which is likely slated for 1st quarter 2011.

Thank you for your bug report.

andrewvc commented 13 years ago

I've updated dripdrop on github to use MessagePack instead of BERT. Please note that this means you can't use some BERT only features, like symbols as keys/values.

gaiottino commented 13 years ago

DripDrop is performing alot better without BERT. GJ! The gemspec file is wrong though. It's msgpack, not messagepack

chuckremes commented 13 years ago

You should probably open a ticket with the dripdrop project. :)