JohnMcLear / draw

A real time collaborative drawing tool using nodejs, socket.io & paper.js
Apache License 2.0
483 stars 157 forks

Insane CPU usage on http://draw.etherpad.org #50

Closed tranek closed 11 years ago

tranek commented 11 years ago

Why is this happening?

tranek commented 11 years ago

So it looks like websockets are disabled on http://draw.etherpad.org?

warn - websocket connection invalid
info - transport end (undefined)

It's spamming xhr-polling debug messages. Any time I see polling I get a little nervous (bad memories from college), but I don't think that is it.

The CPU spikes seem to be directly related to any kind of read/write to the database. It doesn't look like it's using the same database as Etherpad, so that can't be it. One thing that I've noticed is that the version of node on the server is a bit old. Would upgrading to the latest version v0.10.5 be a viable option?

If that doesn't outright fix it, I'd like to try switching over to MySQL to see if there's any difference there. That would have to be something that you do @JohnMcLear. If anything, that might help us limit the problem to dirty db interaction.

tranek commented 11 years ago

It's also possible that the importJSON() and exportJSON() functions are responsible as they are associated with db calls. I'm really really really hoping that the solution is just to upgrade node and that upgrading node is doable.

JohnMcLear commented 11 years ago

I very much doubt a node upgrade will fix it :)

JohnMcLear commented 11 years ago

As far as switching to MySQL, I think testing / profiling should be done before we start making changes to the environment :)

I was thinking this morning that it should be possible to slightly modify the stress/load testing tool for Etherpad to broadcast drawing edits instead of Etherpad changesets. That might be the best way to address this issue, as it will tell us how many lurkers/drawers a drawing can support.

https://github.com/JohnMcLear/etherdraw-stresstest is the url to the stress test tool you can modify / use :)
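To get a rough feel for what "how many lurkers/drawers a drawing can support" means in message volume, here is a back-of-the-envelope sketch. It assumes that every edit from a drawer is rebroadcast to every other connected client; that assumption, the function name, and the numbers are illustrative and not taken from the draw codebase or the stress test tool.

```javascript
// Estimate how many messages per second the server has to send out for one drawing.
// Assumption (not confirmed by the source): each edit is rebroadcast to every
// other connected client, whether that client is drawing or only watching.
function outboundMessagesPerSecond(drawers, lurkers, editsPerDrawerPerSecond) {
  const clients = drawers + lurkers;
  const recipientsPerEdit = Math.max(clients - 1, 0); // everyone except the sender
  return drawers * editsPerDrawerPerSecond * recipientsPerEdit;
}

// Hypothetical load: 5 drawers, 20 lurkers, 10 edits per second per drawer.
console.log(outboundMessagesPerSecond(5, 20, 10)); // 5 * 10 * 24 = 1200
```

Under that assumption the outbound volume grows with drawers times total clients, so a stress test would ideally vary both numbers independently.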

tranek commented 11 years ago

This is a tricky beast to crack. On one hand, I would argue that upgrading node could positively impact it since new node versions can come with newer V8 JavaScript engines, right?

I added some time checks to http://draw.etherpad.org and compared the results to a local Ubuntu VM with 4 virtual cores on my laptop:

http://draw.etherpad.org:
  Time to load db = 1 ms
  Time to import JSON = 28 ms
  Time to export JSON = 33 ms
  Time to save db = 1 ms

my local vm:
  Time to load db = 2 ms
  Time to import JSON = 77 ms
  Time to export JSON = 24 ms
  Time to save db = 0 ms

It's not importing/exporting JSON or the database.
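For anyone wanting to repeat this kind of measurement, a minimal sketch of a timing wrapper follows. The `timed` helper is hypothetical, and `JSON.stringify` / `JSON.parse` on made-up data stand in for the real `exportJSON()` / `importJSON()` and database calls, which are not reproduced here.

```javascript
// Hypothetical helper: runs fn, logs how long it took, and returns fn's result.
function timed(label, fn) {
  const start = process.hrtime.bigint(); // nanoseconds, monotonic clock
  const result = fn();
  const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
  console.log('Time to ' + label + ' = ' + Math.round(elapsedMs) + ' ms');
  return result;
}

// Stand-in data; a real check would time the actual sketch loaded from the db.
const sketch = { layers: [{ paths: new Array(1000).fill([0, 0, 10, 10]) }] };

// Stand-ins for exportJSON() and importJSON().
const json = timed('export JSON', () => JSON.stringify(sketch));
const restored = timed('import JSON', () => JSON.parse(json));
```

This only covers synchronous calls. For a callback-based database read or write, the start time would need to be captured before the call and the difference taken inside the callback.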

Watching the server's debug output it looks like it goes through 3, 4, or more cycles of:

setting request GET /socket.io/1/xhr-polling/OmwfCKbXdFudgkHTW0Wj?t=1367674633937
debug - setting poll timeout
debug - discarding transport
debug - cleared close timeout for client OmwfCKbXdFudgkHTW0Wj
debug - clearing poll timeout
debug - xhr-polling writing 8::
debug - set close timeout for client PIJAmAMGz0B8LJeuW0Wh
debug - xhr-polling closed due to exceeded duration

before it starts sending data (on a page refresh - sketch load). The only immediately apparent difference to me besides the node version is that it is using xhr-polling vs websockets (and I'm not behind a reverse proxy). Is there a reason why websockets are not working on http://draw.etherpad.org ? If it was a purposefully made decision, what is the reason behind it and is it possible to switch it back on for testing?
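Rather than counting those polling cycles by eye, one option is to count them in captured debug output. The sketch below assumes the log has been saved as plain text and that each cycle ends with the "xhr-polling closed due to exceeded duration" line quoted above. The `countPollCycles` name and the sample text are illustrative only.

```javascript
// Count how many xhr-polling connections were closed for exceeding their duration.
// Assumption: one such line marks the end of one polling cycle.
function countPollCycles(logText) {
  return logText
    .split('\n')
    .filter((line) => line.includes('xhr-polling closed due to exceeded duration'))
    .length;
}

// Shortened sample assembled from the kinds of lines quoted in this thread.
const sampleLog = [
  'debug - setting poll timeout',
  'debug - discarding transport',
  'debug - clearing poll timeout',
  'debug - xhr-polling writing 8::',
  'debug - xhr-polling closed due to exceeded duration',
  'debug - setting poll timeout',
  'debug - xhr-polling closed due to exceeded duration',
].join('\n');

console.log(countPollCycles(sampleLog)); // 2
```

Comparing this count for a page refresh over xhr-polling against one over websockets would give a concrete number to attach to the load delay.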

Also: I can never get my VM's node process over 5% CPU usage... No matter how hard I hit it. That's all local connections though.

JohnMcLear commented 11 years ago

It's not that tricky to crack, this is the process:

  1. Get the etherdraw-stresstest tool working.
  2. Test load
  3. Iterate until we find a way to improve performance :)

Being able to recreate a controlled test is really the only way to test software imho

tranek commented 11 years ago

I'm still going to bother you to upgrade node. You're not off the hook from that! :P

tranek commented 11 years ago

Check out: http://draw.tranek.com/d/testing

I have an nginx 1.4.0 server running with node 0.10.5 in an Ubuntu 13.04 VM without Varnish.

Using xhr polling caused the excessive slowness when loading from the database, but the CPU usage was barely anything at all. As soon as I switched to websockets, the slowness from touching the database disappeared immediately. Still nothing more than a 1% lovetap on the CPU.

Any chance that you could have a few people test it and see how it performs? Would be nice to run the stresstest tool on it to see what happens...

JohnMcLear commented 11 years ago

Added the Varnish websockets config from https://www.varnish-cache.org/docs/3.0/tutorial/websockets.html to draw.etherpad.org, which should mean websockets will work.

Still, we need stress testing.

JohnMcLear commented 11 years ago

latest node being built now, should be done in 20 mins or so

JohnMcLear commented 11 years ago

It's built, and both services have been restarted.

tranek commented 11 years ago

It's loading leaps and bounds faster on http://draw.etherpad.org !!! I haven't checked CPU usage yet though.

Awesome :D

tranek commented 11 years ago

Is the CPU usage still high?

The websockets definitely made the responsiveness many times better. Is XHR polling supposed to be like that?

JohnMcLear commented 11 years ago

The CPU usage was high because I had a large number of people on a drawing.

I'm still a believer you are clutching at the wrong straw assuming that the different transport introduces overhead.

We should do controlled stress tests to test theories before casting assertions.

tranek commented 11 years ago

The transport was responsible for the delay in loading the sketches (I'm making no claims about CPU usage or other responsiveness issues). That I tested on my own machines.

Anyway, I'm going to close this issue for now. We have other things to worry about - why http://draw.etherpad.org isn't working and frontend tests.