camsas / firmament

The Firmament cluster scheduling platform
Apache License 2.0

Make the coordinator less CPU-spin happy #30

Open ms705 opened 8 years ago

ms705 commented 8 years ago

Currently, the coordinator spins hard on incoming messages in the StreamSocketsAdapter when doing asynchronous receives. This was the easiest way of implementing the functionality, but is not actually required.

Instead, we should rework the StreamSocketsAdapter and StreamSocketsChannel code to use the Boost ASIO proactor pattern correctly (with a finishing async receive callback setting up the next async receive), or move to a libev or select-based implementation.

This isn't a correctness issue, but a performance one: the coordinator currently uses an entire CPU core on each machine. Usually, that's not an issue, but we may as well not waste resources unnecessarily.
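
For reference, here is a minimal sketch of the Boost ASIO proactor pattern described above, where each completion handler re-arms the next asynchronous receive so the thread blocks in the OS instead of spinning. The class and member names below are illustrative placeholders, not Firmament's actual StreamSocketsChannel code:

#include <boost/asio.hpp>
#include <boost/bind.hpp>
#include <array>
#include <cstddef>

class ProactorChannel {
 public:
  explicit ProactorChannel(boost::asio::io_service& io) : socket_(io) {}

  // Arm the first asynchronous receive; the completion handler below
  // re-arms the next one, so no thread ever busy-polls for data.
  void StartReceive() {
    socket_.async_read_some(
        boost::asio::buffer(buf_),
        boost::bind(&ProactorChannel::OnReceive, this,
                    boost::asio::placeholders::error,
                    boost::asio::placeholders::bytes_transferred));
  }

 private:
  void OnReceive(const boost::system::error_code& ec, std::size_t n) {
    if (ec) return;                  // channel closed or errored out
    ProcessMessage(buf_.data(), n);  // handle the received bytes
    StartReceive();                  // set up the next async receive
  }

  void ProcessMessage(const char* /*data*/, std::size_t /*len*/) {}

  boost::asio::ip::tcp::socket socket_;
  std::array<char, 4096> buf_;
};

With this structure, io_service::run() dispatches completions as they arrive and the thread sleeps in the kernel whenever no data is pending.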

mgrosvenor commented 8 years ago

An alternative option would be to move to CamIO2, which can be made to sleep, but will still have an async pattern for speed.


WIZARD-CXY commented 8 years ago

I'm now hitting the same problem with the latest code: the coordinator uses 100% CPU. [screenshot: 2016-05-04 9:41:45] I'm not sure whether it is caused by this issue?

WIZARD-CXY commented 8 years ago

ping @ms705 Are there any updates on this issue?

ms705 commented 8 years ago

Hi @WIZARD-CXY,

Sorry for the delayed response -- we've been bogged down with a bunch of paper deadlines recently.

This issue is still outstanding. Our plan to fix it is to replace our own home-baked, polling RPC layer with gRPC, which will fix the spinning. However, it will take us a while to implement this.

In the meantime, you can work around the issue by adding a sleep() or nanosleep() invocation at line 83 in stream_sockets_adapter.h. If you sleep there (e.g., for 5ms), you will slow down the receive loop, leading to reduced CPU consumption. However, your RPC processing latency will also be increased.
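
For concreteness, here is a rough sketch of such a throttled receive loop; the helper functions are hypothetical stand-ins, not the actual code around line 83 of stream_sockets_adapter.h:

#include <unistd.h>   // usleep

// Hypothetical stand-ins for the adapter's poll and dispatch steps.
static bool TryDequeueMessage() { return false; }
static void HandleMessage() {}

// A receive loop throttled with a short sleep: each idle iteration trades
// ~5ms of extra RPC latency for not burning an entire CPU core.
void ThrottledReceiveLoop() {
  for (;;) {
    if (TryDequeueMessage()) {
      HandleMessage();
    } else {
      usleep(5000);   // 5ms
    }
  }
}

The sleep duration is a direct trade-off: every idle iteration adds up to that much latency to the next RPC, so pick the smallest value that brings CPU usage down to an acceptable level.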

Hope that helps!

WIZARD-CXY commented 8 years ago

@ms705 Thanks, I will give it a shot tomorrow!

WIZARD-CXY commented 8 years ago

[screenshot: 2016-05-16 3:14:48] Is this the right trick? I added usleep above, but the coordinator is still using 100% CPU. @ms705

ms705 commented 8 years ago

@WIZARD-CXY Ah -- this workaround only works if there actually are any channels (i.e., other running coordinators or running tasks). Otherwise, the code falls into the early exit condition on line 59 and you spin until channels show up.
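
For illustration, a hedged sketch of that behaviour (names and structure are assumed, not the actual adapter code): a sleep placed inside the channel-polling body is never reached when the channel set is empty, so the caller re-invokes the poll in a tight loop:

#include <unistd.h>   // usleep
#include <vector>

struct Channel {};    // placeholder for a stream sockets channel

// Illustrative poll step: with no channels it returns immediately, so a
// caller that retries in a loop spins at 100% CPU despite the usleep.
void AwaitNextMessagePoll(const std::vector<Channel*>& channels) {
  if (channels.empty())
    return;            // early exit: the usleep below is never reached
  for (Channel* chan : channels) {
    (void)chan;        // ... check the channel for a pending message ...
  }
  usleep(5000);        // this throttle only applies once channels exist
}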

Here's a patch that works even when there are no channels (with the same proviso regarding RPC latency as above):

diff --git a/src/engine/node.cc b/src/engine/node.cc
index 5c01a75..6353e55 100644
--- a/src/engine/node.cc
+++ b/src/engine/node.cc
@@ -106,6 +106,7 @@ void Node::AwaitNextMessage() {
   VLOG(3) << "Waiting for next message from adapter...";
   m_adapter_->AwaitNextMessage();
-  //boost::this_thread::sleep(boost::posix_time::seconds(1));
+  usleep(500000);
 }

In other words, add the usleep invocation to AwaitNextMessage() in src/engine/node.cc. If you add this, you can get rid of the usleep statements in stream_sockets_adapter.h.

WIZARD-CXY commented 8 years ago

It works now @ms705, thanks!