Add a job/debug monitor stream(s) to the server

robertmaynard commented 9 years ago

Issue

By leveraging the ZeroMQ pub-sub model, we can have the server broadcast a stream of status and monitoring information.

This solves two large outstanding issues with Remus. The first is that the server is a black box with no way of informing clients or third parties about what is happening, whether internal errors are occurring, etc. The second is that the client is limited to a busy wait when checking for status events. If we allow the client to also use a pub / sub style connection to the server, we can build a more efficient status monitoring client.

Technical Issues

The primary issue with the pub / sub model is the classic slow joiner problem: you can't determine when a subscriber starts to receive messages. Even if the subscriber is started before the publisher, the subscriber will always miss the first few messages that the publisher sends. This is because connecting takes a small but non-zero amount of time, during which the publisher may already be sending messages that the subscriber never sees.

If the monitor were emitting only general status messages and the goal was to show the overall health of the server, the slow joiner issue would not be a problem. But since it can be used to monitor specific jobs, we need some way to minimize the severity, or even the occurrence, of the slow joiner issue. A couple of decent solutions are proposed in the Node Coordination ( http://zguide.zeromq.org/page:all#Node-Coordination ) section of the ZMQ guide. Personally I think the best way for Remus is:

Server opens PUB socket and starts sending non job related messages and regular heartbeat messages.
Client / ServerMonitor connect a SUB socket and wait for a message to arrive from the PUB socket. They then send a message over the classic Req/Rep client socket to the server stating what channels should be created ( e.g. start sending info for all jobs ).
Now that the publisher has all the necessary information, it starts to send real data.
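
To make the ordering of those three steps concrete, here is a minimal in-process sketch. It does not use ZMQ at all: `ToyPublisher`, `enableJobChannel` and `runHandshake` are hypothetical names invented for illustration, and the req/rep request is reduced to a plain method call. It only shows that messages sent before a subscriber connects are lost, and that job data is held back until the subscriber has seen its first message.

```cpp
#include <functional>
#include <string>
#include <vector>

// Toy in-process publisher: delivers each message only to subscribers that
// are already connected, which is what makes late joiners miss messages.
class ToyPublisher
{
public:
  typedef std::function<void(const std::string&)> Subscriber;

  void connect(Subscriber s) { subscribers_.push_back(s); }

  // Stands in for the req/rep request that asks the server for job data.
  void enableJobChannel() { jobChannelEnabled_ = true; }

  void publishHeartbeat() { send("heartbeat"); }

  // Job messages are refused until a client has completed the handshake.
  bool publishJob(const std::string& payload)
  {
    if (!jobChannelEnabled_) { return false; }
    send("Job:" + payload);
    return true;
  }

private:
  void send(const std::string& msg)
  {
    for (const Subscriber& s : subscribers_) { s(msg); }
  }

  std::vector<Subscriber> subscribers_;
  bool jobChannelEnabled_ = false;
};

// Runs the three steps and returns everything the subscriber received.
inline std::vector<std::string> runHandshake()
{
  std::vector<std::string> received;
  ToyPublisher pub;

  pub.publishHeartbeat();        // step 1: sent before anyone connected, lost
  pub.publishJob("queued");      // refused: nobody has asked for job data yet

  pub.connect([&received](const std::string& m) { received.push_back(m); });
  pub.publishHeartbeat();        // step 2: subscriber sees its first message

  if (!received.empty()) { pub.enableJobChannel(); } // the req/rep call
  pub.publishJob("in progress"); // step 3: real data now flows
  return received;
}
```

Calling `runHandshake()` yields only the second heartbeat and the job message sent after the handshake, which is the behaviour the proposal is after.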

Extending the Server

Here are the very high-level requirements for publication on the server:

Pub socket will always exist
Pub connection details will be controlled by remus::server::ServerPorts
A new request type will be added to the classic client interface. The response to this request will be the endpoint of the publication socket. This solves the discovery problem of figuring out which port the pub socket has bound to. It also means that anything that wants to act as a server monitor will have to use both a req/rep and a pub/sub socket.
Server method variables will control the type of information broadcast on the pub socket. The classic verbosity levels of DEBUG, WARN, ERROR are a separate concern from the pub socket: a client might only care about job status publications and not about the general health of connected workers, and in that use case the classic logging levels are not useful. Instead I propose we use:
- Jobs: Job status information, formatted in a way that a sub can filter based on job uuid.
- Worker: Information about what and when workers are connecting, taking jobs, asking for jobs and heart beating.
- Errors: System wide errors only. This will include server exceptions, workers being marked as dead, jobs failing, etc.
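
One way to picture these three categories is as independent flags that the server consults before publishing. The sketch below is only an illustration: `EventCategory` and `PublishSettings` are made-up names, and starting with every category enabled is an assumption, not something decided in this issue.

```cpp
// Hypothetical category flags mirroring the Jobs / Worker / Errors list.
enum EventCategory
{
  EVENT_NONE   = 0,
  EVENT_JOBS   = 1 << 0,
  EVENT_WORKER = 1 << 1,
  EVENT_ERRORS = 1 << 2,
  EVENT_ALL    = EVENT_JOBS | EVENT_WORKER | EVENT_ERRORS
};

// Holds which categories the server should broadcast on the pub socket.
class PublishSettings
{
public:
  // Assumption of this sketch: everything is published by default.
  PublishSettings() : enabled_(EVENT_ALL) {}

  void enable(EventCategory c)  { enabled_ |= static_cast<unsigned>(c); }
  void disable(EventCategory c) { enabled_ &= ~static_cast<unsigned>(c); }

  // True when messages of this category should be sent.
  bool shouldPublish(EventCategory c) const
  {
    return (enabled_ & static_cast<unsigned>(c)) != 0;
  }

private:
  unsigned enabled_;
};
```

Unlike a single verbosity level, the flags are independent, so turning off Worker traffic leaves Jobs and Errors untouched.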

Client monitoring of the Server

To monitor the activity of the server, a new remus::client class called Monitor (or ServerMonitor?) will be created. This class must be extensible so that the user can plug it into their own code easily. A quick draft of what the Monitor class would look like is:

class Client
{
public:
  ...
  remus::client::Monitor monitorServer();

};

**Edited: With new Monitor design**

class Monitor
{
public:
  typedef remus::function<void(const std::string& domain,
                               remus::thirdparty::cJSON* msg,
                               remus::Client* source)> MonitorFunction;

  std::set<std::string> domains() const;

  //function is expected to have the MonitorFunction signature:
  // operator()(const std::string& domain, cJSON* msg, Client* source)
  //
  //the domain string passed here is the one to use to unsubscribe
  void subscribe(const std::string& domain, MonitorFunction function);

  void unsubscribe(const std::string& domain);

};
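
As a rough illustration of how the subscribe / unsubscribe bookkeeping behind such a class could work, here is a self-contained sketch. It is not the Remus implementation: `MonitorSketch`, `Message` and `dispatch` are hypothetical, `Message` stands in for cJSON, the Client pointer is left out, and `domains()` is assumed to return the currently subscribed domains. Dispatch also matches domains exactly instead of by prefix, to keep the example short.

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>

// Stand-in for remus::thirdparty::cJSON so the sketch compiles on its own.
struct Message { std::string text; };

class MonitorSketch
{
public:
  typedef std::function<void(const std::string& domain,
                             const Message& msg)> MonitorFunction;

  // Assumption: reports the domains that currently have a callback.
  std::set<std::string> domains() const
  {
    std::set<std::string> names;
    for (const auto& entry : callbacks_) { names.insert(entry.first); }
    return names;
  }

  // Registers (or replaces) the callback for a domain.
  void subscribe(const std::string& domain, MonitorFunction function)
  {
    callbacks_[domain] = function;
  }

  void unsubscribe(const std::string& domain) { callbacks_.erase(domain); }

  // Hypothetical entry point; a real monitor would be driven by the SUB
  // socket. Returns true when a callback exists for exactly this domain.
  bool dispatch(const std::string& domain, const Message& msg) const
  {
    auto it = callbacks_.find(domain);
    if (it == callbacks_.end()) { return false; }
    it->second(domain, msg);
    return true;
  }

private:
  std::map<std::string, MonitorFunction> callbacks_;
};
```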

So it then becomes fairly easy to construct a JobMonitor:


class JobMonitor
{
public:
  JobMonitor( remus::proto::Job job,
              remus::client::Monitor monitor);

  remus::proto::JobStatus latestStatus();

  //maybe even allow buffering of status
  std::vector< remus::proto::JobStatus > BufferedStatus;
};
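
To show how the buffering idea might look, here is a small standalone sketch. The types are placeholders: `JobStatusSketch` stands in for remus::proto::JobStatus, and `onStatus` is a hypothetical hook that a Monitor callback would call. It keeps every status seen for one job, so the latest status is simply the last buffered entry.

```cpp
#include <string>
#include <vector>

// Stand-in for remus::proto::JobStatus.
struct JobStatusSketch
{
  std::string jobId;
  std::string state;
};

class JobMonitorSketch
{
public:
  explicit JobMonitorSketch(const std::string& jobId) : jobId_(jobId) {}

  // Intended to be called from a Monitor callback; statuses that belong
  // to other jobs are ignored.
  void onStatus(const JobStatusSketch& status)
  {
    if (status.jobId == jobId_) { buffered_.push_back(status); }
  }

  bool hasStatus() const { return !buffered_.empty(); }

  // Precondition: hasStatus() is true.
  const JobStatusSketch& latestStatus() const { return buffered_.back(); }

  // Every status received so far, oldest first.
  const std::vector<JobStatusSketch>& bufferedStatus() const
  {
    return buffered_;
  }

private:
  std::string jobId_;
  std::vector<JobStatusSketch> buffered_;
};
```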

Pub/Sub Message Layout

The message must be multipart because ZMQ only supports prefix filtering. That means the first message part will be the key we filter on.

The easiest method will be to make the first message part itself a key-value pair, where the key component is one of the following:

Job
Worker
Error

And the value component is the following:

Job value would be the UUID of the job
Worker values would be the socket id of the worker in md5 form
Error would be the component that caused the error, initially the placeholder 'server' can be used
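
The key-value envelope and the prefix rule can be sketched without any ZMQ calls. In the snippet below, `makeEnvelope` and `matchesSubscription` are hypothetical helpers, and the ':' separator between key and value is an assumption of the sketch. `matchesSubscription` mirrors how a SUB socket filters: a message passes when its first frame begins with the subscription bytes, and an empty subscription matches everything.

```cpp
#include <string>

// Builds the first frame of the multipart message as "<key>:<value>".
// The ':' separator is an assumption of this sketch.
inline std::string makeEnvelope(const std::string& key,
                                const std::string& value)
{
  return key + ":" + value;
}

// Prefix test equivalent to SUB-side filtering: true when the envelope
// starts with the subscription string (an empty subscription matches all).
inline bool matchesSubscription(const std::string& envelope,
                                const std::string& subscription)
{
  return envelope.compare(0, subscription.size(), subscription) == 0;
}
```

With this layout, subscribing to "Job:" would receive every job, while subscribing to "Job:" followed by a UUID would receive only that job. Including the separator in the subscription avoids accidentally matching some other key that merely starts with "Job".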

References:
http://zguide.zeromq.org/page:all#Pub-Sub-Message-Envelopes
http://zguide.zeromq.org/page:all#Node-Coordination
http://redis.io/topics/pubsub

robertmaynard commented 9 years ago

@vibraphone for your review.

vibraphone commented 9 years ago

Some issues:

It looks like the Client creates a Monitor, but users must subclass Monitor. How would the client know what subclass to create inside monitorServer()?
Rather than having a different API to subscribe/unsubscribe to different services (job, worker, error), how about just having:

class Monitor
{
  std::set<std::string> domains() const;
  void subscribe(const std::string& domain);
  void unsubscribe(const std::string& domain);

  virtual void event(const std::string& domain, cJSON* msg);
};

With a well-documented and tested set of message formats (in JSON), this would be super-easy to use. I agree that it would be nice to have a utility for monitoring a single job or error messages. That could simply be a subclass of Monitor that takes a std::function (or boost::function if not in C++11 mode) and calls it when a regular expression on the domain and/or message is matched?

robertmaynard commented 9 years ago

Now this is why I like typing it all out.

You are correct that the initial design is impossible, and I prefer your design for the Monitor class. I do think, though, that it should allow the user to specify the callback via boost::function in the subscribe call.

Now onto the issue of domains. I want to leverage as much of the ZMQ pub/sub model as possible, and it has the feature that it only sends messages to subscribers that have a matching prefix subscription. This means that regex matching would require us to subscribe to all messages and filter afterwards. I would rather have the Monitor class force explicit subscriptions while keeping a very simple user API, so how about something like:

class Monitor
{
  std::set<std::string> domains() const;

  //func is expected to have the following type signature
  // operator()(const std::string& domain, cJSON* msg)
  //
  //the domain string passed here is the one to use to unsubscribe
  void subscribe(const std::string& domain,
                 boost::function<void(const std::string&, cJSON*)> func);

  void unsubscribe(const std::string& domain);

};

vibraphone commented 9 years ago

That looks good so far, but

Why not typedef remus::function to either std::function or boost::function depending on what's available? That future-proofs it.
Should we include a reference/pointer to the Monitor in the callback? Granted, boost::function allows you to store it in a functor if needed, but it might be nice to have. For that matter, since the Client owns the monitor and could also be useful in a callback, why not pass it:

  void subscribe(
    const std::string& domain,
    remus::function<void(const std::string& domain, cJSON* msg, Client* source)>);
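
For reference, one possible shape for such an alias is sketched below. It is only an illustration of the idea, not existing Remus code: it picks std::function when the compiler reports C++11 or later and falls back to boost::function otherwise. `DomainCallback` is a made-up example type.

```cpp
// Select the function wrapper based on the language standard in use.
#if __cplusplus >= 201103L
#  include <functional>
namespace remus { using std::function; }
#else
#  include <boost/function.hpp>
namespace remus { using boost::function; }
#endif

#include <string>

// Example callback type built on the alias; callers never name
// std:: or boost:: directly.
typedef remus::function<void(const std::string& domain)> DomainCallback;
```

Note that some compilers, MSVC in particular, do not report an up-to-date `__cplusplus` value by default, so a real detection scheme would probably be driven by the build system instead.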

robertmaynard commented 9 years ago

I have never added the functionality to Remus to do boost / C++11 detection like smtk has. I have no problem adding this functionality, but I would rather do it as a separate issue. So if somebody wants to start branches to move all boost::shared_ptr, boost::thread etc. over to the remus namespace, go for it.
I think a pointer back is perfectly fine; we just stipulate that the lifespan of the client must be at least as long as that of every monitor created by it.
I have updated the initial issue with these changes and have filled in a little more of JobMonitor. I think it would also be a perfect place to implement storage of all status messages instead of the current implementation of only getting the latest one the server has.

vibraphone commented 9 years ago

Looks good to me.

robertmaynard commented 9 years ago

We need a user-facing class called ServerEventLogger.

The ServerEventLogger will allow the user to log to a std::ostream all information that is being broadcast on the event stream.

Usage of the ServerEventLogger will be roughly:

remus::server::Server server;

//start accepting connections for clients and workers
bool valid = server.startBrokering();

if(valid)
  {
  ServerEventLogger logger = server.constructServerLogger();
  std::fstream outFile("server.log", std::ios::out);
  logger.start( outFile );
  }
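
A minimal sketch of what sits behind `start( outFile )` might look like the following. Everything here is hypothetical: `EventLoggerSketch`, `formatEvent` and the "[domain] message" line format are invented for illustration, and the `log` method stands in for whatever would actually read events off the pub socket.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Formats one event as "[domain] message"; the format is an assumption.
inline std::string formatEvent(const std::string& domain,
                               const std::string& message)
{
  std::ostringstream line;
  line << "[" << domain << "] " << message;
  return line.str();
}

// Hypothetical stand-in for ServerEventLogger.
class EventLoggerSketch
{
public:
  EventLoggerSketch() : out_(nullptr) {}

  // Begin writing events to the given stream, which must outlive the
  // logger (or be detached with stop() first).
  void start(std::ostream& out) { out_ = &out; }
  void stop() { out_ = nullptr; }

  // Writes one line per event; events are dropped until start() is called.
  void log(const std::string& domain, const std::string& message)
  {
    if (out_) { *out_ << formatEvent(domain, message) << "\n"; }
  }

private:
  std::ostream* out_;
};
```

Because the logger only holds a pointer to the stream, the same code works for a std::fstream, std::cout, or an in-memory std::ostringstream in tests.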

vibraphone commented 9 years ago

Events that would be nice for the server to log include:

Server starting.
Server looking for worker files in directory (include the directory list).
Server accepts a connection from a worker.
Server accepts a connection from a client.
Server received job request from client.
Server assigned job to worker (or perhaps received job request with no workers that can field it).

robertmaynard commented 9 years ago

Those all seem reasonable. Sending a message when a job has no available workers will require some additional work, mainly because we try numerous times per second to match queued jobs to current workers, and I don't want to emit a notification each time we do that.

vibraphone commented 9 years ago

@robertmaynard Maybe only emit a message when the number of unresolved jobs changes?
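
That suggestion amounts to simple change detection. A sketch, with the hypothetical name `UnresolvedJobNotifier` and the assumption that the count starts at zero:

```cpp
#include <cstddef>

// Remembers the last reported count of unresolved jobs and signals only
// when a new count differs from it.
class UnresolvedJobNotifier
{
public:
  UnresolvedJobNotifier() : lastCount_(0) {}

  // Called on every matching pass; returns true only when a notification
  // should be emitted because the count changed.
  bool update(std::size_t unresolvedJobs)
  {
    if (unresolvedJobs == lastCount_) { return false; }
    lastCount_ = unresolvedJobs;
    return true;
  }

private:
  std::size_t lastCount_;
};
```

The matching loop can then call `update` many times per second and publish only on the passes where it returns true.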

Kitware / Remus