gearman / gearmand

http://gearman.org/

Protocol addition to bulk register jobs (x CAN_DO -> 1 MASS_DO) #6

Open hashar opened 8 years ago

hashar commented 8 years ago

OpenStack Zuul relies on Gearman to run jobs on a farm of 800 nodes, each registering 20000 functions. The massive number of CAN_DO packets (on the order of 16 million across the farm) puts stress on the server. @jeblair went with an expansion of the Gearman protocol that lets one register multiple functions in a single MASS_DO command.

Their software is in Python, and the protocol expansion is handled via a child class: https://github.com/openstack-infra/zuul/commit/d43715988766fa95b88af5fc2c9d2c2aa723b4f9 . For the worker:

class GearWorker(gear.Worker):
    MASS_DO = 101

    def sendMassDo(self, functions):
        # Join the function names with NUL bytes -- the same payload as a
        # series of CAN_DO packets, but carried in a single packet.
        data = b'\x00'.join([gear.convert_to_bytes(x) for x in functions])
        self.broadcast_lock.acquire()
        try:
            # Send one MASS_DO request to every connected server.
            p = gear.Packet(gear.constants.REQ, self.MASS_DO, data)
            self.broadcast(p)
        finally:
            self.broadcast_lock.release()
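For context, a minimal usage sketch of the subclass above (the host name and function names are placeholders, and this assumes the gear library's addServer/waitForServer API):

import gear

# Illustrative only: announce many functions with one MASS_DO packet
# instead of sending one CAN_DO packet per function.
worker = GearWorker('example-worker')           # subclass defined above
worker.addServer('gearman.example.org', 4730)   # hypothetical server
worker.waitForServer()

# Placeholder names; Zuul would pass its ~20000 per-node functions here.
functions = ['build:job-%d' % i for i in range(20000)]
worker.sendMassDo(functions)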

On the server side:

    def handleMassDo(self, packet):
        # Reset this connection's abilities, then register every
        # NUL-separated function name carried in the packet data.
        packet.connection.functions = set()
        for name in packet.data.split(b'\x00'):
            self.log.debug("Adding function %s to %s" % (
                name, packet.connection))
            packet.connection.functions.add(name)
            self.functions.add(name)

So that is merely the payloads of several CAN_DO packets joined with \x00 into a single packet, which saves all of the per-packet handling overhead.
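For illustration only, here is a rough sketch of the framing difference, using the standard 12-byte Gearman binary header (magic, big-endian type, big-endian size), with CAN_DO = 1 from the existing protocol and MASS_DO = 101 as in the Zuul patch. The frame() helper and the function names are made up, and the real win is per-packet handling on the server rather than the bytes saved:

import struct

REQ_MAGIC = b'\x00REQ'
CAN_DO = 1      # standard Gearman packet type
MASS_DO = 101   # type used by the Zuul expansion above

def frame(ptype, data):
    # 12-byte binary header: magic, big-endian type, big-endian data size.
    return REQ_MAGIC + struct.pack('>II', ptype, len(data)) + data

names = [b'build:job-%d' % i for i in range(20000)]

# One CAN_DO packet per function: 12 bytes of header overhead each.
can_do_bytes = sum(len(frame(CAN_DO, n)) for n in names)

# One MASS_DO packet: the same names, NUL-joined, behind a single header.
mass_do_bytes = len(frame(MASS_DO, b'\x00'.join(names)))

print(can_do_bytes, mass_do_bytes)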

Would that be a feature that could be added to the reference Gearman protocol?

A bulk equivalent for CAN_DO_TIMEOUT could be of interest as well.
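Purely as a strawman (nothing like this exists in gear or gearmand today), such a packet could carry name/timeout pairs, for example:

# Hypothetical encoding for a bulk CAN_DO_TIMEOUT ("MASS_DO_TIMEOUT"):
# each entry is name + NUL + timeout-in-seconds, and entries are joined
# by a double NUL, mirroring the NUL-joined MASS_DO payload above.
def encode_mass_do_timeout(entries):
    """entries: iterable of (name: bytes, timeout_seconds: int) pairs."""
    return b'\x00\x00'.join(
        name + b'\x00' + str(timeout).encode('ascii')
        for name, timeout in entries
    )

payload = encode_mass_do_timeout([(b'build:job-1', 300), (b'build:job-2', 600)])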

SpamapS commented 8 years ago

This sounds awesome. I'm a zuul user myself and I can see how this could happen. Could you prepare a patch to the protocol docs and maybe a patch for the C/C++ server as well? Thanks!