Assign specific jobs to dedicated workers

Nashtare commented 2 months ago

We currently handle all proof jobs, regardless of their underlying type, with the same pool of workers. In practice, however, Txn / Segment proofs are much heavier and slower than all other kinds of aggregation proofs.

We should consider having some job assignment mechanism, probably relying on paladin's worker routing keys, to assign a particular job queue to a specific pool of workers. This would allow us to select dedicated hardware for the different proving jobs we have when proving blocks, typically selecting much cheaper instances with less memory for higher levels of aggregation.

BGluth commented 2 months ago

Yeah I think this is actually pretty important.

Are we able to reasonably estimate cpu/memory needs for each txn/segment proof at this point? Idk if we want to go with some simple discrete ranking of machines (eg. light & heavy instances) or if we want to query the CPU & memory specs of each worker on startup and do something more dynamic.

Nashtare commented 2 months ago

We could do some benchmarking around the aggregation layers, but these should be fairly light (we don't need anything other than the base circuits loaded from the ProverState), and the proving itself shouldn't take more than 4-5 GB of RAM, I'd assume. This would allow for a big drop in the memory / CPU ratio for those workers, whereas for segment proofs, t2d-60 (what we currently use) has a ratio of about 4 (240 GB RAM / 60 vCPUs).
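To make the ratio argument concrete, here is a small Python sketch. The t2d-60 figures are the ones quoted above; the aggregation worker is purely hypothetical (5 GB is the upper end of the estimate, and the 8-vCPU count is an assumption chosen only for illustration, not a benchmark result).

```python
def gb_per_vcpu(ram_gb: float, vcpus: int) -> float:
    """Memory-to-CPU ratio of an instance, in GB of RAM per vCPU."""
    if vcpus <= 0:
        raise ValueError("vcpus must be positive")
    return ram_gb / vcpus


# Figures quoted in this thread for segment proofs on t2d-60.
segment_ratio = gb_per_vcpu(ram_gb=240, vcpus=60)

# Hypothetical aggregation worker: ~5 GB of RAM (upper end of the estimate)
# on an assumed 8-vCPU instance. The vCPU count is illustrative only.
aggregation_ratio = gb_per_vcpu(ram_gb=5, vcpus=8)

print(f"segment: {segment_ratio} GB/vCPU")
print(f"aggregation (hypothetical): {aggregation_ratio} GB/vCPU")
```

Actual numbers would have to come from benchmarking the aggregation layers.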

temaniarpit27 commented 2 months ago

@Nashtare @BGluth As discussed with @muursh, we have added a couple of features for this task. There are now 2 modes: default (works the way it does right now, where any job can run on any machine) and affinity (in this mode we need to provide different routing keys on different servers to enable workers).

Split mode details:

Leader args:

--worker-run-mode affinity

This will put segment proof jobs and block proof jobs in different queues.

Worker args:

--task-bus-routing-key heavy-proof/light-proof

This will start a worker which will accept only messages from the corresponding queue.
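A minimal Python sketch of the routing idea follows; it is not the actual leader or paladin code. The names `RunMode`, `JobKind`, `routing_key_for` and `worker_accepts` are hypothetical. The mapping of segment proofs to `heavy-proof` and block proofs to `light-proof` is an assumption based on the key names, as is the rule that a worker only takes jobs whose key matches its own.

```python
from enum import Enum
from typing import Optional


class RunMode(Enum):
    DEFAULT = "default"    # any job can run on any worker
    AFFINITY = "affinity"  # jobs are split across dedicated queues


class JobKind(Enum):
    SEGMENT_PROOF = "segment"
    BLOCK_PROOF = "block"


HEAVY_KEY = "heavy-proof"
LIGHT_KEY = "light-proof"


def routing_key_for(mode: RunMode, job: JobKind) -> Optional[str]:
    """Pick the routing key the leader would attach to a job.

    In default mode no key is attached. In affinity mode, segment proofs
    are assumed to go to the heavy queue and block proofs to the light one.
    """
    if mode is RunMode.DEFAULT:
        return None
    return HEAVY_KEY if job is JobKind.SEGMENT_PROOF else LIGHT_KEY


def worker_accepts(worker_key: Optional[str], job_key: Optional[str]) -> bool:
    """Assumed matching rule: a worker only takes jobs carrying its own key."""
    return worker_key == job_key


key = routing_key_for(RunMode.AFFINITY, JobKind.SEGMENT_PROOF)
print(key, worker_accepts(HEAVY_KEY, key), worker_accepts(LIGHT_KEY, key))
```

With this split, a worker started with `light-proof` never picks up a segment proof, which is what would let it run on a smaller instance.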

Currently, in this version, we don't have the functionality of running multiple queues on 1 machine. Since this will be a multi-node architecture, we will also need to provide the correct AMQP URI on the leader and workers to make sure they connect to the same RabbitMQ server.

Also, we will need either a RabbitMQ cluster or some kind of persistence for queues and messages.

0xPolygonZero / zk_evm

Assign specific jobs to dedicated workers #507