emorice / galp

Incremental distributed python runner
MIT License

Parallel submission got extremely inefficient #87

Closed by emorice 6 months ago

emorice commented 1 year ago

Somewhere during the last refactors, as the timing of message processing evolved, the behavior when sending many tasks got quite bad. In practice, this translated into the timeout of test_parallel exploding.

Edit: current timing on orion to illustrate:

_____________________________ test_parallel_tasks ______________________________
----------------------------- Captured stdout call -----------------------------
Warmup: 8.232781887054443s
Run   : 3.2642455101013184s
emorice commented 1 year ago

From a quick look, I think that various design choices -- most importantly the on-demand start of workers, but also the more generic message parsing and validation -- have significantly increased the time to answer the first wave of STAT requests. Because of this, a faulty behavior emerged when several STAT replicas are processed; before, the time window was too short to make bugs of that sort likely.

I don't really see a faulty behavior from the client: it sends duplicate requests at the agreed interval, and stops as soon as the first one gets through. The problem is that all the replicas seem to have been queued rather than dropped.
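For illustration, the client-side pattern looks roughly like this. This is not the actual galp client code, just a sketch of the shape of it, with an illustrative interval and a plain zmq.asyncio socket:

```python
import asyncio

import zmq
import zmq.asyncio

RESEND_INTERVAL = 0.1  # seconds; illustrative value, not the real one

async def request_with_resend(socket: zmq.asyncio.Socket, msg: bytes) -> bytes:
    """Send `msg`, re-sending at a fixed interval until a reply arrives.

    Each re-send is a replica that the broker is expected to recognize
    and drop while the original request is still in flight.
    """
    while True:
        await socket.send(msg)
        try:
            # Stop re-sending as soon as the first reply gets through
            return await asyncio.wait_for(socket.recv(), timeout=RESEND_INTERVAL)
        except asyncio.TimeoutError:
            continue  # no reply yet, send another replica
```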

emorice commented 1 year ago

The behavior of the broker isn't actually wrong either: the allocations get added and removed sequentially. But, probably because of the fair-queuing rules of the ROUTER socket, the broker sees the response from the worker before it has processed the replicas. The replicas are queued in the zmq-level queue before the first answer arrives, but the broker doesn't get around to processing them until after it; from that point it concludes that they are new requests that need not be dropped.
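A minimal sketch of the dedup scheme and the race, with illustrative names rather than the actual broker handlers:

```python
pending: set[bytes] = set()  # request ids with an allocation in flight

def on_client_message(request_id: bytes) -> None:
    """Drop replicas of requests that are already in flight."""
    if request_id in pending:
        return               # replica of an in-flight request: drop it
    pending.add(request_id)  # allocate and forward to a worker
    ...

def on_worker_answer(request_id: bytes) -> None:
    """The worker answered: release the allocation, forward the answer."""
    pending.discard(request_id)
    ...

# The race: the ROUTER socket fair-queues its peers, so the worker's
# answer can be *processed* before replicas that were *received*
# earlier and are still sitting in the zmq-level queue. By the time
# those replicas are read, `request_id` is no longer in `pending`, so
# each one looks like a fresh request and gets forwarded again.
```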

This is not particularly surprising; it points to broker-side dropping being a temporary workaround rather than a good way to ensure requests are processed exactly once.

emorice commented 6 months ago

Fixed with 413acd8fb0b7b74973b68e7c8922e1bfc3d8f31e, as we don't re-send requests anymore. On my old cheap laptop:

------------------------------------------------------------------------- Captured stdout call --------------------------------------------------------------------------
Warmup: 3.308 s
Run   : 1.584 s

The warmup is still significant because we need to wait for many processes to spawn, and there's still quite a bit of overhead, but it's not catastrophic anymore.
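For contrast with the first sketch, the post-fix client pattern is just a single send (same imports and socket setup assumed as above); with no replicas in flight, the dedup race can't occur:

```python
async def request_once(socket: zmq.asyncio.Socket, msg: bytes) -> bytes:
    """Post-fix behavior: send the request once and wait for the reply.

    Barring disconnects or queue overflows, zmq delivers the message to
    a connected peer, so re-sending only manufactured replicas for the
    broker to mis-handle.
    """
    await socket.send(msg)
    return await socket.recv()
```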