gevulotnetwork / gevulot

Gevulot is an internet scale compute network for zero-knowledge proof generation and verification.
https://gevulot.com
Apache License 2.0
154 stars, 48 forks

When `ResourceError` happens, scheduler should try to `pick_task` #162

Closed: ghostant-1017 closed this issue 7 months ago

ghostant-1017 commented 7 months ago

Assume we have 3 programs in `pending_programs`. We successfully start the 1st and 2nd programs, but `start_program` fails for the 3rd. In that case the logic should continue on to `pick_task`, right?

Please correct me if I'm wrong.

tuommaki commented 7 months ago

> Assume we have 3 programs in `pending_programs`. We successfully start the 1st and 2nd programs, but `start_program` fails for the 3rd. In that case the logic should continue on to `pick_task`, right?

No, I don't think so.

There are two things here:

  1. The original `pending_programs` queue has been cloned. Every element in it should be processed.
  2. In the middle of the cloned `pending_programs` queue there can be programs with very high resource requirements that don't fit the node at the moment, while later in the queue there might be less resource-intensive programs that do fit. Those should be scheduled to run before moving on to the next new task.
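To make the two points concrete, here is a minimal sketch of a single pass over a cloned pending queue. It is not the actual Gevulot scheduler code: `Program`, `Node`, the memory-only resource model, the unit-variant `ResourceError`, and `schedule_pending` are all hypothetical stand-ins chosen for illustration.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for a queued program; only memory is modeled.
#[derive(Clone, Debug, PartialEq)]
struct Program {
    name: &'static str,
    mem_mb: u64,
}

/// Hypothetical error, named after the `ResourceError` in the issue title.
#[derive(Debug, PartialEq)]
enum ResourceError {
    NotEnoughResources,
}

/// Toy node that tracks free memory and the names of started programs.
struct Node {
    free_mem_mb: u64,
    running: Vec<&'static str>,
}

impl Node {
    /// Starts the program if it fits into the currently free memory.
    fn start_program(&mut self, p: &Program) -> Result<(), ResourceError> {
        if p.mem_mb > self.free_mem_mb {
            return Err(ResourceError::NotEnoughResources);
        }
        self.free_mem_mb -= p.mem_mb;
        self.running.push(p.name);
        Ok(())
    }
}

/// One pass over a clone of the pending queue. Every element is visited;
/// a program that does not fit is kept for a later pass instead of
/// aborting the loop, so smaller programs behind it can still start.
fn schedule_pending(node: &mut Node, pending: &mut VecDeque<Program>) {
    let snapshot = pending.clone();
    pending.clear();
    for program in snapshot {
        if node.start_program(&program).is_err() {
            // Does not fit right now: re-queue it and keep going.
            pending.push_back(program);
        }
    }
}
```

With 8 MB free and a queue of programs needing 2, 16 and 4 MB, this pass starts the first and third programs and leaves the 16 MB program queued. Only after the whole snapshot has been walked would the scheduler move on to picking a new task.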

The scheduling policy here is ultimately a decision on priority vs. utilization. Currently we don't have proper real-world experience with all the different kinds of workloads, so we don't know what the best approach is in terms of throughput vs. fairness.

The current implementation does have a negative property: it favors small tasks and can discriminate against very large ones. I'm currently not sure whether this is a bad thing, though. Given the circumstances in devnet, it reduces the chances of annoying or malicious behavior.
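The small-task bias can be illustrated with a toy simulation. Everything in it is an assumption made for the example, not a description of Gevulot: a node with capacity 8, one large task that needs the whole node, a small task of size 2 arriving every round and running for 2 rounds, and a hypothetical `Policy` enum that contrasts skipping over a task that does not fit with stopping at it.

```rust
#[derive(Clone, Copy, PartialEq)]
enum Policy {
    /// Skip a task that does not fit and try the ones behind it.
    SkipAndContinue,
    /// Stop at the first task that does not fit (head-of-line blocking).
    StrictFifo,
}

/// Returns the round in which the large task starts, or `None` if it is
/// still waiting after `rounds` rounds. All numbers are illustrative.
fn large_task_start_round(policy: Policy, rounds: u32) -> Option<u32> {
    const CAPACITY: u64 = 8;
    const LARGE: u64 = 8; // needs the whole node
    const SMALL: u64 = 2;
    const SMALL_DURATION: u32 = 2;

    // (memory, round in which the task finishes); one small task is
    // already running when the large task is queued.
    let mut running: Vec<(u64, u32)> = vec![(SMALL, 1)];
    // Pending memory requirements; the large task sits at the head.
    let mut pending: Vec<u64> = vec![LARGE];

    for round in 0..rounds {
        // Drop tasks that have finished by this round.
        running.retain(|&(_, end)| end > round);
        // A new small task arrives every round.
        pending.push(SMALL);

        let mut free = CAPACITY - running.iter().map(|&(m, _)| m).sum::<u64>();
        let mut still_pending = Vec::new();
        let mut blocked = false;
        for mem in pending.drain(..) {
            if !blocked && mem <= free {
                // The large task is identified by its size in this toy model.
                if mem == LARGE {
                    return Some(round);
                }
                free -= mem;
                running.push((mem, round + SMALL_DURATION));
            } else {
                still_pending.push(mem);
                if policy == Policy::StrictFifo {
                    blocked = true;
                }
            }
        }
        pending = still_pending;
    }
    None
}
```

In this model `large_task_start_round(Policy::SkipAndContinue, 100)` returns `None`, because a small task is always occupying part of the node, while `Policy::StrictFifo` returns `Some(1)` at the cost of leaving capacity idle in round 0. That is the throughput vs. fairness trade-off in miniature.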

ghostant-1017 commented 7 months ago

Got it. I'm new to Gevulot, and I'll try to understand it. Thank you for your explanation! @tuommaki