SymbioticLab / Oobleck

A resilient distributed training framework
Apache License 2.0

Batch distribution returns None #16

Closed ZhuJiaqi9905 closed 7 months ago

ZhuJiaqi9905 commented 8 months ago

When node reconfiguration occurs, the algorithm calculates throughput from the batch distribution; the relevant code is the _distribute_batch function in instantiator.py. However, I find that the function returns None in the following case:

  if not all(model.nb[i].value for i in model.I):
      return None

That causes an error when creating a HeterogeneousPipelinesExecutionPlan object, since its member num_microbatches_set is None. How can I fix this? Many thanks.
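
To make the failure mode concrete, here is a minimal sketch of a caller-side guard, assuming the caller is allowed to skip a pipeline configuration that has no valid batch distribution. This is not Oobleck's code: `distribute_batch_stub` and `select_feasible_plans` are hypothetical names, and the proportional split is only a stand-in for the real ILP in `_distribute_batch`. The one part taken from the issue is the `all(...)` check that turns a zero or unset microbatch count into None.

```python
from typing import List, Optional, Sequence, Tuple


def distribute_batch_stub(
    total_microbatches: int, iteration_times: Sequence[float]
) -> Optional[List[int]]:
    """Hypothetical stand-in for _distribute_batch (NOT the real ILP).

    Splits microbatches proportionally to pipeline speed and returns None
    when any pipeline ends up with zero microbatches, mirroring the
    `if not all(...): return None` check quoted in the issue.
    """
    speeds = [1.0 / t for t in iteration_times]
    total_speed = sum(speeds)
    alloc = [int(total_microbatches * s / total_speed) for s in speeds]
    # Hand any rounding remainder to the fastest pipeline.
    alloc[speeds.index(max(speeds))] += total_microbatches - sum(alloc)
    if not all(alloc):
        return None
    return alloc


def select_feasible_plans(
    candidates: Sequence[Sequence[float]], total_microbatches: int
) -> List[Tuple[Sequence[float], List[int]]]:
    """Keep only candidates with a valid distribution (hypothetical helper)."""
    feasible = []
    for times in candidates:
        distribution = distribute_batch_stub(total_microbatches, times)
        if distribution is None:
            # Skip instead of building a plan whose microbatch set is None.
            continue
        feasible.append((times, distribution))
    if not feasible:
        raise RuntimeError("no candidate has a valid batch distribution")
    return feasible
```

With this guard, a candidate whose slowest pipeline would receive zero microbatches is dropped before any plan object is constructed, and the caller gets an explicit error only when every candidate is infeasible.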

insujang commented 7 months ago

All of Oobleck is currently being refactored, including swapping out the ILP solver (#18). I'll let you know when it is done. Please try again later.

ZhuJiaqi9905 commented 7 months ago

Thanks. It seems that Oobleck is being rewritten in Rust. That's cool!

insujang commented 7 months ago

Hi, #18 has just been merged. Could you please try again? The code is newly written and functionally works, but it may still generate wrong results. Please reopen this issue or create a new one if you see any weird behavior. Thanks!