slush0 opened this issue 1 year ago
Thank you for pointing this out. These specific errors are probably caused by the fact that bloom (not bloomZ) is currently held by a small number of non-redundant devices that crumble under any significant load.
Normally, both these errors would cause inference to switch to other peers, but as it stands, there were no other peers at the time (and there still aren't).
So, TL;DR: it's broken right now; we're working on fixing it, but it might take a few evenings before we get it right.
Thank you for the answer! I didn't want to bloat the initial report, but my observation is that such errors are not directly related to a node's utilization. I'm running one Bloomz node myself (on the same machine where I launch this finetune script), and I noticed I'm also getting various errors in the finetuning script even from my own node. At that time, the machine was essentially idle (tens of GB of free RAM, CPU load ~1 on an eight-core machine, etc.). I generally don't see much utilization of my node when running it as a Petals node, so I don't expect there's much traffic on the pool anyway.
However, I lack the expertise to help you debug the cause of this any further.
Relevant discussion in Discord:
I'm trying to run https://github.com/bigscience-workshop/petals/blob/main/examples/prompt-tuning-personachat.ipynb and it is failing on default settings with these exceptions:
or
I don't see such errors when I lower the hyperparameters like this. I successfully trained up to ~50% progress without any error, then stopped it myself:
Besides these changes, I only set
request_timeout=1800
according to previous discussions on Discord. I'm using the current git master
of Petals. The model is Bloomz, and at the time of reporting, health.petals.ml reports a "healthy" state for the bloomz pool.
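For reference, a minimal sketch of where that timeout override fits into the notebook's setup. It assumes (as I understood from the Discord discussion) that request_timeout is simply forwarded to the Petals client through from_pretrained, and that the bloomz swarm uses the usual bigscience/bloomz-petals checkpoint name; the tuning_mode and pre_seq_len values mirror the prompt-tuning example and are illustrative, not my exact settings:

```python
import torch
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloomz-petals"  # the bloomz swarm discussed above

tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)

# Assumption: extra keyword arguments such as request_timeout are passed
# through to the Petals client config; 1800 s is the value mentioned above.
model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME,
    tuning_mode="ptune",   # prompt tuning, as in the personachat notebook
    pre_seq_len=16,        # illustrative prefix length, not my exact value
    request_timeout=1800,
)
model = model.to("cuda" if torch.cuda.is_available() else "cpu")
```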