bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev

Finetuning with personachat example is not working #255

Status: Open · slush0 opened this issue 1 year ago

slush0 commented 1 year ago

I'm trying to run https://github.com/bigscience-workshop/petals/blob/main/examples/prompt-tuning-personachat.ipynb, and it fails with the default settings, raising these exceptions:

Feb 08 10:24:01.406 [WARN] [/home/dev/projekty/petals-proj/finetuning/petals/client/sequential_autograd.py.sequential_forward:101] Caught exception when running forward via RemoteSpanInfo(start=16, end=43, peer_id=<libp2p.peer.id.ID (12D3KooWRftAHGeKyYmq35tn5Daqiu4D9767xUfZJX7E6LH2yKs9)>) (retry in 0 sec): RuntimeError("shape '[2, 272, 14336]' is invalid for input of size 1048576")

or

Feb 08 10:23:51.349 [WARN] [/home/dev/projekty/petals-proj/finetuning/petals/client/sequential_autograd.py.sequential_forward:101] Caught exception when running forward via RemoteSpanInfo(start=16, end=43, peer_id=<libp2p.peer.id.ID (12D3KooWRftAHGeKyYmq35tn5Daqiu4D9767xUfZJX7E6LH2yKs9)>) (retry in 0 sec): P2PHandlerError('Failed to call handler `TransformerConnectionHandler.rpc_forward_stream` at 12D3KooWRftAHGeKyYmq35tn5Daqiu4D9767xUfZJX7E6LH2yKs9: ')

I don't see these errors when I lower the hyperparameters as shown below. Training successfully reached ~50% progress without any error before I stopped it myself:

NUM_PREFIX_TOKENS = 4 #16
MODEL_MAX_LENGTH = 50 #256
BATCH_SIZE = 2 # 8

Besides these changes, I only set request_timeout=1800, following earlier discussions on Discord. I'm using the current git master of Petals. The model is BLOOMZ, and at the time of writing, health.petals.ml reports the bloomz swarm as "healthy".
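For reference, here is roughly how these settings fit together in the notebook. This is only a sketch of my setup: the exact keyword names (pre_seq_len, tuning_mode, and passing request_timeout to from_pretrained) reflect my reading of the example notebook and the Discord advice, so they may differ slightly from the current code:

```python
# Sketch of the notebook configuration described above (not a verified repro).
# The reduced values are the ones that trained past ~50% without errors;
# the commented values are the notebook defaults that triggered the exceptions.
from transformers import BloomTokenizerFast
from petals import DistributedBloomForCausalLM

MODEL_NAME = "bigscience/bloomz-petals"  # assumed Petals-converted BLOOMZ checkpoint
NUM_PREFIX_TOKENS = 4    # 16 by default
MODEL_MAX_LENGTH = 50    # 256 by default
BATCH_SIZE = 2           # 8 by default

tokenizer = BloomTokenizerFast.from_pretrained(MODEL_NAME)
tokenizer.model_max_length = MODEL_MAX_LENGTH

model = DistributedBloomForCausalLM.from_pretrained(
    MODEL_NAME,
    pre_seq_len=NUM_PREFIX_TOKENS,  # number of trainable prompt tokens
    tuning_mode="ptune",            # prompt tuning, as in the example notebook
    request_timeout=1800,           # raised timeout suggested on Discord
)
```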

justheuristic commented 1 year ago

Thank you for pointing this out. These specific errors are probably caused by the fact that bloom (not bloomZ) is currently held by a small number of non-redundant devices that crumble under any significant load.

Normally, both of these errors would cause inference to switch to other peers, but as it stands, there were no other peers at the time (and there still aren't).

So, TL;DR: it's broken right now. We're working on fixing it, but it might take a few evenings before we get it right.
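For context, the failover behaviour I'm referring to looks roughly like this. It is only an illustrative sketch, not the actual sequential_autograd code, and the helpers find_peers_serving and run_remote_forward are hypothetical placeholders:

```python
# Illustrative sketch of client-side failover: when a forward pass through one
# server span fails, the client retries the same block range via another peer.
import logging
import time

logger = logging.getLogger(__name__)

def forward_with_failover(block_range, inputs, find_peers_serving, run_remote_forward,
                          max_retries=5, retry_delay=0.0):
    """Try each peer serving `block_range` until one remote forward call succeeds."""
    for attempt in range(max_retries):
        peers = find_peers_serving(block_range)
        if not peers:
            # This is the situation in the report: no redundant servers exist,
            # so every retry hits the same (or no) peer and keeps failing.
            raise RuntimeError(f"No peers currently serve blocks {block_range}")
        for peer in peers:
            try:
                return run_remote_forward(peer, block_range, inputs)
            except Exception as exc:  # e.g. shape errors or a P2P handler failure
                logger.warning("Forward via %s failed (%s), trying next peer", peer, exc)
        time.sleep(retry_delay)
    raise RuntimeError(f"All retries exhausted for blocks {block_range}")
```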

slush0 commented 1 year ago

Thank you for the answer! I didn't want to bloat the initial report, but my observation is that these errors are not directly related to node utilization. I'm running one BLOOMZ node myself (on the same machine where I launch this fine-tuning script), and I realized that I'm getting various errors in the fine-tuning script even from my own node. At that time, the machine was essentially idle (tens of GB of free RAM, CPU load ~1 on an eight-core machine, etc.). I generally don't see much utilization when running my Petals node, so I don't expect there's much traffic in the pool anyway.

However, I lack the expertise to help you debug this further.

borzunov commented 1 year ago

Relevant discussion in Discord:

[Screenshot of the Discord discussion, 2023-03-13]