bigscience-workshop / petals


Fix retries during inference #523

Open borzunov opened 9 months ago

borzunov commented 9 months ago

#331 introduced a bug in inference retries that causes failures like this:

[INFO] Route found: 0:18 via …1EBzGt
[WARN] [petals.client.inference_session.step:327] Caught exception when running inference via RemoteSpanInfo(peer_id=<libp2p.peer.id.ID (12D3KooWLRHAtX9ccW9i1NvpPLigwGX9MstGw3oyCZwmh21EBzGt)>, start=0, end=18, server_info=ServerInfo(state=<ServerState.ONLINE: 2>, throughput=1040.823002928876, public_name=':duck:FYY:sun_with_face:', version='2.2.0', network_rps=1185.4980980484086, forward_rps=9887.81852782432, inference_rps=343.100557603763, adapters=(), torch_dtype='bfloat16', quant_type='nf4', using_relay=False, cache_tokens_left=1179648, next_pings={...})) (retry in 2 sec): AssertionError("Broken input cache: span=RemoteSpanInfo(peer_id=<libp2p.peer.id.ID (12D3KooWLRHAtX9ccW9i1NvpPLigwGX9MstGw3oyCZwmh21EBzGt)>, start=0, end=18, server_info=ServerInfo(state=<ServerState.ONLINE: 2>, throughput=1040.823002928876, public_name=':duck:FYY:sun_with_face:', version='2.2.0', network_rps=1185.4980980484086, forward_rps=9887.81852782432, inference_rps=343.100557603763, adapters=(), torch_dtype='bfloat16', quant_type='nf4', using_relay=False, cache_tokens_left=1179648, next_pings={...})) shape=torch.Size([1, 579, 8192]) position=0 n_input_tokens=1")
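
For context, here is a minimal, self-contained sketch (hypothetical, not the actual `petals.client.inference_session` code) of the kind of consistency check that raises above: the client keeps the hidden states it has already pushed to a span so they can be replayed after a retry, and the cached length must agree with the position the client believes the span is at. Plugging in the values from the log (a cache of 579 tokens, but `position=0` and `n_input_tokens=1` after the retry) trips the check.

```python
# Hypothetical sketch, not the actual Petals implementation: illustrates the
# "Broken input cache" consistency check seen in the traceback above.
import torch


def check_input_cache(cached_inputs: torch.Tensor, position: int, n_input_tokens: int) -> None:
    """Raise if the cached history is inconsistent with the claimed position."""
    cached_len = cached_inputs.shape[1]
    # Expected invariant: everything up to `position` is already cached, and the
    # current step appends exactly `n_input_tokens` more tokens.
    assert cached_len == position + n_input_tokens, (
        f"Broken input cache: shape={tuple(cached_inputs.shape)} "
        f"position={position} n_input_tokens={n_input_tokens}"
    )


# Values taken from the traceback in this issue: 579 tokens of hidden states
# (hidden size 8192) are cached for the span...
cached = torch.zeros(1, 579, 8192)

try:
    # ...but after the retry the session restarts the span at position=0 and only
    # sends 1 new token, so the check fails in the same way as reported.
    check_input_cache(cached, position=0, n_input_tokens=1)
except AssertionError as e:
    print(e)
```

In other words, after the retry the cached history for the span is carried over while the position is reset, so the two no longer agree; the sketch only reproduces the mismatch, it does not claim where in the retry path the real bug lives.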