bigscience-workshop / petals

🌸 Run LLMs at home, BitTorrent-style. Fine-tuning and inference up to 10x faster than offloading
https://petals.dev
MIT License
8.89k stars · 489 forks

Force use_cache=True #496

Closed · borzunov closed this 10 months ago

borzunov commented 10 months ago

Petals supports only use_cache=True for inference.

However, we should not reject use_cache=False, since it would return identical results: it only selects the slower O(n^3) inference algorithm (recomputing attention over all previous tokens at every step) instead of the O(n^2) one that reuses the key/value cache.
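For intuition, here is a toy cost model (an illustration only, not Petals code) of why dropping the cache changes the asymptotics but not the outputs: with the cache, step t attends one new token against t cached positions; without it, step t recomputes attention for all t positions against each other.

```python
# Toy cost model: total attention "work" for generating n tokens.
# Illustrative sketch, not part of the Petals codebase.

def cost_with_cache(n: int) -> int:
    # use_cache=True: step t attends the new token against t cached positions
    # -> O(t) per step, O(n^2) total.
    return sum(t for t in range(1, n + 1))

def cost_without_cache(n: int) -> int:
    # use_cache=False: step t recomputes attention for all t positions
    # against each other -> O(t^2) per step, O(n^3) total.
    return sum(t * t for t in range(1, n + 1))

if __name__ == "__main__":
    for n in (128, 512, 2048):
        print(n, cost_with_cache(n), cost_without_cache(n))
```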

This PR accepts use_cache=False (and silently forces use_cache=True instead), since some models set this option for reasons unclear to me (see https://huggingface.co/garage-bAInd/Platypus2-70B-instruct/discussions/8), and passing it led to an AssertionError before this PR.
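A rough usage sketch of what this change enables (model name and API follow the Petals README pattern and may differ by version; the point is only that requesting use_cache=False no longer trips an assertion, it is simply treated as use_cache=True):

```python
# Sketch: generation with use_cache=False requested by the caller.
# After this PR, the Petals client uses the server-side cache anyway.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

model_name = "petals-team/StableBeluga2"  # any Petals-supported model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoDistributedModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("A cat sat on", return_tensors="pt")["input_ids"]
# Previously this raised AssertionError; now the flag is accepted and ignored.
outputs = model.generate(inputs, max_new_tokens=5, use_cache=False)
print(tokenizer.decode(outputs[0]))
```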