Lightning-AI / litgpt

20+ high-performance LLMs with recipes to pretrain, finetune and deploy at scale.
https://lightning.ai
Apache License 2.0

Multi-GPU training extremely slow #337

Closed PythonLimited closed 1 year ago

PythonLimited commented 1 year ago

Hi there, I want to fine-tune openllm-13b via `adapter.py`, and I want to evaluate the performance on two (or more) GPUs before buying. So far I have tried several instances on vast.ai (which runs in containers to my knowledge; might that be an issue?), and I'm seeing a huge slowdown compared to my single A5000. Instance types I tried include 2x A6000 with no NVLink, PCIe 4.0 x16.

I downloaded openllm-13b and converted it to lit format, downloaded the Alpaca data, edited the device count in `finetune/adapter.py`, and then ran it via this command: `python finetune/adapter.py --checkpoint_dir ... --precision bf16-true` (bf16-mixed would lead to an OOM). Could anyone confirm this slowdown or correct me here? I also see it when just doing inference via the `generate/` scripts.
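
For context, the device count edited in `finetune/adapter.py` is handed to Lightning Fabric, which launches one process per GPU. A minimal, hypothetical sketch of that mechanism under those assumptions (not the actual `adapter.py` code; the model is a stand-in):

```python
import lightning as L
import torch

devices = 2  # the constant edited in finetune/adapter.py before running

# Fabric spawns one process per device and wraps the model and optimizer.
# With more than one device, gradients are synchronized across GPUs during
# backward(), which is where inter-GPU bandwidth starts to matter.
fabric = L.Fabric(devices=devices, precision="bf16-true")
fabric.launch()

model = torch.nn.Linear(4096, 4096)  # placeholder, not the 13B model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
model, optimizer = fabric.setup(model, optimizer)
```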

The main time cost seems to be `model.forward()`, at roughly 3.41 s per call.
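
For anyone wanting to reproduce that per-call number outside the training loop, here is a rough, hypothetical timing sketch (CUDA kernel launches are asynchronous, so the device has to be synchronized before reading the clock):

```python
import time
import torch

def time_forward(model: torch.nn.Module, batch: torch.Tensor,
                 warmup: int = 3, iters: int = 10) -> float:
    """Average wall-clock seconds per model.forward() call."""
    device = next(model.parameters()).device
    with torch.no_grad():
        for _ in range(warmup):          # warm up kernels / caches
            model(batch)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
        start = time.perf_counter()
        for _ in range(iters):
            model(batch)
        if device.type == "cuda":
            torch.cuda.synchronize(device)
    return (time.perf_counter() - start) / iters
```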

carmocca commented 1 year ago

From your description, it's not clear to me if you mean that using 2 devices per machine is slower or if using 2 machines is slower.

In the former case, the communication cost should be worth the extra compute power. In the latter, it depends on the connectivity between machines. I'm not familiar with vast.ai specifically.
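
One quick thing worth checking on a single box is whether the two GPUs can actually use peer-to-peer transfers (the poster mentions there is no NVLink, and containerized hosts sometimes end up without PCIe P2P as well). `nvidia-smi topo -m` shows the topology, and PyTorch can query P2P access directly; a small diagnostic sketch:

```python
import torch

# Without peer-to-peer access (NVLink or PCIe P2P), inter-GPU traffic is
# staged through host memory, which can make 2-GPU runs noticeably slower.
if torch.cuda.device_count() >= 2:
    print("GPU 0 -> 1 peer access:", torch.cuda.can_device_access_peer(0, 1))
    print("GPU 1 -> 0 peer access:", torch.cuda.can_device_access_peer(1, 0))
    for i in range(torch.cuda.device_count()):
        print(i, torch.cuda.get_device_name(i))
```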

PythonLimited commented 1 year ago

> From your description, it's not clear to me if you mean that using 2 devices per machine is slower or if using 2 machines is slower.
>
> In the former case, the communication cost should be worth the extra compute power. In the latter, it depends on the connectivity between machines. I'm not familiar with vast.ai specifically.

I'm talking about one machine with 2x GPUs. I'll rerun the tests in a few days, just to make sure I haven't gotten anything wrong and that there are no updates I missed...