collabora / WhisperFusion

WhisperFusion builds upon the capabilities of WhisperLive and WhisperSpeech to provide seamless conversations with an AI.

How do I run this on a cloud server? #57

Closed pranav-deshpande closed 3 weeks ago

pranav-deshpande commented 1 month ago

Hi, how does one run this on the cloud, and how can it be scaled across multiple GPUs?

pranav-deshpande commented 1 month ago

I get this error after following the instructions for the Docker setup:

```
whisperfusion-1  | [07/29/2024-03:02:55] [TRT] [I] Serialized 22 timing cache entries
whisperfusion-1  | [07/29/2024-03:02:55] [TRT-LLM] [I] Timing cache serialized to model.cache
whisperfusion-1  | [07/29/2024-03:02:55] [TRT-LLM] [I] Serializing engine to Phi-3-mini-4k-instruct/rank0.engine...
whisperfusion-1  | [07/29/2024-03:04:00] [TRT-LLM] [I] Engine serialized. Total time: 00:01:05
whisperfusion-1  | [07/29/2024-03:04:01] [TRT-LLM] [I] Total time of building all engines: 00:01:20
whisperfusion-1  | cp: cannot stat 'vocab.json': No such file or directory
whisperfusion-1  | cp: cannot stat 'merges.txt': No such file or directory
whisperfusion-1 exited with code 1
```

makaveli10 commented 3 weeks ago

@pranav-deshpande This is fixed in #59, feel free to test it out. As for scaling, TensorRT-LLM supports running the LLM on multiple GPUs: https://nvidia.github.io/TensorRT-LLM/architecture/core-concepts.html#multi-gpu-and-multi-node-support
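
For context, multi-GPU inference in TensorRT-LLM generally means building the engine with tensor parallelism and then launching one rank per GPU. A rough sketch of that workflow, using the Phi-3 model name from the log above; the script names (`convert_checkpoint.py`, `trtllm-build`, `run.py`) follow TensorRT-LLM's example scripts and their exact flags vary by release, so treat this as an outline rather than a copy-paste recipe:

```
# Convert the HF checkpoint with 2-way tensor parallelism (tp_size=2).
python convert_checkpoint.py --model_dir Phi-3-mini-4k-instruct \
                             --output_dir ./ckpt_tp2 \
                             --tp_size 2

# Build one engine per rank from the sharded checkpoint.
trtllm-build --checkpoint_dir ./ckpt_tp2 --output_dir ./engine_tp2

# Run with one MPI rank per GPU (2 GPUs here).
mpirun -n 2 python3 run.py --engine_dir ./engine_tp2
```

Note that WhisperFusion itself would still need its LLM-serving step pointed at the multi-rank engine; the sketch only covers the TensorRT-LLM side.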