Closed Msiavashi closed 2 months ago
It works fine with mistralai/Mixtral-8x7B-Instruct-v0.1; however, it is incredibly slow in generating even 10 tokens, taking over 40 minutes. Both DRAM and GPU memory usage increase at a very slow rate.
The model creation gets stuck at 94% and remains there for over 40 minutes until it finishes.
Model create: 94%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 930/994 [00:15<00:00, 2517.89it/s]
We have a bug with the progress bar; we are working on fixing the confusion. The first generation is slow since the parameters need to be read from disk initially, and all caches need to warm up first.
@drunkcoding I just wanted to follow up and let you know that I've observed a consistent delay in multiple runs. The initial run does take longer, but subsequent runs also experience significant execution time. As an example, here are the timing results for the third run of 'TheBloke/Mixtral-8x7B-v0.1-GPTQ':
real    8m13.200s
user    42m47.586s
sys     9m18.702s
Considering the setup of A100 80GB + 256GB of DRAM, is it normal to observe these timings based on your own experiments?
To keep us on the same page, "multiple runs" does not mean running the script multiple times; it means feeding more inputs to the model while the framework is still running. After the second input, it is very likely that we will observe the latency drop:
for input in inputs:
    model.generate(input)
Hi. While trying to run readme_example.py on an A100 80GB, I get the following error after waiting for around 10 minutes. CUDA 12.4 is installed and added to the path correctly, and CUDA_HOME is set. I do not see this issue when running with Hugging Face transformers.
Any idea what the problem might be?