chenglimin opened this issue 10 months ago
Thanks for your interest in our work! We tested with an output length of 128.
Can PowerInfer support API server execution mode?
What are the dtype and activation function of Falcon-40B and OPT-30B when you evaluate vLLM on the A100? As far as I know, vLLM does not support ReLU. Where did you obtain the parameters of Falcon-40B and OPT-30B for running on vLLM? Can you paste a download link?
> Can PowerInfer support API server execution mode?
If you are referring to an API server, yes. You can use `examples/server` for that purpose. It's basically the same as in llama.cpp.
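For reference, here is a minimal, hypothetical client sketch, assuming the server is started with its defaults (host 127.0.0.1, port 8080) and exposes the llama.cpp-style `/completion` route; the prompt and output length below are placeholders.

```python
# Hypothetical sketch of calling the examples/server HTTP API.
# Assumes the llama.cpp-style /completion endpoint on the default
# 127.0.0.1:8080; adjust host, port, and fields if your build differs.
import json
import urllib.request

payload = {
    "prompt": "Explain sparse activation in one sentence.",  # placeholder prompt
    "n_predict": 128,                                         # desired output length
}
req = urllib.request.Request(
    "http://127.0.0.1:8080/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    result = json.loads(resp.read())
    print(result["content"])  # generated text, per llama.cpp's server response format
```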
> What are the dtype and activation function of Falcon-40B and OPT-30B when you evaluate vLLM on the A100? As far as I know, vLLM does not support ReLU. Where did you obtain the parameters of Falcon-40B and OPT-30B for running on vLLM? Can you paste a download link?
We tested ReluFalcon-40B and OPT-30B in FP16 format with the ReLU activation function. vLLM supports the Falcon and OPT architectures, and we just need to modify Falcon's model config to use ReLU. We use OPT-30B as is, and you can download ReluFalcon-40B from Hugging Face.
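For illustration, a minimal download sketch with `huggingface_hub`; the repository id below is an assumption, so please check the PowerInfer README for the exact name.

```python
# Hypothetical sketch: pull the ReLU-fied Falcon weights from Hugging Face.
# The repo_id is an assumption -- verify the exact repository name first.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="SparseLLM/ReluFalcon-40B",   # assumed repository id
    local_dir="./ReluFalcon-40B",
)
print("Model files downloaded to", local_dir)
```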
Where can I download the predictor of the OPT-30B model?
> Where can I download the predictor of the OPT-30B model?
We have not released the predictor of the OPT models yet. The sparse inference implementation (code + predictor) for OPT models in PowerInfer is currently internal and reproducible, but not ready for open-sourcing yet. We will release support for OPT models in the near future, so please stay tuned!
In the meantime, you can try to reproduce the predictor via the method of Deja Vu (a rough sketch follows below).
P.S: sorry for overwriting your comment by mistake🙏
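For anyone attempting that route, here is a rough, hypothetical sketch of a Deja Vu-style predictor: a small low-rank MLP trained to predict which FFN neurons fire for a given layer input. The dimensions are placeholders, and the training data (hidden states plus the corresponding activation masks) must be collected from the target model beforehand.

```python
# Hypothetical Deja Vu-style sparsity predictor: given a layer's input hidden
# state, predict which FFN neurons will be active (ReLU output > 0).
# Dimensions below are placeholders, not the exact OPT-30B values.
import torch
import torch.nn as nn

hidden_size, ffn_size, rank = 7168, 28672, 1024  # placeholder dimensions

class SparsityPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_size, rank, bias=False),
            nn.ReLU(),
            nn.Linear(rank, ffn_size, bias=False),
        )

    def forward(self, x):
        # Logits over FFN neurons; > 0 is interpreted as "will be active".
        return self.net(x)

predictor = SparsityPredictor()
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-4)
loss_fn = nn.BCEWithLogitsLoss()

def train_step(hidden_states, active_mask):
    # hidden_states: (batch, hidden_size) inputs to one transformer layer
    # active_mask:   (batch, ffn_size) 0/1 labels of which neurons fired
    logits = predictor(hidden_states)
    loss = loss_fn(logits, active_mask.float())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```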
For the Falcon-40B model, do you run with AWQ or GPTQ quantization?
What is the input length of Figure 13 for PC-High in your paper?
Please refer to the details as described in the paper for the most accurate information.
> For the Falcon-40B model, do you run with AWQ or GPTQ quantization?
For the Falcon-40B model (as well as all others), we run the INT4 model with GGML's INT4_0 quantization method, not AWQ or GPTQ.
> What is the input length of Figure 13 for PC-High in your paper?
The input length mentioned in our paper refers to the number of tokens in the input prompt we used.
For Falcon-40B, when you compare against vLLM in Figure 18, the paper mentions that vLLM is run on a single A100 (80 GB) GPU, but when Falcon-40B (FP16) is run directly on vLLM, the GPU memory is insufficient. So, for the Falcon-40B runs on vLLM in Figure 18, are you using INT4?
> What is the input length of Figure 13 for PC-High in your paper?
Sorry, I may not have made it clear. What I want to ask is: what values of `-t`, `-p`, and `-n` were set when PC-High was tested in Figure 13 of the paper? What is the number of tokens in the input prompt you used?
> For Falcon-40B, when you compare against vLLM in Figure 18, the paper mentions that vLLM is run on a single A100 (80 GB) GPU, but when Falcon-40B (FP16) is run directly on vLLM, the GPU memory is insufficient. So, for the Falcon-40B runs on vLLM in Figure 18, are you using INT4?
Nice catch! That's a testing detail we could not elaborate on due to the page limit. An A100 (80 GB) certainly cannot hold the entire model. So in the test setting, we skipped the last transformer layer for all participating systems. In this way, the A100's VRAM was just fully utilized, and the model is still in FP16 format.
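For clarity, a hypothetical sketch of that setup with the Hugging Face Falcon implementation (which exposes its decoder blocks as `model.transformer.h`); other serving frameworks would need the analogous change.

```python
# Hypothetical sketch of the "skip the last transformer layer" setting used
# to make FP16 Falcon-40B fit on one 80 GB A100: drop one decoder block.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-40b", torch_dtype=torch.float16
)
model.transformer.h = model.transformer.h[:-1]              # remove the last block
model.config.num_hidden_layers = len(model.transformer.h)   # keep config consistent
model.save_pretrained("./falcon-40b-minus-last-layer")      # serve this truncated copy
```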
> What I want to ask is: what values of `-t`, `-p`, and `-n` were set when PC-High was tested in Figure 13 of the paper? What is the number of tokens in the input prompt you used?
It aligns with all the other tests in our paper: 8 threads (`-t`), any prompt with 8 tokens (`-p`), and the output length (`-n`) is indicated on the X axis for each case.
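Assuming PowerInfer's `main` binary accepts these llama.cpp-style flags (as described above), a rough sweep over output lengths could look like the following; the binary path, model path, and prompt are placeholders.

```python
# Hypothetical sketch of the Figure 13 setting: 8 threads (-t 8), a short
# fixed prompt (-p), and a sweep over output lengths (-n). Paths are placeholders.
import subprocess

BINARY = "./build/bin/main"                         # placeholder path to the binary
MODEL = "./models/falcon-40b-relu.powerinfer.gguf"  # placeholder model path
PROMPT = "Once upon a time in a small village"      # placeholder ~8-token prompt

for n in (64, 128, 256, 512):
    subprocess.run(
        [BINARY, "-m", MODEL, "-t", "8", "-p", PROMPT, "-n", str(n)],
        check=True,
    )
```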
May I ask what the output length is for the experimental results in Figure 18 of your paper? The paper only mentions that the input lengths were 1 and 64, respectively, and the batch size was 1, but does not mention the output length. Could you please provide your test conditions?