efeslab / Nanoflow

A throughput-oriented high-performance serving framework for LLMs
https://arxiv.org/abs/2408.12757
Apache License 2.0

The output is wrong when using serve.py. #5

Open alexngng opened 3 weeks ago

alexngng commented 3 weeks ago

I followed the instructions and used WeightSaver.py to convert a meta-llama/Llama2-70B-base model. I then used gen_req.py to produce a test dataset:

python3 gen_req.py "The University of Washington is located" 100 0 trace.csv

The model paths in the code repository were all set to "meta-llama/Llama2-70B-chat". I changed them to the paths of the Llama2-70B models that I downloaded locally. Then I ran serve.py:

python3 server.py --trace_path trace.csv

But the output file trace.csv.out is weird:

The University of Washington is located...,,,,......................................................,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,, The University of Washington is located rom the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the the The University of Washington is located, profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile profile and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and and The University of Washington is located = the the the the the the the the the the the the the the the the the the the the the the the the the the the the The University of Washington is located'''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''' The University of Washington is located = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = = The University of Washington is locatedissfttytyftfttytytytytytyty height height height height height height height height height height height height height height height height height height height height height height 
height height height height height height extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra extra The University of Washington is located the’ Mu’’’’’’’’’’ Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu Mu’’’’’’’’’’’..’’’...........................//...//// The University of Washington is located synth organ organ organ organ organ,,,,,,,,,,
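For anyone triaging output like the above, here is a minimal sketch of a check for this kind of degenerate repetition. It is not part of Nanoflow; the function names and the threshold of 8 are made up for illustration, and it only looks at whitespace-separated tokens, so runs of punctuation with no spaces (like the commas above) would need a separate check.

```python
from itertools import groupby
from typing import Tuple


def longest_token_run(text: str) -> Tuple[str, int]:
    """Return the token with the longest consecutive run and the run length."""
    best_token, best_len = "", 0
    for token, group in groupby(text.split()):
        run = sum(1 for _ in group)
        if run > best_len:
            best_token, best_len = token, run
    return best_token, best_len


def looks_degenerate(text: str, max_run: int = 8) -> bool:
    """Flag text whose longest single-token run exceeds max_run (arbitrary threshold)."""
    return longest_token_run(text)[1] > max_run


sample = "The University of Washington is located rom " + "the " * 20
print(longest_token_run(sample))  # ('the', 20)
print(looks_degenerate(sample))   # True
```

Running each completion from the output file through `looks_degenerate` gives a quick yes/no signal before comparing against a reference model.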

alexngng commented 3 weeks ago

GPU: 4xA100-80G, torch 2.4.0+cu121

serendipity-zk commented 3 weeks ago

Thanks for your question. The current version of Nanoflow works only on 8xA100. When fewer than 8 cards are present, Nanoflow assumes empty results for the missing GPUs, which causes incorrect output.
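Since the failure is silent, a pre-flight check can help. The following is only a sketch, not something shipped with Nanoflow: `check_gpu_count` and `visible_gpu_count` are hypothetical helpers, and the required count of 8 is taken from the reply above.

```python
REQUIRED_GPUS = 8  # from the reply above; applies to the current version only


def check_gpu_count(available: int, required: int = REQUIRED_GPUS) -> None:
    """Raise if fewer GPUs are visible than the pipeline expects."""
    if available < required:
        raise RuntimeError(
            f"Found {available} GPU(s) but {required} are required; "
            "the output would be incorrect."
        )


def visible_gpu_count() -> int:
    """Number of CUDA devices PyTorch can see, or 0 if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return 0
    return torch.cuda.device_count()
```

Calling `check_gpu_count(visible_gpu_count())` before launching the server would turn the garbled output on a 4-GPU machine into an explicit error.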

alexngng commented 3 weeks ago

> Thanks for your question. The current version of Nanoflow works only on 8xA100. When fewer than 8 cards are present, Nanoflow assumes empty results for the missing GPUs, which causes incorrect output.

Thanks for your reply! I will test it on 8xA100.

aikitoria commented 2 weeks ago

Does it work on 8x other GPUs, such as 4090s? Or are only A100s supported?

CSEEduanyu commented 2 weeks ago

> Thanks for your question. The current version of Nanoflow works only on 8xA100. When fewer than 8 cards are present, Nanoflow assumes empty results for the missing GPUs, which causes incorrect output.

Will fewer GPUs be supported?

serendipity-zk commented 2 weeks ago

4090s do not have NVLink to move data efficiently between GPUs, so the pipeline would need to be redesigned to accommodate the longer communication time. We will work on supporting Nanoflow with fewer GPUs. However, fewer GPUs would reduce the request batch size, so the same throughput cannot be reached.
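To give a rough feel for the batch-size point, here is a back-of-the-envelope sketch. It assumes the batch is limited by the GPU memory left for the KV cache after loading fp16 weights, and it ignores activations and other overhead; those assumptions are mine, not a description of Nanoflow's actual memory planning. The layer, KV-head, and head-dimension values are the public Llama-2-70B configuration, and the parameter count is rounded to 70e9.

```python
GB = 1e9

# Public Llama-2-70B configuration; fp16 storage is an assumption.
NUM_LAYERS = 80
NUM_KV_HEADS = 8       # grouped-query attention
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # fp16
NUM_PARAMS = 70e9      # rounded


def kv_bytes_per_token() -> int:
    """KV-cache bytes for one token: keys plus values across all layers."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_VALUE


def max_cached_tokens(num_gpus: int, mem_per_gpu_gb: float = 80.0) -> int:
    """Tokens that fit in the memory left after the weights (overhead ignored)."""
    total = num_gpus * mem_per_gpu_gb * GB
    weights = NUM_PARAMS * BYTES_PER_VALUE
    free = max(total - weights, 0.0)
    return int(free // kv_bytes_per_token())


for gpus in (4, 8):
    print(gpus, "GPUs:", max_cached_tokens(gpus), "tokens of KV cache")
```

Under these assumptions, going from 8 to 4 A100-80G cards cuts the KV-cache budget by more than half (from about 1.5M to about 0.55M tokens), because the weights take a fixed share of the total memory.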