xorange opened this issue 5 months ago
Hi there. CUDA OOM is usually expected behaviour, and it looks like it can happen in your setup. Maybe you can try int8 quantization, which saves a lot of CUDA memory.
I'm not sure why rtp-llm loads this model successfully but then fails when provided with a chat template.
I had not even started to chat.
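That said, if int8 is the way to go, is this roughly the right way to enable it? Below is a minimal sketch of what I would try, assuming the quantization is switched on through an environment variable that is read when the model is built; the name `INT8_MODE` is only my guess, so please point me at the correct knob if it is different.

```python
import os

# Assumption on my side: int8 weight quantization is toggled by an environment
# variable read at model-build time. "INT8_MODE" is a placeholder/guess, not a
# confirmed rtp-llm flag.
os.environ["INT8_MODE"] = "1"

from maga_transformer.model_factory import ModelFactory

# Build the model only after setting the variable so the setting is picked up.
# ModelFactory.from_huggingface follows the rtp-llm examples as I remember them.
model = ModelFactory.from_huggingface("Qwen/Qwen-7B-Chat")
```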
RTX 4090 24G, Qwen-7B-Chat
This loads OK:
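(The exact snippet is not shown here; it was essentially just loading the model through the Python API, roughly like the sketch below. The `ModelFactory`/`Pipeline` names follow the rtp-llm examples as far as I remember, so treat them as approximate.)

```python
from maga_transformer.model_factory import ModelFactory
from maga_transformer.pipeline import Pipeline

# Plain load of Qwen-7B-Chat; this fits in the 4090's 24G without errors.
# API names are from my recollection of the rtp-llm examples, so approximate
# rather than exact.
model = ModelFactory.from_huggingface("Qwen/Qwen-7B-Chat")
pipeline = Pipeline(model, model.tokenizer)
```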
But the following causes an OutOfMemoryError:
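(Again the exact snippet is not shown; roughly, it is the same load plus a ChatML-formatted prompt with a system turn, following the SystemPrompt-Tutorial. The pipeline API names below are approximate, as above, and the error appears before I get any chat turn back.)

```python
from maga_transformer.model_factory import ModelFactory
from maga_transformer.pipeline import Pipeline

# Same load as above; the load itself is fine.
model = ModelFactory.from_huggingface("Qwen/Qwen-7B-Chat")
pipeline = Pipeline(model, model.tokenizer)

# Qwen-7B-Chat uses the ChatML format, so adding a system prompt per the
# SystemPrompt-Tutorial means the rendered prompt looks like this
# ("You are a helpful assistant." is a placeholder; my real system prompt is longer).
system_prompt = "You are a helpful assistant."
prompt = (
    f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
    "<|im_start|>user\nhello<|im_end|>\n"
    "<|im_start|>assistant\n"
)

# The OutOfMemoryError shows up along this path, before any response comes back.
for res in pipeline(prompt, max_new_tokens=512):
    print(res.batch_response)
pipeline.stop()
```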
I've tried both with and without `export ENABLE_FMHA=OFF`.
I'm referring to this link: SystemPrompt-Tutorial. For the record, my requirements here are:
Because of requirement 1, running `python3 -m maga_transformer.start_server` and sending HTTP POSTs with OpenAI-style requests is not an option for me. (Or, if there is a way to switch to a different adapter on an already-running server, please tell me.)