Closed. Judtoff closed this pull request 5 months ago.
[!IMPORTANT]
Review skipped

Review was skipped as selected files did not have any reviewable changes.

Files selected but had no reviewable changes (1):
* glados_config.yml

You can disable this status message by setting `reviews.review_status` to `false` in the CodeRabbit configuration file.
The `LlamaServerConfig` class in `llama.py` has been enhanced with two new boolean attributes, `enable_split_mode` and `enable_flash_attn`, allowing for more customizable server configurations. Additionally, the `glados_config.yml` file has been updated to reflect these changes and include new settings for `LlamaServer`. This update also changes the `interruptible` setting for `Glados` and updates the model path.
| File | Change Summary |
|---|---|
| glados/llama.py | Added `enable_split_mode` and `enable_flash_attn` attributes to the `LlamaServerConfig` class. |
| glados_config.yml | Updated the `interruptible` setting for `Glados`, changed `model_path`, and added new server settings. |
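For orientation, here is a minimal sketch of what the two new attributes could look like on the `LlamaServerConfig` dataclass. The attribute names come from this PR; the other fields, the placeholder values, and the use of `@dataclass` are assumptions for illustration, and the `False` defaults follow the author's follow-up comment below rather than the code as originally submitted.

```python
from dataclasses import dataclass


@dataclass
class LlamaServerConfig:
    """Illustrative sketch; only fields relevant to this PR are shown."""

    # Placeholders standing in for the existing fields (not the real defaults).
    model_path: str = "./models/model.gguf"
    context_length: int = 8192

    # New in this PR: row split mode for multi-GPU setups and flash attention.
    enable_split_mode: bool = False
    enable_flash_attn: bool = False
```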
🐰✨
In code's realm, new features bloom,
Split modes and flash attention loom.
Configs updated, paths aligned,
A smoother server, finely designed.
With every change, a brighter tune,
The Llama dances to the moon.
🌕🎶
If the flags for split-mode row and flash attention aren't included in glados_config.yml, we should probably set them to False by default in llama.py. I ran into some issues when recompiling llama.cpp this afternoon; it looks like at some point the llama.cpp server binary got renamed to llama-server. There is a good chance that `enable_flash_attn: bool = True` in llama.py will cause issues for people. Changing it to `enable_flash_attn: bool = False` solves the issue (or, if they are on a branch of llama.cpp whose server supports flash attention, they won't have an issue).

Alternatively, include this in glados_config.yml:

`enable_flash_attn: false`

Sorry for the confusion this has caused; I should not have set those booleans to True in llama.py. Thanks, -Jud
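As a sketch of the fallback behaviour described above, the config loader could simply default both flags to `False` whenever they are absent from `glados_config.yml`. PyYAML and the `LlamaServer` section name are assumptions here; the project may load its configuration differently.

```python
import yaml  # PyYAML; assumed here, the project may use a different loader

with open("glados_config.yml") as f:
    cfg = yaml.safe_load(f)

# "LlamaServer" as the section name is an assumption for illustration.
server_cfg = cfg.get("LlamaServer", {})

# If the flags are missing from glados_config.yml, fall back to False so the
# resulting server command also works on llama.cpp builds without these features.
enable_split_mode = server_cfg.get("enable_split_mode", False)
enable_flash_attn = server_cfg.get("enable_flash_attn", False)
```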
I apologize if this causes any issues; I have never used git before, so there are likely a dozen best practices I've missed. I've added a pair of flags to glados_config.yml, one for Split Mode Row and one for Flash Attention, and tested them on 3x NVIDIA P40s to confirm they work as expected. I updated llama.py to take these two flags and pass them to the server command, using the same format as the context length that gets pushed to the server command. Using these flags results in a speedup with multiple GPUs; llama.cpp has documentation on them.
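A rough sketch of how the two flags might be appended to the server command, mirroring the context-length pattern mentioned above. The `--split-mode row` and `--flash-attn` spellings are llama.cpp server options; the function name, the binary path, and the rest of the command construction are assumptions for illustration, not the actual code in llama.py.

```python
def build_server_command(config: "LlamaServerConfig") -> list[str]:
    """Assemble the llama-server command, appending optional flags in the same
    way the context-length argument is appended."""
    cmd = [
        "./llama-server",  # formerly named "server" in older llama.cpp builds
        "--model", config.model_path,
        "--ctx-size", str(config.context_length),
    ]
    if config.enable_split_mode:
        # Split model tensors by rows across GPUs (llama.cpp --split-mode row).
        cmd += ["--split-mode", "row"]
    if config.enable_flash_attn:
        # Enable flash attention on llama.cpp builds that support it.
        cmd.append("--flash-attn")
    return cmd
```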
Summary by CodeRabbit
New Features
* Added `enable_split_mode` and `enable_flash_attn` settings to the configuration, allowing more customization of the server behavior.

Configuration Updates
* Changed the `interruptible` setting for `Glados` to `false`.
* Updated the `model_path` for `LlamaServer` to `"./models/Meta-Llama-3-70B-Instruct-Q5_K_M_.gguf"`.