-
I see that Runpod has a serverless option. Rather than stopping and starting these instances, is it possible to run these models serverless? It looks like you can modify TheBloke's Dockerfile and conf…
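For reference, a minimal sketch of what a RunPod serverless worker might look like, assuming the `runpod` Python SDK and a GGUF model served via llama-cpp-python; the model path and generation parameters are placeholders, not TheBloke's actual setup:

```python
# Hypothetical RunPod serverless worker sketch (not TheBloke's image).
# Assumes the `runpod` SDK and llama-cpp-python are installed in the container.
import runpod
from llama_cpp import Llama

# Load once at container start so warm invocations reuse the model.
llm = Llama(model_path="/models/codellama-34b-instruct.Q4_K_M.gguf")

def handler(job):
    """Take a prompt from the serverless request and return a completion."""
    prompt = job["input"]["prompt"]
    out = llm(prompt, max_tokens=256)
    return {"text": out["choices"][0]["text"]}

runpod.serverless.start({"handler": handler})
```

The Dockerfile change would then amount to installing these dependencies and making this script the container entrypoint.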
-
Hi,
Thanks for open-sourcing this!
How did you overcome the catastrophic forgetting problem in LoRA finetuning?
Performance on the HumanEval dataset dropped a lot after LoRA finetuning on my own …
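Not this repo's recipe, but a common mitigation is a conservative LoRA configuration (low rank, few target modules, small learning rate) plus replaying some general-domain data; a sketch with the `peft` library, where every hyperparameter value is an illustrative assumption:

```python
# Illustrative conservative LoRA setup intended to limit catastrophic
# forgetting; ranks, target modules, and LR are assumptions, not values
# taken from this repository.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("codellama/CodeLlama-7b-hf")

config = LoraConfig(
    r=8,                                  # low rank: fewer trainable params, less drift
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # touch only attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, config)
model.print_trainable_parameters()
# During training: keep the learning rate small (e.g. 1e-4) and mix a
# fraction of general-purpose data into the finetuning set as replay.
```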
-
Does anyone else have the problem that the CPU load does not decrease after a chat request?
I'm using [CodeLlama-34B-Instruct-GGUF](https://huggingface.co/TheBloke/CodeLlama-34B-Instruct-GGUF/blob…
-
### Reminder
- [X] I have read the README and searched the existing issues.
### System Info
## QWEN2-1.5B(0.5B)
Works fine.
## QWEN2-7B(MoE)
Requires bf16 (#4278).
Works fine.
## QWEN2-72B
Works fine, with a minor issue: it can only be launched on 8 GPUs (s…
-
Do you have plans to support other LLM models like Llama 3?
Or would it be easy to modify the code that implements the interface to OpenAI? I would like an interface using Ollama.
Any hints would be appreciated…
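One possible route, since Ollama exposes an OpenAI-compatible endpoint: point the standard `openai` client at the local Ollama server. A sketch, assuming `ollama serve` is running and a `llama3` model has been pulled:

```python
# Sketch: talk to a local Ollama server through its OpenAI-compatible API.
# Assumes `ollama serve` is running and `ollama pull llama3` has been done.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # Ollama's OpenAI-compatible endpoint
    api_key="ollama",                      # required by the client, ignored by Ollama
)

resp = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```

If the project already speaks the OpenAI API, swapping the base URL like this may be all the modification needed.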
-
### Before submitting your bug report
- [X] I believe this is a bug. I'll try to join the [Continue Discord](https://discord.gg/NWtdYexhMs) for questions
- [ ] I'm not able to find an [open issue](ht…
-
### Summary
# Motivation
WasmEdge is a lightweight inference runtime for AI and LLM applications. We want to build specialized and finetuned models for the WasmEdge community. The model should be supported by Wa…
-
**Please describe the feature you want**
I've been using a large completion model with my GPU. I'd like to add a chat model as well, but there's not enough GPU memory for the large completion model…
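One possible stopgap while such a feature is missing (a sketch, not something this project necessarily supports) is to keep the completion model on the GPU and run the chat model entirely on CPU, e.g. with llama-cpp-python and `n_gpu_layers=0`; the model path here is a placeholder:

```python
# Sketch: run the chat model CPU-only so it does not compete with the
# GPU-resident completion model. Path and context size are placeholders.
from llama_cpp import Llama

chat_llm = Llama(
    model_path="/models/chat-model.Q4_K_M.gguf",
    n_gpu_layers=0,   # keep every layer on the CPU
    n_ctx=4096,
)

out = chat_llm.create_chat_completion(
    messages=[{"role": "user", "content": "Explain this function to me."}]
)
print(out["choices"][0]["message"]["content"])
```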
-
### Your current environment
How do I get vLLM to serve CodeLlama-34B in the OpenAI format?
I run TheBloke/CodeLlama-34B-Instruct-AWQ in vLLM, but it shows 'No chat template provided. Chat API will n…
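That warning usually means the model's tokenizer config ships without a chat template, so one has to be passed to the server explicitly. A sketch, assuming the server was launched with vLLM's `--chat-template` flag pointing at a CodeLlama-Instruct Jinja template (the template file name is a placeholder):

```python
# Sketch: query a vLLM OpenAI-compatible server launched with an explicit
# chat template, e.g.:
#   python -m vllm.entrypoints.openai.api_server \
#     --model TheBloke/CodeLlama-34B-Instruct-AWQ \
#     --quantization awq \
#     --chat-template ./codellama_instruct.jinja   # placeholder template file
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="TheBloke/CodeLlama-34B-Instruct-AWQ",
    messages=[{"role": "user", "content": "Write a function that checks for palindromes."}],
)
print(resp.choices[0].message.content)
```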
-
### System Info
- CPU architecture (x86_64)
- CPU/Host memory size (64GB)
- GPU properties
  - GPU name (1x NVIDIA V100)
  - GPU memory size (32GB)
- Libraries
  - TensorRT-LLM branch or tag …