-
## Overview
In the future, we want to support **multiple ML backends** for each _endpoint_.
Example:
`chat/completions` can use:
- llama.cpp
- candle
- tensorrt
`audio/transcriptions` can us…
-
When I pretrained a large language model on a Tesla V100S-PCIE-32GB:
```shell
lightning run model --node-rank=0 --main-address=10.142.6.35 --accelerator=cuda --devices="0,1,2,3,6,7" --num-nodes=1 pretrain/tiny…
```
-
First of all, thank you for the great work in providing us with all those examples. However, when I tried to use the Llama model in HF format, it didn't work. I was able to convert the HF format model…
mzbac, updated 9 months ago
-
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of…
-
### What happened?
Cannot quantize the model due to missing metadata in the model card; when using transformers, `model.push_to_hub` does not provide it.
-
I used the default translation from step 2, but sadly many of those translations, at least from English to Polish, are gibberish and absolutely terrible. https://huggingface.co/datasets/chryst…
-
**Describe the bug**
When compressing LLAMA2 with int4 weights, an error message comes out during this step:
Step: `compressed_model = nncf.compress_weights(ov_model, **model_compression_params)`…
-
Running a 1.1B / 5.0 bpw draft model alongside a 70B / 4.625 bpw model on a dual 7900 XTX system (ROCm 5.7.1, amdgpu driver 6.2.4).
On a near-empty context I get a slight boost from around 12 t/…
-
Here is my sky setup, using `0.4.1` and `awscliv2` to submit. CC @mmcclean-aws & Team at Annapurna
```yaml
...
resources:
cloud: aws
# AWS inferentia, including neuronx
# https://githu…
-
### Describe the bug
I am on the `dev` branch right now! Very important to note.
I loaded `mistral-7b-instruct-v0.1.Q5_K_M.gguf` and `mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf` using llama.cpp and …