-
Currently there is no way to use large models: there is no support for 8-bit quantization and, more importantly, no support for device mapping.
As you can see, the first GPU is filled but s…
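For context, this is the kind of loading path being asked for; a minimal sketch assuming a Hugging Face transformers stack with accelerate and bitsandbytes installed (not this project's current API, and the model name is just a stand-in):
```
from transformers import AutoModelForCausalLM

# Shard layers across all visible GPUs and quantize weights to 8 bits.
# device_map="auto" requires accelerate; load_in_8bit=True requires bitsandbytes.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-13b",   # stand-in for any large model
    device_map="auto",
    load_in_8bit=True,
)
```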
-
I would like a script to pass a single instruction and receive an answer.
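A minimal shape for such a script, sketched with Hugging Face transformers purely as a stand-in for whatever loading/generation API this project exposes:
```
import sys
from transformers import pipeline

# One-shot: read a single instruction from argv, print a single answer.
generator = pipeline("text-generation", model="gpt2")
instruction = sys.argv[1]
result = generator(instruction, max_new_tokens=128)
print(result[0]["generated_text"])
```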
-
I have noticed that models give the same answers with the same prompt. It seems as if the seed is not randomized.
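If the cause is a fixed default seed, the usual fix is to draw a fresh seed per generation; a sketch assuming a PyTorch backend (the affected project's internals may differ):
```
import random
import torch

# Pick a new seed for every request unless the user supplies one explicitly.
seed = random.randrange(2**32)
torch.manual_seed(seed)
print(f"Using seed {seed}")  # log it so a run can still be reproduced on demand
```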
-
I want to start experimenting more with Retrieval Augmented Generation. As part of that, I want to be able to calculate embeddings against different models.
I want `llm` to grow a `llm embed` comma…
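A hypothetical shape for such a command (the model name and flag are illustrative assumptions, not an existing interface):
```
llm embed -m ada-002 "my text to embed"
```
printing the embedding as a JSON array of floats so it can be piped into other tools.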
-
Hey all, great work on integrating CUDA support for the prompt tokens. How much work would it be to support GPU decoding? Currently on llama.cpp I can reach about 35 tokens per second on LLaMA 7B on a…
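For a sense of the knob being requested, a sketch in ctransformers style; `gpu_layers` is my assumption for how layer offloading might be exposed, mirroring llama.cpp's `--n-gpu-layers` option:
```
from ctransformers import AutoModelForCausalLM

llm = AutoModelForCausalLM.from_pretrained(
    "/path/to/llama-7b-ggml-q4_0.bin",  # illustrative path
    model_type="llama",
    gpu_layers=32,  # hypothetical: offload 32 layers to the GPU for decoding
)
print(llm("The quick brown fox"))
```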
-
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
- [X] I am running the latest code. Development is very rapid so there are no tagged versions as of…
-
Trying a simple example on an M1 Mac:
```
from ctransformers import AutoModelForCausalLM
llm = AutoModelForCausalLM.from_pretrained(
"/path/to/starcoderbase-GGML/starcoderbase-ggml-q4_0.bin",
…
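For reference, a minimal completed sketch of that call, assuming ctransformers' local-file API where `model_type` must be given explicitly (`"starcoder"` being my assumption for this model family):
```
from ctransformers import AutoModelForCausalLM

# Local GGML files need an explicit model_type; "starcoder" is assumed here.
llm = AutoModelForCausalLM.from_pretrained(
    "/path/to/starcoderbase-GGML/starcoderbase-ggml-q4_0.bin",
    model_type="starcoder",
)
print(llm("def fib(n):"))
```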
-
**Description**
Support for loading 4-bit quantized MPT models
**Additional Context**
Occam released it, and added support for loading it to his GPTQ fork and his KoboldAI fork, which may be u…
-
### Describe the bug
Yesterday, this was working perfectly fine. However, I decided to update it using the "update_windows.bat" file, and now I can't get any model to run. The main model I am trying …
-
I would like to be able to decode a sequence of token ids incrementally in a decoder-agnostic manner. I haven't found a straightforward way to do this with the current API - the first token is treated…
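A common decoder-agnostic workaround is to decode the growing prefix and emit only the text diff; a minimal sketch with a Hugging Face tokenizer (the model name is illustrative):
```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def incremental_decode(token_ids):
    """Yield the text each new token contributes, by decoding the growing
    prefix and diffing against the previously decoded string."""
    prev_text = ""
    for i in range(1, len(token_ids) + 1):
        text = tokenizer.decode(token_ids[:i])
        yield text[len(prev_text):]
        prev_text = text

ids = tokenizer.encode("Hello world, how are you?")
print(list(incremental_decode(ids)))
```
This is quadratic in sequence length; re-decoding only a sliding window of recent tokens is the usual optimization, but the prefix-diff version shows the idea.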