-
Hello,
First I'll say, really impressed by this library and looking forward to TTS!
I ran the example project on my Android Pixel 7 (same one you used) and I am not seeing the same performance t…
-
With limited memory on most phones, there are community requests to support a smaller model such as Phi-3 mini. It may be supported out of the box, but it needs verification, evaluation, and pr…
-
Hi,
I am trying to use this framework with causal models such as Llama-based models and other LLMs. In my case, I use TinyLlama and Pythia to replace the T5 model in the original pipeline (TinyLlam…
-
Hi unslothai, I got different inference results when using Unsloth. I've tested Qwen1.5-chat and TinyLlama-chat and hit the same issue: generation with Unsloth always gives a worse result compared with transformers …
-
Hi @ilur98, thanks for your great work on this repository. I am attempting to modify your work to support W8A8, as I found that static W4A8 gives too large a quantization error.
I am r…
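As a quick illustration of why the bit width matters here, the sketch below (hypothetical, not code from this repository) compares the round-trip error of symmetric per-tensor weight quantization at 4 and 8 bits on a toy weight tensor:

```python
# Hypothetical sketch: symmetric per-tensor quantize/dequantize round trip,
# comparing the mean squared error at 4 bits (W4) vs 8 bits (W8).

def quantize_dequantize(weights, bits):
    """Symmetric per-tensor quantization followed by dequantization."""
    qmax = 2 ** (bits - 1) - 1                    # 7 for int4, 127 for int8
    scale = max(abs(w) for w in weights) / qmax   # per-tensor scale
    return [round(w / scale) * scale for w in weights]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

weights = [0.013 * i - 0.5 for i in range(77)]    # toy weight tensor
err4 = mse(weights, quantize_dequantize(weights, 4))
err8 = mse(weights, quantize_dequantize(weights, 8))
print(err4 > err8)  # the 4-bit grid is much coarser
```

The coarser 4-bit grid dominates the error, which matches the motivation for moving the weights to 8 bits while keeping 8-bit activations.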
-
I would like to request one or two examples of how to adapt this for popular open models, such as:
https://huggingface.co/mistralai/Mistral-7B-v0.1
https://huggingface.co/meta-llama/Llama-2-7b-hf
h…
-
**Description**
When a user performs a long-running inference request via HTTPServer, they may lose the connection or intentionally abort it (ctrl-c from curl).
Ideally, the HTTP server will…
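One way this could work, sketched below with plain asyncio (names and structure are illustrative assumptions, not the project's actual server code): when the client disconnects, the handler task is cancelled, and the inference coroutine catches the cancellation to release its resources (e.g. a KV cache or batch slot) before exiting.

```python
import asyncio

# Hypothetical sketch: a long-running "inference" coroutine that cleans up
# when its task is cancelled, e.g. because the HTTP client disconnected.

async def run_inference(steps, released):
    try:
        for _ in range(steps):
            await asyncio.sleep(0.01)   # stand-in for one decode step
    except asyncio.CancelledError:
        released.append("freed")        # release KV cache / batch slot here
        raise                           # re-raise so the task ends cancelled

async def main():
    released = []
    task = asyncio.create_task(run_inference(1000, released))
    await asyncio.sleep(0.05)           # client aborts (ctrl-c from curl)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return released

print(asyncio.run(main()))  # ['freed']
```

The key point is re-raising `CancelledError` after cleanup, so the task is properly marked as cancelled rather than swallowing the abort.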
-
### Summary
This only works when the available RAM is several times the model size. I think we could demo a PoC using TinyLlama.
1. Start several API server instances, each on a different port.
2. S…
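Assuming the instances are up (one per port), the PoC client could spread requests across them with a simple round-robin, sketched below. The ports and the dispatch shape are assumptions for illustration; a real client would POST each request to the chosen instance.

```python
import itertools

# Hypothetical sketch: round-robin dispatch of requests across several
# API server instances, each assumed to listen on its own local port.

ports = [8000, 8001, 8002]
next_port = itertools.cycle(ports)

def dispatch(request_id):
    """Pick the next instance for a request; a real client would POST
    to http://localhost:{port} here instead of returning a tuple."""
    port = next(next_port)
    return (request_id, port)

assignments = [dispatch(i) for i in range(6)]
print(assignments)
# [(0, 8000), (1, 8001), (2, 8002), (3, 8000), (4, 8001), (5, 8002)]
```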
-
TinyLlama fine-tuned for function calling