-
Streamed responses [don't include usage info in the response](https://github.com/openai/openai-cookbook/blob/main/examples/How_to_stream_completions.ipynb). You'd have to calculate this yourself via [tiktoken]…
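For reference, a minimal sketch of that workaround: accumulate the streamed text client-side and count tokens with tiktoken. The model name and texts below are placeholders, and the small per-message overhead of chat-formatted prompts is not accounted for here.

```python
import tiktoken

def count_tokens(text: str, model: str = "gpt-3.5-turbo") -> int:
    """Approximate token count for `text` using the model's encoding."""
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        enc = tiktoken.get_encoding("cl100k_base")  # reasonable fallback
    return len(enc.encode(text))

# Pretend these pieces were accumulated from the streamed delta chunks.
streamed_pieces = ["Hello", ",", " world", "!"]
completion_text = "".join(streamed_pieces)

print(count_tokens("Say hello."))     # approximate prompt tokens
print(count_tokens(completion_text))  # approximate completion tokens
```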
-
**Describe the bug**
Can't train with multiple VMs; TPU v4-32.
It stops after loading the model and won't even load the data.
I've been trying for two days; maybe my setup is wrong.
Really want to know w…
-
I noticed that currently only a few model series, including **Qwen, ChatGLM, and GPT**, support **IFB** (in-flight batching). The lack of support for other models has severely impacted the practicality of the TRT-LLM …
-
Have everything running on Python 3.10 under Ubuntu 22.04 with 2x 24 GB GPUs.
Tested original and revised versions of `mt_bench.jsonl`, and the output is good with a 70B 4-bit GPTQ model.
Trying to incr…
-
For example: https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/PyTorch-Models/Model/qwen1.5/generate.py
The current inference output is generated all at once.
However, t…
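A minimal sketch of streaming the output instead, using the stock `transformers` `TextStreamer`. The model name is a placeholder, and plain Hugging Face classes are used here; the ipex-llm example linked above loads the model through its own `AutoModelForCausalLM`, which is assumed to accept the same `generate` kwargs.

```python
# Stream generated tokens to stdout as they are produced, rather than
# returning the full sequence at once. Model name is illustrative only.
from transformers import AutoModelForCausalLM, AutoTokenizer, TextStreamer

model_id = "Qwen/Qwen1.5-0.5B-Chat"  # placeholder; swap in your model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("What is AI?", return_tensors="pt")
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

# generate() pushes decoded text into the streamer token by token.
model.generate(**inputs, streamer=streamer, max_new_tokens=64)
```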
-
### Feature request
add option to stream output from pipeline
### Motivation
using `tokenizer.apply_chat_template`, then other steps, then `model.generate` is pretty repetitive, and I think it's time …
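For reference, a sketch of the manual streaming route this request wants `pipeline` to wrap: chat template, `generate` on a background thread, then iterating text chunks from a `TextIteratorStreamer`. The model name is a placeholder.

```python
# The repetitive boilerplate: template -> threaded generate -> iterate chunks.
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

model_id = "Qwen/Qwen1.5-0.5B-Chat"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

messages = [{"role": "user", "content": "Hello!"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
thread = Thread(
    target=model.generate,
    kwargs=dict(input_ids=input_ids, streamer=streamer, max_new_tokens=64),
)
thread.start()

for chunk in streamer:  # yields decoded text while generate() runs
    print(chunk, end="", flush=True)
thread.join()
```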
-
Poking around, I see that Riva says it is only supported up to 5.1, but there are examples in these containers that use it, and these containers all work with DP6, so I've been trying, to no avail, including…
-
### The model to consider.
https://huggingface.co/THUDM/glm-4-9b-chat
### The closest model vllm already supports.
chatglm
### What's your difficulty of supporting the model you want?
_No respons…
-
When I tested Qwen2-7B on this library, it reported some errors.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from intel_npu_acceleration_library import NPUModelForCausalL…
-
### Checklist
- [X] 1. I have searched related issues but cannot get the expected help.
- [ ] 2. The bug has not been fixed in the latest version.
- [X] 3. Please note that if the bug-related issue y…