-
I am trying to use the TRT-LLM RAG demo with the Mistral 7B model.
I used int8 weight-only quantization when building the TRT engine.
The app launches but throws an error when an input is passed to …
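For reference, an int8 weight-only engine build typically follows the two-step flow from the TensorRT-LLM example scripts. The flag names below come from those examples and vary between versions, and the paths are placeholders, so treat this as a sketch rather than the exact command used:

```shell
# Step 1: convert the HF checkpoint with int8 weight-only quantization
# (flags as in TensorRT-LLM's example convert_checkpoint.py; version-dependent)
python convert_checkpoint.py \
    --model_dir ./mistral-7b-hf \
    --output_dir ./ckpt-int8 \
    --dtype float16 \
    --use_weight_only \
    --weight_only_precision int8

# Step 2: build the engine from the quantized checkpoint
trtllm-build --checkpoint_dir ./ckpt-int8 --output_dir ./engine-int8
```

A mismatch between the dtype/quantization used at build time and what the RAG app expects at load time is a common source of runtime errors with this flow.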
-
Training a 13B LLaMA model on 2 A800 (80G) machines, throughput only reaches 840 tokens/sec/GPU, and I am not sure why. Detailed configuration below:
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 1 \
--sequence-parallel \
--distributed-timeout…
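As a sanity check on that number: 840 tokens/sec/GPU on a 13B model works out to roughly 21% MFU on an A800, using the common ~6·N FLOPs-per-token estimate for forward plus backward (activation recomputation, if enabled, would consume additional FLOPs on top of this). A quick back-of-the-envelope:

```python
# Back-of-the-envelope MFU estimate.
# Assumptions: ~6 * N FLOPs per token (fwd + bwd), no activation
# recomputation, A800 BF16 dense peak of 312 TFLOPS.
params = 13e9            # model parameters
tokens_per_sec = 840     # observed throughput per GPU
peak_flops = 312e12      # A800 BF16 peak, dense

achieved = 6 * params * tokens_per_sec   # model FLOPs/sec per GPU
mfu = achieved / peak_flops
print(f"achieved: {achieved / 1e12:.1f} TFLOPs/s, MFU: {mfu:.1%}")
```

Around 21% MFU is on the low side for a 13B model; with `--tensor-model-parallel-size 4`, the heavy all-reduce traffic of a high TP degree (or too small a micro-batch) is a common suspect.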
-
### Your current environment
Referring to issue #5181, "The Offline Inference Embedding Example Fails": the method LLM.encode() only works for embedding models. Is there any way to get the ou…
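Not an answer inside vLLM itself, but one workaround is to run the same checkpoint through Hugging Face transformers with `output_hidden_states=True`, which returns per-layer hidden states for generation models too. A minimal sketch using a tiny randomly initialized GPT-2 so it runs without downloads (in practice you would load your real checkpoint with `AutoModelForCausalLM.from_pretrained` instead):

```python
import torch
from transformers import GPT2Config, GPT2LMHeadModel

# Tiny random model so the sketch runs offline; swap in
# AutoModelForCausalLM.from_pretrained(<your checkpoint>) for real use.
config = GPT2Config(n_layer=2, n_head=2, n_embd=32, vocab_size=100)
model = GPT2LMHeadModel(config).eval()

input_ids = torch.tensor([[1, 2, 3, 4]])
with torch.no_grad():
    out = model(input_ids, output_hidden_states=True)

# out.hidden_states is a tuple: embedding output plus one tensor per layer
print(len(out.hidden_states))        # n_layer + 1
print(out.hidden_states[-1].shape)   # (batch, seq_len, hidden) for the last layer
```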
-
@staticmethod
def apply_rotary(x, sinusoidal_pos):
    sin, cos = sinusoidal_pos
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # If we are rotating query/key, a direct cat below is fine, since the
    # subsequent matrix multiplication will sum over this dimension anyway…
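For context, a complete interleaved-pair rotary application along the lines of the snippet might look like the following numpy sketch (the shapes and the sin/cos packing here are assumptions, not the original code):

```python
import numpy as np

def apply_rotary(x, sin, cos):
    """Rotate interleaved (even, odd) feature pairs by per-position angles.

    x:        (..., seq_len, dim), with dim even
    sin, cos: (seq_len, dim // 2), broadcast over the leading dims of x
    """
    x1, x2 = x[..., 0::2], x[..., 1::2]   # even / odd channels, as in the snippet
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin  # 2D rotation applied pairwise
    out[..., 1::2] = x1 * sin + x2 * cos
    return out
```

Because each channel pair undergoes a pure rotation, the transform preserves vector norms, which makes an easy sanity check.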
-
When I use the following code to load Llama 2 and generate:
```python
model = HookedTransformer.from_pretrained("meta-llama/Llama-2-7b-chat-hf",
hf_mod…
-
Hi! I'm running an enc-dec transformer with RoPE in the first self-attention layer of the encoder and decoder. I'm noticing that in the eval stage of my model, it hangs until my job times out after ab…
-
I encountered a runtime error while using the transformers-interpret library with a fine-tuned LLama-2 model that includes LoRA adapters for sequence classification. The error occurs when invoking the…
-
Describe the bug
I get an AttributeError when trying to convert a llama3-8B model from HF format to mcore format; the error is below:
`AttributeError: 'Tokenizer' object has no attribute 'vocab_size'`…
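One hedged workaround, assuming the converter only reads `vocab_size` and otherwise just delegates to the tokenizer, is a thin wrapper that forwards every attribute and synthesizes the missing one. `n_words` is what Meta's tiktoken-based llama3 `Tokenizer` exposes, so treat that name as an assumption about your tokenizer class:

```python
class TokenizerShim:
    """Forward all attributes to the wrapped tokenizer, adding vocab_size."""

    def __init__(self, tokenizer):
        self._tokenizer = tokenizer

    def __getattr__(self, name):
        # Only invoked for attributes not found on the shim itself,
        # so everything else falls through to the real tokenizer.
        return getattr(self._tokenizer, name)

    @property
    def vocab_size(self):
        tok = self._tokenizer
        # llama3's tiktoken-based Tokenizer exposes n_words (assumption);
        # HF-style tokenizers expose get_vocab() instead.
        if hasattr(tok, "n_words"):
            return tok.n_words
        return len(tok.get_vocab())
```

Passing `TokenizerShim(tokenizer)` wherever the converter expects a tokenizer should then satisfy the `vocab_size` lookup without patching the converter itself.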
-
🎉 Fine-tuning (VQA/OCR/Grounding/Video) for the Qwen2-VL-Chat series of models is now supported; please check the documentation below for details:
# English
https://github.com/modelscope/ms-swift/blob/m…
-
As part of [SGLang Issue #1487](https://github.com/sgl-project/sglang/issues/1487), SGLang plans to move vLLM to optional dependencies and use flashinfer as the main dependency.
I am working on mo…
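The usual shape of such an "optional dependency" in Python is an import guarded at runtime, so the package degrades gracefully when the extra is not installed. A generic sketch (`optional_import` is a hypothetical helper name, not SGLang's actual code):

```python
import importlib

def optional_import(module_name):
    """Return the module if it is installed, else None.

    This is the common pattern behind optional dependencies: the package
    declares the extra in its metadata, and callers branch on availability
    instead of importing unconditionally and failing at import time.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        return None

# Callers check availability rather than importing at module top level:
vllm = optional_import("vllm")
USE_VLLM = vllm is not None
```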