-
I tested the inference speed of LLaMA-7B with bitsandbytes-0.40 on an A100-80G. I found that the speed of `nf4` has greatly improved compared to QLoRA. However, the speed of `nf4` is still slower than `fp1…
-
I attempted to fine-tune a 6-billion-parameter model using 8 A100 GPUs, but the training process was interrupted. On the first attempt, it stopped at 0.15 epochs, and on the second attempt, …
-
The current Lambda zip deployment has a size limit of 250 MB, which prevents the use of large pre-trained models in the similarity engine's Lambda deployment on the AWS cloud environment. After research, I will e…
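One common way around the 250 MB zip limit, in case it is the route being explored: AWS Lambda also supports container-image deployments with a 10 GB image size limit, so a large pre-trained model can be baked into the image. A minimal sketch (the `model/`, `app.py`, and `requirements.txt` paths are placeholders for this project's actual layout):

```dockerfile
# Hypothetical sketch: package the similarity engine as a Lambda container image
FROM public.ecr.aws/lambda/python:3.11

# Install Python dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Bake the large pre-trained model and handler code into the image
COPY model/ ${LAMBDA_TASK_ROOT}/model/
COPY app.py ${LAMBDA_TASK_ROOT}/

# Lambda invokes the handler as "module.function"
CMD ["app.handler"]
```

The image is pushed to ECR and the Lambda function is created from it, which sidesteps the zip size limit entirely.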
-
### Context
This issue proposes adding a test to the [post-training compression conformance suite](https://github.com/openvinotoolkit/nncf/blob/develop/tests/post_training/README.md) to verify that t…
-
### System Info
Hi, I am using `LLMChainFilter.from_llm(llm)`, but while running it, I am getting this error:
ValueError: BooleanOutputParser expected output value to either be YES or NO. Received Yes, …
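For what it's worth, the error message suggests the parser is doing an exact string match against `YES`/`NO` while the model answered `Yes`. A minimal, library-independent sketch of the case-insensitive comparison that would avoid this (the `parse_boolean` helper is illustrative, not LangChain's actual API):

```python
def parse_boolean(text: str) -> bool:
    """Parse an LLM yes/no answer, tolerating case and surrounding whitespace."""
    cleaned = text.strip().upper()
    if cleaned == "YES":
        return True
    if cleaned == "NO":
        return False
    raise ValueError(f"Expected YES or NO, received: {text!r}")

print(parse_boolean("Yes"))  # accepts mixed case instead of raising
```

A prompt-side workaround is also possible: instruct the model to answer with exactly `YES` or `NO` in uppercase.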
-
Thanks for the great work!
I want to recommend a new KD method: weight inheritance.
Name: Weight-Inherited Distillation for Task-Agnostic BERT Compression
code: https://github.com/wutai…
-
### Bug Description
The llamaindex RAG demo is no longer functioning properly due to significant changes in library calls after updating llamaindex to version 0.10. Could you help me troubleshoot whe…
-
I tried to run the LLaMA model with two A6000 96GB cards and two GV100 cards, but CUDA throws an error.
Single-card BERT runs fine, but as soon as I switch to two cards, the error starts in the `source_embedding` forward pass:
source_embeddings = self.mapping_layer(self.word_embeddings.permute(1, 0)).permute(1, 0)
The error is as follows. Has anyone encountered…
-
### Your current environment
The output of `python collect_env.py`
```text
Collecting environment information...
PyTorch version: 2.4.0+cu121
Is debug build: False
CUDA used to build PyTorch…
-
We currently have about 7,500 hours of oral argument audio without transcriptions. We need to go through these audio files and run a speech-to-text tool on them. This would have massive benefits:
- Ale…
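A possible shape for the batch job, sketched with a pluggable `transcribe_fn` so any speech-to-text backend (e.g. a Whisper wrapper) can be dropped in; the function and directory names here are illustrative, not an existing tool in this repository:

```python
from pathlib import Path
from typing import Callable

def transcribe_missing(audio_dir: Path, out_dir: Path,
                       transcribe_fn: Callable[[Path], str]) -> int:
    """Transcribe every audio file that does not yet have a .txt transcript.

    Returns the number of files transcribed. Skipping already-done files
    makes a 7,500-hour batch job safe to resume after interruptions.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    done = 0
    for audio in sorted(audio_dir.glob("*.mp3")):
        transcript_path = out_dir / (audio.stem + ".txt")
        if transcript_path.exists():
            continue  # already processed in a previous run
        transcript_path.write_text(transcribe_fn(audio))
        done += 1
    return done
```

Because each transcript is written as its own file, the job can be parallelized across machines by sharding the audio file list.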