-
[GPTQ](https://arxiv.org/abs/2210.17323) is currently the SOTA one-shot quantization method for LLMs.
GPTQ supports remarkably low 3-bit and 4-bit weight quantization, and it can be applied to LLaMA.
…
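For intuition, here is a minimal sketch of plain round-to-nearest 4-bit grouped weight quantization, the baseline that GPTQ improves on; the function name and group size are hypothetical, and GPTQ itself goes further by using approximate second-order information to compensate rounding error column by column:

```python
import torch

def quantize_rtn_4bit(weight: torch.Tensor, group_size: int = 128) -> torch.Tensor:
    """Toy round-to-nearest 4-bit weight quantization, grouped along the input dim.

    Illustration only: GPTQ additionally uses Hessian-based error compensation,
    which is what makes 3/4-bit one-shot quantization accurate in practice.
    """
    out_features, in_features = weight.shape
    w = weight.reshape(out_features, in_features // group_size, group_size)
    # One scale per group, mapping the max-magnitude value onto the int4 range [-8, 7].
    scale = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = (w / scale).round().clamp(-8, 7)                    # fake-quantized integer grid
    return (q * scale).reshape(out_features, in_features)   # dequantized back to float

w = torch.randn(512, 512)
w_q = quantize_rtn_4bit(w)
print(f"mean abs quantization error: {(w - w_q).abs().mean().item():.4f}")
```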
-
### System Info
```shell
- `transformers` version: 4.20.0.dev0
- Platform: Linux-5.13.0-44-generic-x86_64-with-glibc2.29
- Python version: 3.8.10
- Huggingface_hub version: 0.5.1
- PyTorch versio…
```
-
### Branch/Tag/Commit
main
### Docker Image Version
nvcr.io/nvidia/pytorch:22.07-py3
### GPU name
T4
### CUDA Driver
470.57.02
### Reproduced Steps
```shell
Background: I t…
```
-
### Description
When using a GPT model on a T4 GPU with the Triton server, setting request_prompt_lengths causes the previous inference's response to leak.
In the second request, the response contains th…
-
**Describe the bug**
While training on The Pile, I was getting errors from sparse attention claiming that the sequence length wasn't divisible by the block size, despite using a sequence length of…
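As context, block-sparse attention kernels typically tile the sequence into fixed-size blocks, which is where the divisibility requirement comes from. A minimal, hypothetical padding workaround (the helper name and the block size of 16 are assumptions, not gpt-neox's actual configuration):

```python
import torch
import torch.nn.functional as F

def pad_to_block_multiple(hidden: torch.Tensor, block_size: int = 16) -> torch.Tensor:
    """Right-pad the sequence dimension so that seq_len % block_size == 0.

    hidden: (batch, seq_len, dim). Any padded positions would also need to be
    masked out of the attention computation, which this sketch omits.
    """
    seq_len = hidden.shape[1]
    remainder = seq_len % block_size
    if remainder == 0:
        return hidden
    pad = block_size - remainder
    # F.pad pads trailing dims first: (dim_left, dim_right, seq_left, seq_right)
    return F.pad(hidden, (0, 0, 0, pad))

x = torch.randn(2, 2047, 768)          # hypothetical: one token short of a multiple
print(pad_to_block_multiple(x).shape)  # torch.Size([2, 2048, 768])
```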
-
I am trying to train the model and when I run:
`python ./deepy.py train.py ./configs/small.yml ./configs/local_setup.yml`
I get the error:
`NeoXArgs.from_ymls() ['./configs/small.yml', './configs/l…
-
**Describe the bug**
Unable to convert custom gpt-neox model checkpoints (trained with ZeRO stage 3) using the zero_to_fp32.py script.
**To Reproduce**
Train a model with ZeRO stage 3, pp=0, mp=1 (haven't …
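For reference, a minimal sketch of the two usual ways to consolidate a ZeRO checkpoint into a single fp32 state dict; the path is a placeholder, and this does not reproduce the gpt-neox failure itself:

```python
# DeepSpeed saves a copy of zero_to_fp32.py alongside the checkpoint; the
# documented CLI usage is:
#   python zero_to_fp32.py /path/to/checkpoints pytorch_model.bin
#
# The same consolidation is available from Python:
from deepspeed.utils.zero_to_fp32 import get_fp32_state_dict_from_zero_checkpoint

# "/path/to/checkpoints" stands in for the training run's save directory.
state_dict = get_fp32_state_dict_from_zero_checkpoint("/path/to/checkpoints")
print(sorted(state_dict)[:5])  # spot-check a few consolidated parameter names
```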
-
I tried WebLLM the other week and was really blown away. I have an Intel macOS system with an AMD 6900XT GPU, and using WebLLM was the first time I'd had decent GPU inference on this system.
Now I'd lo…
-
Hi,
I used examples/pytorch/gptneox/utils/eleutherai_gpt_neox_convert.py to convert a GPT-NeoX model, but it did not produce a model.wpe.bin. However, a model.wpe.bin file appears to be required when running inference with gpt_…
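One hypothetical diagnostic, assuming a standard Hugging Face checkpoint as the conversion input: list any position-embedding-like weights in the source state dict. GPT-NeoX uses rotary position embeddings, so a learned wpe table may simply not exist in the source model:

```python
import torch

# "pytorch_model.bin" is a placeholder for the source checkpoint being converted.
sd = torch.load("pytorch_model.bin", map_location="cpu")
pos_keys = [k for k in sd if "wpe" in k or "rotary" in k or "pos" in k.lower()]
print(pos_keys)  # expect rotary-embedding buffers rather than a wpe weight table
```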
-
### Discussed in https://github.com/triton-inference-server/fastertransformer_backend/discussions/48
Originally posted by **SnoozingSimian** September 22, 2022
While loading both GPTJ and GPT-…