-
### 🚀 The feature, motivation and pitch
Libraries like Transformers, vllm and diffusers use large quantized LLMs for inference and fine-tuning. When running large models on several low-memory GPU…
-
Using both `flash_attn_varlen_qkvpacked_func` and `CheckpointImpl.NO_REENTRANT` together raises the RuntimeError below:
```python
Traceback (most recent call last):
> File "/opt/tiger/antelope/train.py", line …
-
### Bug description
I find that when using the FSDP strategy, the model parameters and gradients are not logged by WandB. However, everything works well if I switch from FSDP to the native DDP strategy.
Since …
-
Thanks for the great work and promising performance in model training. Are you considering applying and simplifying burst-attention for model inference? What gaps are there compared to ring attention with FS…
-
How can I run inference on a batch of encoded tensors (shape = (B, T)) across 4 GPUs and get 3~4x the tokens/s throughput of a single GPU? (It's for a small model that fits into a single GPU's memory.)
I've tri…
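Independent of the truncated attempt above, one common data-parallel pattern is to give each rank a contiguous shard of the batch, run the model locally, and gather results afterwards. A minimal sketch of the sharding step (the helper name `shard_batch` is mine, not from the issue; the gather step, e.g. via `torch.distributed.all_gather_object`, is omitted):

```python
def shard_batch(batch, world_size, rank):
    """Split a batch (a list of encoded sequences) into near-equal
    contiguous shards, one per rank. Trailing ranks may get fewer
    (or zero) items when len(batch) is not divisible by world_size."""
    per_rank = -(-len(batch) // world_size)  # ceil division
    return batch[rank * per_rank : (rank + 1) * per_rank]

# Example: B = 10 sequences split across 4 ranks -> shard sizes 3, 3, 3, 1.
batch = [f"seq{i}" for i in range(10)]
shard_sizes = [len(shard_batch(batch, 4, r)) for r in range(4)]
print(shard_sizes)  # [3, 3, 3, 1]
```

Each rank would then run a forward pass on its own shard only, so throughput scales with the number of GPUs minus the gather overhead.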
-
Run DDP with a shared buffer (different TorchDynamo `Source`):
Repro Script
```
"""
torchrun --standalone --nproc_per_node=1 test/dup_repro.py
TORCH_LOGS=aot,dynamo torchrun --standalone --…
awgu updated 6 months ago
-
I'm trying to fine-tune Llama2 70B on an NVIDIA A100 with 80 GB, but even with batch_size = 1 I'm getting an OOM error.
I'm using LoRA with quantization this way: `plugins = BitsandbytesPrecision('nf4…
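For context, a back-of-envelope estimate of where the memory goes with nf4 base weights and bf16 LoRA adapters; every ratio below is an assumption of mine, not a measurement from the issue:

```python
# Rough memory estimate for a 70B-parameter model quantized to nf4
# (4 bits = 0.5 byte per weight) with a small set of LoRA adapters.
PARAMS = 70e9
GIB = 1024**3

base_weights = PARAMS * 0.5 / GIB            # ~32.6 GiB of frozen 4-bit weights
lora_params = 0.002 * PARAMS                 # assume ~0.2% of params are trainable
lora_weights = lora_params * 2 / GIB         # bf16 adapters: 2 bytes/param
# Adam keeps two fp32 moment buffers plus fp32 grads, but only for
# the trainable LoRA params.
optimizer_state = lora_params * (4 + 4 + 4) / GIB

total = base_weights + lora_weights + optimizer_state
print(round(total, 1))  # ~34.4 GiB before activations
```

Since the static footprint is well under 80 GB, the OOM likely comes from activation memory (sequence length x batch), so gradient checkpointing or shorter sequences are the usual levers.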
-
I set the environment variables as follows in train_dist.sh in the gpt_hf folder:
```
export NUM_NODES=1
export NUM_GPUS_PER_NODE=8
export MASTER_ADDR=localhost
export MASTER_PORT=2222
export NODE_RA…
-
With llama factory I can fine-tune the llama3-8B model via SFT using deepspeed zero2, but in this framework, even with the batch size set to 1, deepspeed zero2 still reports OOM.
Training with zero3 becomes very slow, and this message appears:
2 pytorch allocator cache flushes since last step. this happens when there is hi…
-
### 🚀 The feature, motivation and pitch
Skip decorators such as `skip_if_lt_x_gpu(2)` are currently not properly handled; this was discovered as part of https://github.com/pytorc…
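For readers unfamiliar with the decorator being discussed, a minimal sketch of what a `skip_if_lt_x_gpu`-style decorator does; this is my illustration, not PyTorch's actual implementation (the real one checks `torch.cuda.device_count()` — here the count is injectable so the sketch runs without GPUs):

```python
import functools
import unittest


def skip_if_lt_x_gpu(x, device_count=0):
    """Sketch: skip the decorated test when fewer than x GPUs are visible.
    `device_count` is a hypothetical parameter for illustration only."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if device_count < x:
                raise unittest.SkipTest(f"needs at least {x} GPUs")
            return fn(*args, **kwargs)
        return wrapper
    return decorator


@skip_if_lt_x_gpu(2, device_count=0)
def test_distributed():
    pass  # body never runs with 0 visible GPUs


try:
    test_distributed()
except unittest.SkipTest as e:
    print("skipped:", e)  # skipped: needs at least 2 GPUs
```

The handling problem described in the issue is about test infrastructure recognizing the `SkipTest` raised by such wrappers rather than treating it as a failure.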