-
This is a bit of a technical challenge and/or question. Both I-JEPA and V-JEPA use DDP rather than FSDP. Since DDP replicates the full model on every GPU, this puts an inherent cap on the size of the models that can be used: the memory of a single GPU.
I'm …
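To make the ask concrete, here is a hedged sketch of what switching from DDP to an FSDP wrap could look like, so that parameters, gradients, and optimizer state are sharded across ranks instead of replicated. The model constructor and the wrap policy threshold are placeholders, not the actual JEPA code:

```python
# Hedged sketch: FSDP wrapping instead of DDP. Module names and the
# min_num_params threshold are placeholders, not the real JEPA modules.
import functools
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

dist.init_process_group("nccl")  # typically launched via torchrun
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = build_encoder()  # placeholder for the actual model constructor

# Shard any submodule above ~1M parameters; tune for the real architecture.
wrap_policy = functools.partial(size_based_auto_wrap_policy, min_num_params=int(1e6))
model = FSDP(model.cuda(), auto_wrap_policy=wrap_policy)
```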
-
I am trying to scale fine-tuning from a single GPU up to multi-node distributed training for the Llama3-70B and Llama3-8B models.
Below is my training configuration; a code sketch of it follows the list:
SFT (Llama3 8B & 70B)
Epochs: 3
Gradient Accumulatio…
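For concreteness, here is a hedged sketch of how this configuration might map onto Hugging Face `TrainingArguments`. The gradient-accumulation value is a placeholder (the real number is cut off above), and the FSDP settings themselves would normally live in a separate accelerate/FSDP config:

```python
# Hedged sketch of the SFT configuration above as transformers TrainingArguments.
# gradient_accumulation_steps is a placeholder; the real value is truncated above.
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="llama3-sft",          # placeholder path
    num_train_epochs=3,               # "Epochs: 3" from the config above
    gradient_accumulation_steps=8,    # placeholder
    per_device_train_batch_size=1,    # assumption, chosen for the 70B case
    bf16=True,                        # assumption, common on A100/H100
)
```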
-
### 🐛 Describe the bug
When trying to train both the LoRA layers on the base model and also setting `modules_to_save` on the LoRA config, which makes the embedding layers trainable (my assumption is it also ap…
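For reference, a minimal sketch of the kind of setup I mean, assuming PEFT's `LoraConfig`; the model name, rank, and target modules are illustrative only:

```python
# Hedged sketch: LoRA adapters plus fully-trainable embedding/output layers
# via modules_to_save. Model name and hyperparameters are illustrative.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")  # assumption
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],          # illustrative
    modules_to_save=["embed_tokens", "lm_head"],  # makes these layers fully trainable
)
model = get_peft_model(model, config)
```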
-
### System Info
- `transformers` version: 4.45.0.dev0
- Platform: Linux-5.15.0-1027-gcp-x86_64-with-glibc2.31
- Python version: 3.9.19
- Huggingface_hub version: 0.24.5
- Safetensors version: 0…
-
### System Info
- `transformers` version: 4.40.1
- Platform: Linux-5.15.148.2-2.cm2-x86_64-with-glibc2.35
- Python version: 3.10.2
- Huggingface_hub version: 0.23.0
- Safetensors version: 0.4.2…
-
Hi team, great work!
QDoRA seems to perform better than QLoRA; see [Efficient finetuning of Llama 3 with FSDP QDoRA](https://www.answer.ai/posts/2024-04-26-fsdp-qdora-llama3.html).
I wonder w…
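If it helps frame the question: QDoRA roughly amounts to a 4-bit quantized base model plus DoRA-style adapters. A hedged sketch with current Hugging Face APIs, assuming a `peft` version recent enough to have the `use_dora` flag; all hyperparameters are illustrative:

```python
# Hedged sketch: QDoRA ~= 4-bit quantized base model + DoRA adapters.
# Model name and hyperparameters are illustrative, not a recommendation.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B", quantization_config=bnb  # assumption
)
config = LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    use_dora=True,  # DoRA on top of the quantized base
)
model = get_peft_model(model, config)
```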
-
### 🐛 Describe the bug
Flex attention under FSDP works without `torch.compile`, but not with it. The key error seems to be `ValueError: Pointer argument (at 2) cannot be accessed from Triton (cpu tensor?)`…
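A minimal repro sketch of the setup I mean, assuming PyTorch >= 2.5 (where `flex_attention` lives in `torch.nn.attention.flex_attention`) and a `torchrun` launch; shapes and the projection layer are arbitrary:

```python
# Hedged repro sketch: flex_attention inside an FSDP-wrapped module,
# compiled with torch.compile. Launch with torchrun; shapes are arbitrary.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.nn.attention.flex_attention import flex_attention

class Attn(torch.nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.proj = torch.nn.Linear(dim, dim)  # gives FSDP something to shard

    def forward(self, q, k, v):
        return self.proj(flex_attention(q, k, v))

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = FSDP(Attn().cuda())
model = torch.compile(model)  # runs fine without this line, fails with it

q = k = v = torch.randn(2, 8, 128, 64, device="cuda")  # (B, H, S, D)
out = model(q, k, v)
```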
-
### 🚀 The feature, motivation and pitch
In FSDP1 there is the `FSDP.summon_full_params` [function](https://pytorch.org/docs/stable/fsdp.html#torch.distributed.fsdp.FullyShardedDataParallel.summon_ful…
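For context, this is the FSDP1 pattern I would like an FSDP2 equivalent for; a minimal usage sketch, with model construction left as a placeholder:

```python
# FSDP1 pattern: temporarily gather the full (unsharded) parameters on each
# rank, e.g. to inspect weights or compute a consolidated statistic.
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

model = FSDP(build_model().cuda())  # build_model is a placeholder

with FSDP.summon_full_params(model, writeback=False):
    # Inside this block every parameter is unsharded on this rank.
    total = sum(p.numel() for p in model.parameters())
    print(f"full parameter count: {total}")
```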
-
Thanks for the excellent work!
When I try to train a 4-step SDXL model (2 nodes, 16 GPUs), I get an error:
`[rank2]: Traceback (most recent call last):
[rank2]: File "/mnt/nas/gaohl/project/DMD2-mai…
-
### ❓ The question
Quick question: is there an example script and YAML file that turns off FSDP completely? (I want to use DDP.)
I am running it with a 7B model on an A100 80GB. I guess this w…
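In case it clarifies what I'm after, a framework-agnostic sketch of the plain-DDP setup I'd like the script/config to produce; the model constructor is a placeholder:

```python
# Hedged sketch: plain DDP (full model replica per GPU) instead of FSDP.
# Optimizer-state memory is the main constraint at 7B on a single 80GB card.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group("nccl")  # launched via torchrun
local_rank = dist.get_rank() % torch.cuda.device_count()
torch.cuda.set_device(local_rank)

model = build_model().to(torch.bfloat16).cuda()  # build_model is a placeholder
model = DDP(model, device_ids=[local_rank])
```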