-
Dear torchtitan team, I have a question regarding gradient norm clipping when using pipeline parallelism (PP) potentially combined with `FSDP/DP/TP`.
For simplicity, let's assume each process/GPU h…
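For reference, the usual recipe when parameters are split across ranks is: each rank computes the squared norm of the gradients it owns, the squared norms are summed across ranks (an `all_reduce(SUM)` in practice), and every rank derives the same clip coefficient from the global norm. A minimal single-process sketch with a hypothetical helper (the `all_reduce` is simulated by a plain `sum`):

```python
import math

def global_clip_coef(local_sq_norms, max_norm, eps=1e-6):
    """Combine per-rank squared gradient norms into one clip coefficient.

    In a real PP/FSDP/TP setup, sum(local_sq_norms) would be an
    all_reduce(SUM) over the ranks owning disjoint parameter shards.
    """
    total_norm = math.sqrt(sum(local_sq_norms))
    coef = min(1.0, max_norm / (total_norm + eps))
    return coef, total_norm

# Two ranks holding shards with squared norms 9 and 16: global norm is 5,
# so with max_norm=1 every rank scales its gradients by ~0.2.
coef, norm = global_clip_coef([9.0, 16.0], max_norm=1.0)
```

Each rank then multiplies its local gradients by `coef`; because every rank sees the same global norm, the scaling stays consistent across the parallelism dimensions.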
-
When I am trying to train a model with FSDP, I get the following error:
*** TypeError: isinstance() arg 2 must be a type, a tuple of types, or a union
It happens on this specific line:
…
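For context, that `TypeError` fires whenever the second argument to `isinstance` is not a type, e.g. when a class lookup resolved to `None`. A minimal, hypothetical repro (unrelated to the exact FSDP line, which is truncated above):

```python
# Hypothetical: an auto-wrap layer class that failed to resolve.
layer_cls = None

try:
    isinstance(object(), layer_cls)
except TypeError as err:
    # e.g. "isinstance() arg 2 must be a type, a tuple of types, or a union"
    message = str(err)
```

A common cause in FSDP setups is passing `None` (or a string) where a class is expected, such as in `transformer_auto_wrap_policy`'s set of layer classes.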
-
## Issue description
When I run distributed training and simply set `CUDA_VISIBLE_DEVICES` in each rank:
- Running `torch.distributed.barrier()` makes rank 1 occupy GPU memory on the GPU of rank 0…
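One common fix (a sketch, not verified against this exact setup) is to pin each process to its GPU before the first collective and pass `device_ids` to the barrier, so NCCL does not create a default context on device 0:

```python
import torch
import torch.distributed as dist

def setup_rank(local_rank: int) -> None:
    # Pin this process to its own GPU *before* init / any NCCL collective;
    # otherwise the default CUDA context (and some workspace memory) can
    # land on device 0 for every rank.
    torch.cuda.set_device(local_rank)
    dist.init_process_group(backend="nccl")
    # device_ids keeps the barrier's work on this rank's GPU.
    dist.barrier(device_ids=[local_rank])
```

With `CUDA_VISIBLE_DEVICES` set per rank, `local_rank` would typically be 0 inside each process; the key point is calling `torch.cuda.set_device` before any collective runs.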
-
Some transforms, notably the FSDP and TensorParallel ones, change shapes but currently do not propagate those updates everywhere (the linear that follows is updated, but the activation etc. are not).
We might con…
-
FutureWarning: using `--fsdp_transformer_layer_cls_to_wrap` is deprecated. Use fsdp_config instead
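One way to address the warning (a sketch; the exact key name and accepted values depend on the `transformers` version) is to move the setting into an `fsdp_config` mapping instead of the CLI flag:

```python
# Hypothetical replacement for the deprecated CLI flag; "BertLayer" is
# only an example transformer-layer class name, not taken from the source.
fsdp_config = {
    "transformer_layer_cls_to_wrap": ["BertLayer"],
}
# This dict (or an equivalent JSON file) would then be passed as the
# fsdp_config argument of TrainingArguments.
```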
-
Running the script [3.test_cases/10.FSDP/1.distributed-training.sbatch](https://github.com/aws-samples/awsome-distributed-training/blob/main/3.test_cases/10.FSDP/1.distributed-training.sbatch) on 2 p5…
-
## ❓ Questions and Help
FSDP can be expressed well in SPMD, but HSDP does not seem to be expressible. Is there any way to express HSDP in SPMD?
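HSDP can usually be written in SPMD as a 2D device mesh: replicate across one axis (plain DP between groups) and shard across the other (FSDP within a group). A sketch, with assumed axis names and an assumed 4×2 split; the `xla_force_host_platform_device_count` flag just fakes 8 CPU devices so the snippet runs anywhere:

```python
import os
os.environ["XLA_FLAGS"] = "--xla_force_host_platform_device_count=8"

import numpy as np
import jax
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# 2D mesh: "replica" behaves like DP across groups, "shard" like FSDP
# within a group.
devices = np.array(jax.devices()).reshape(4, 2)
mesh = Mesh(devices, axis_names=("replica", "shard"))

# Parameters sharded along "shard" and implicitly replicated along
# "replica" -- which is exactly the HSDP layout.
param_sharding = NamedSharding(mesh, P("shard"))
```

Gradients would then be reduced over both axes, while parameter all-gathers only run over the "shard" axis.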
-
I have 6 4090 GPUs (VRAM = 120GB). However, when I try to finetune the model, I get a "CUDA out of memory" error.
How much VRAM is needed to train the ViT backbone model? I want to know how many GPU…
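As a rough, model-independent estimate (a rule of thumb, not a measurement for this specific ViT): full-precision AdamW training needs about 16 bytes per parameter for weights, gradients, and the two optimizer moments, before counting activations:

```python
def adamw_state_gib(n_params: float) -> float:
    # 4 B weights + 4 B grads + 8 B Adam moments = 16 B per parameter;
    # activation memory comes on top and often dominates for ViTs
    # trained with large batch sizes.
    return n_params * 16 / 2**30

# e.g. a hypothetical 300M-parameter backbone:
state_gib = adamw_state_gib(300e6)  # ~4.5 GiB before activations
```

Mixed precision, gradient checkpointing, and sharding the optimizer state (FSDP/ZeRO) all reduce the per-GPU footprint substantially.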
-
Currently our FSDP implementation uses JAX's sharding stuff, which requires that the embed axis be divisible by the number of devices (or really data axis size)
Usually this is fine, but recently @…
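When the embed axis does not divide evenly, one workaround is to pad it up to the next multiple of the data-axis size (a sketch with a hypothetical helper; the padded columns would be masked or simply ignored):

```python
def pad_to_multiple(dim: int, data_axis_size: int) -> int:
    # Round the embed axis up so it divides evenly across devices,
    # satisfying the sharding divisibility requirement.
    return -(-dim // data_axis_size) * data_axis_size

padded = pad_to_multiple(4097, 8)  # -> 4104
```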
-
@svenstaro
I would like to ask why my GPU memory usage is lower in DP mode than in FSDP mode.
`model = DP(model)`
`model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy,`
…
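One thing worth checking (a sketch, assuming PyTorch's built-in `size_based_auto_wrap_policy`; the 1M threshold is an arbitrary example): if the auto-wrap policy effectively wraps the model as a single FSDP unit, FSDP must all-gather every parameter at once, and its peak memory can exceed plain DP. Finer-grained wrapping materializes only one unit's full parameters at a time:

```python
import functools
from torch.distributed.fsdp.wrap import size_based_auto_wrap_policy

# Wrap any submodule with >= 1M parameters as its own FSDP unit, so only
# one unit's full parameters are gathered at a time during forward/backward.
my_auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy, min_num_params=1_000_000
)
# model = FSDP(model, auto_wrap_policy=my_auto_wrap_policy)
```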