-
This is a bit of a technical challenge and/or question. Both I-JEPA and V-JEPA use DDP rather than FSDP. This puts an inherent cap on the size of the models that can be trained: the memory of a single GPU.
I'm …
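For a rough sense of what the switch would involve (this is only a sketch with a placeholder `Block` class and a generic wrapping policy, not the I-JEPA/V-JEPA code), PyTorch's FSDP shards each wrapped module's parameters across ranks instead of replicating the whole model the way DDP does:

```
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

class Block(nn.Module):  # stand-in for a transformer block, not the real encoder
    def __init__(self, dim=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        return x + self.mlp(x)

# run under torchrun so the process group exists
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

encoder = nn.Sequential(*[Block() for _ in range(24)]).cuda()

# DDP keeps a full copy of the model on every GPU:
#   model = nn.parallel.DistributedDataParallel(encoder)
# FSDP instead shards each wrapped Block's parameters across ranks,
# so a single GPU no longer has to hold the entire model.
model = FSDP(encoder, auto_wrap_policy=ModuleWrapPolicy({Block}))
```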
-
### Describe the bug
I run the training but get this error
### Reproduction
Run `accelerate config`
```
compute_environment: LOCAL_MACHINE
debug: true
distributed_type: FSDP
downcast_bf16: '…
```
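For reference, the same FSDP settings can also be passed programmatically instead of through `accelerate config`. This is only a minimal sketch with assumed values, not the configuration from the report, and the plugin's field names and accepted value types can vary between accelerate versions:

```
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import ShardingStrategy

# Hypothetical plugin settings (not the reporter's config).
fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard params, grads, optimizer state
    state_dict_type="FULL_STATE_DICT",              # gather a full state dict on save
)
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
# launch with `accelerate launch` so the distributed environment is set up, then:
# model, optimizer, loader = accelerator.prepare(model, optimizer, loader)
```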
-
The old codepath is not composable with other transforms, does not offer gathering of state dicts as easily, etc.
Removing it, of course, depends on NVIDIA benchmarking not needing it. I think we (@crc…
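On the state-dict point: with PyTorch's own FSDP, the usual way to gather a full, unsharded state dict is the `state_dict_type` context manager. A minimal sketch, independent of the codepaths being discussed here:

```
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    FullStateDictConfig,
    StateDictType,
)

# run under torchrun; the model here is a toy placeholder
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
model = FSDP(nn.Linear(1024, 1024).cuda())

# Gather the complete, unsharded weights, offloaded to CPU and only on rank 0,
# so saving does not blow up GPU memory on every rank.
cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
    full_sd = model.state_dict()
if dist.get_rank() == 0:
    torch.save(full_sd, "checkpoint.pt")
```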
-
[torch-neuronx] FSDP support - Distributed Training on Trn1
-
FSDP2 supports all-gather using FP8:
https://discuss.pytorch.org/t/distributed-w-torchtitan-enabling-float8-all-gather-in-fsdp2/209323
Wondering if we could do this directly using TransformerEngine …
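For reference, the route in the linked post goes through torchao's float8 support rather than TransformerEngine. A rough sketch, assuming recent torch/torchao versions (the `fully_shard` module path has moved between releases, and the toy model here is an assumption):

```
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed._composable.fsdp import fully_shard          # FSDP2
from torchao.float8 import Float8LinearConfig, convert_to_float8_training

# run under torchrun on FP8-capable GPUs (H100-class)
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(*[nn.Linear(4096, 4096, bias=False) for _ in range(8)]).cuda().bfloat16()

# Swap nn.Linear for Float8Linear and ask FSDP2 to communicate the weights in
# FP8 so the all-gather itself runs at 8 bits.
config = Float8LinearConfig(enable_fsdp_float8_all_gather=True)
convert_to_float8_training(model, config=config)

# Shard each layer and then the root module with FSDP2's fully_shard.
for layer in model:
    fully_shard(layer)
fully_shard(model)
```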
-
### Bug description
When using the FSDP strategy with HYBRID_SHARD set, the loss behaves as if only one node is training. When it is set to FULL_SHARD, etc., the loss drops as expected when more nodes a…
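For context, HYBRID_SHARD is supposed to shard within each node and replicate across nodes (with a DDP-style gradient all-reduce between the replicas), so a loss curve that looks like single-node training suggests the inter-node replication is not taking effect. A minimal raw-PyTorch sketch of the setting (not the reporter's trainer code):

```
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# run under torchrun across two or more nodes
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).cuda()

# HYBRID_SHARD: shard parameters inside each node, replicate across nodes and
# all-reduce gradients between the replicas (FULL_SHARD shards across all ranks).
model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```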
-
I'm running nccl-test `all-reduce` between two nodes, and I've found that the tree algorithm performs much better than the ring algorithm. However, through reading the NCCL source code, I noticed tha…
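Not the nccl-tests binary itself, but one way to sanity-check the same comparison from PyTorch is to pin `NCCL_ALGO` before the communicator is created and time the collective. A rough sketch with assumed message size and iteration counts:

```
import os
import time
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_ALGO", "Tree")   # or "Ring"; must be set before NCCL init
dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank % torch.cuda.device_count())

x = torch.randn(256 * 1024 * 1024 // 4, device="cuda")  # ~256 MB of fp32
for _ in range(5):                                       # warm-up
    dist.all_reduce(x)
torch.cuda.synchronize()

t0 = time.perf_counter()
for _ in range(20):
    dist.all_reduce(x)
torch.cuda.synchronize()
if rank == 0:
    ms = (time.perf_counter() - t0) / 20 * 1e3
    print(f"NCCL_ALGO={os.environ['NCCL_ALGO']}: {ms:.2f} ms per all_reduce")
dist.destroy_process_group()
```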
-
### 🐛 Describe the bug
Simple compilation of the UNet model works fine, but the FSDP-wrapped UNet gets recompiled on every block. In a real setup the cache-size limit is rapidly reached.
Code:
```
import argp…
```
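Since the original script is cut off above, here is a hypothetical minimal stand-in (a toy `TinyUNet`, not the reporter's model) showing the pattern being described: the plain model compiles as one unit, but once every block sits in its own FSDP unit, compilation proceeds block by block and the cache-size limit is hit quickly.

```
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp.wrap import ModuleWrapPolicy

class Block(nn.Module):                     # toy stand-in for a UNet block
    def __init__(self, ch=64):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch, 3, padding=1)

    def forward(self, x):
        return torch.relu(self.conv(x)) + x

class TinyUNet(nn.Module):                  # hypothetical, not the real UNet
    def __init__(self, ch=64, depth=8):
        super().__init__()
        self.blocks = nn.ModuleList(Block(ch) for _ in range(depth))

    def forward(self, x):
        for block in self.blocks:
            x = block(x)
        return x

# run under torchrun
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = TinyUNet().cuda()

# Compiling the plain model produces a single compiled unit:
#   compiled = torch.compile(model)
# Wrapping every Block in its own FSDP unit before compiling is the setup in
# which per-block recompilations are being reported.
model = FSDP(model, auto_wrap_policy=ModuleWrapPolicy({Block}), use_orig_params=True)
compiled = torch.compile(model)
out = compiled(torch.randn(2, 64, 32, 32, device="cuda"))
```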
-
When adding the NVMe drives, we changed which cards are installed in nodes 01 and 02, and also removed the bifurcating PCI-e card from the Mellanox cards in nodes 09 and 10. We need to update the mach…
-
I was wondering if PyTorch's FullyShardedDataParallel (FSDP) is supported by TransformerEngine, especially whether FP8 can work with FSDP. Thank you in advance.
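Not an authoritative answer, but the combination being asked about would look roughly like the sketch below: a model built from `te.Linear` layers, wrapped in FSDP, with the forward pass run under `te.fp8_autocast`. Whether the FP8 scaling state interacts correctly with FSDP's sharding is exactly the open question, so treat this as an illustration of the setup rather than a confirmation that it works; the model shape and recipe values are assumptions.

```
import torch
import torch.distributed as dist
import torch.nn as nn
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# run under torchrun on FP8-capable GPUs (H100 etc.)
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

model = nn.Sequential(*[te.Linear(4096, 4096, bias=True) for _ in range(4)]).cuda()
model = FSDP(model)                      # shard the TE modules like any nn.Module

recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16, amax_compute_algo="max")
x = torch.randn(8, 4096, device="cuda")

# FP8 is enabled per forward pass via the autocast context, not at wrap time.
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = model(x)
y.sum().backward()
```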