Open · abdulmuneer opened this issue 4 months ago
Hi, does Determined support the PyTorch FSDP way of distributed training? I can see examples for DeepSpeed, but I have a requirement to specifically use the native FSDP feature of PyTorch 2.2 (something like https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html?highlight=pre%20training).

Hello, we haven't added it here yet, but there's an unofficial example here: https://github.com/garrett361/determined/tree/scratchwork/scratchwork/fsdp_min

For context, PyTorchTrial does not support FSDP and there are no plans to add it. For FSDP, you should use the Core API instead, and it's effectively the same as torch DDP: the standard torch distributed launcher works the same, and metrics logging and hyperparameter search work the same. If you checkpoint the full model from rank 0, that works the same as well. If you want sharded checkpointing, use the shard=True option when storing the checkpoint.
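For reference, here is a minimal sketch of what that Core API + FSDP loop might look like, assuming a torchrun-style launcher and a toy model/data in place of your real ones; the checkpoint layout is just one way to do it, not an official recipe:

```python
# Minimal sketch: Determined Core API + native PyTorch FSDP.
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
from torch.distributed.fsdp import FullStateDictConfig, StateDictType

import determined as det


def main(core_context: det.core.Context) -> None:
    rank = core_context.distributed.rank
    device = torch.device("cuda", core_context.distributed.local_rank)
    torch.cuda.set_device(device)

    # Toy model and synthetic data standing in for the real thing.
    model = FSDP(torch.nn.Linear(32, 32).to(device), device_id=device)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

    steps_completed = 0
    for step in range(100):  # stand-in for a real dataloader loop
        batch = torch.randn(8, 32, device=device)
        loss = model(batch).square().mean()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
        steps_completed = step + 1

        # Metrics logging works exactly as it does with DDP.
        core_context.train.report_training_metrics(
            steps_completed=steps_completed, metrics={"loss": loss.item()}
        )

    # Option 1: full checkpoint, gathered onto rank 0 and saved only there.
    cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
    with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
        state = model.state_dict()
    if rank == 0:
        metadata = {"steps_completed": steps_completed}
        with core_context.checkpoint.store_path(metadata) as (path, _):
            torch.save(state, path / "model.pt")

    # Option 2 (sketch): sharded checkpointing; every rank saves its own
    # shard and passes shard=True so Determined merges the per-rank files
    # into a single checkpoint.
    # with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
    #     shard = model.state_dict()
    # with core_context.checkpoint.store_path(metadata, shard=True) as (path, _):
    #     torch.save(shard, path / f"model-rank{rank}.pt")


if __name__ == "__main__":
    # Launch with the standard torch distributed launcher, e.g.
    #   torchrun --nproc_per_node=2 train.py
    dist.init_process_group("nccl")
    distributed = det.core.DistributedContext.from_torch_distributed()
    with det.core.init(distributed=distributed) as core_context:
        main(core_context)
```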