determined-ai / determined-examples

Example ML projects that use the Determined library.
https://github.com/determined-ai/determined
Apache License 2.0

Requesting example to use PyTorch FSDP #19

Open abdulmuneer opened 4 months ago

abdulmuneer commented 4 months ago

Hi, does Determined support PyTorch FSDP for distributed training? I can see examples for DeepSpeed, but I specifically need to use the native FSDP feature of PyTorch 2.2 (something like https://pytorch.org/tutorials/intermediate/FSDP_adavnced_tutorial.html?highlight=pre%20training).

ioga commented 4 months ago

Hello, we haven't added one here yet, but there's an unofficial example here: https://github.com/garrett361/determined/tree/scratchwork/scratchwork/fsdp_min

For context, PyTorchTrial does not support FSDP, and there are no plans to add it. For FSDP, you should use the Core API instead, and it will be effectively the same as torch DDP: the standard torch distributed launcher works the same, and metrics logging and hyperparameter search work the same. If you checkpoint the full model from rank 0, that works the same as well. If you want to do sharded checkpointing, use the shard=True option.
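
To make that concrete, here is a minimal sketch of what a Core API training script with FSDP could look like. It is not the linked example and not an official one. The toy model, the random data, the step counts, the file name model.pt and the checkpoint_metadata helper are all assumptions made for illustration. It saves a full checkpoint from rank 0 only.

```python
import os


def checkpoint_metadata(steps_completed: int) -> dict:
    """Metadata stored with the checkpoint (hypothetical helper)."""
    return {"steps_completed": steps_completed}


def main() -> None:
    # Heavy dependencies are imported here so the helper above stays
    # importable on machines without torch or determined installed.
    import determined as det
    import torch
    import torch.distributed as dist
    from torch.distributed.fsdp import (
        FullStateDictConfig,
        FullyShardedDataParallel as FSDP,
        StateDictType,
    )

    # The torch distributed launcher provides RANK, LOCAL_RANK, WORLD_SIZE.
    dist.init_process_group("nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))

    distributed = det.core.DistributedContext.from_torch_distributed()
    with det.core.init(distributed=distributed) as core_context:
        # Toy model and optimizer (assumptions); build the optimizer after wrapping.
        model = FSDP(torch.nn.Linear(32, 1).cuda())
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        total_steps = 100

        for step in range(1, total_steps + 1):
            x = torch.randn(16, 32, device="cuda")  # random stand-in data
            loss = model(x).pow(2).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # Report this rank's local loss from rank 0 only.
            if step % 10 == 0 and core_context.distributed.rank == 0:
                core_context.train.report_training_metrics(
                    steps_completed=step, metrics={"loss": loss.item()}
                )

        # Gathering the full state dict is a collective call: every rank
        # must enter it, even though only rank 0 receives the result.
        cfg = FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
        with FSDP.state_dict_type(model, StateDictType.FULL_STATE_DICT, cfg):
            state = model.state_dict()

        if core_context.distributed.rank == 0:
            metadata = checkpoint_metadata(total_steps)
            with core_context.checkpoint.store_path(metadata) as (path, _uuid):
                torch.save(state, path / "model.pt")
        # Sharded alternative: every rank calls store_path(metadata, shard=True)
        # and writes its own uniquely named file into the returned path.

    dist.destroy_process_group()


# Only start training when a torch distributed launcher has set LOCAL_RANK.
if __name__ == "__main__" and "LOCAL_RANK" in os.environ:
    main()
```

Assuming the script is saved as train.py (a hypothetical name), the experiment entrypoint would be something along the lines of python3 -m determined.launch.torch_distributed python3 train.py, the same as for a DDP script.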