I am using distributed training with FastDP and have questions about its integration with DeepSpeed. This is my first time using DeepSpeed, so I apologize if some of these questions are trivial:
Are all three stages of DeepSpeed ZeRO necessary for distributed DPSGD training?
The image classification examples have two Python files, one for stage 1 and one for stages 2 and 3. Are both of them differentially private?
For stage 1, privacyengine.attach() is not recommended. How is dp_step() called, then?
The requirements file lists older versions of torch and deepspeed. Have you tested with any newer versions?
Thank you for using FastDP! To answer your questions:
You only need one of the three stages for distributed DPSGD. As you move up the stages, you trade communication time for memory efficiency (i.e., training is slower, but you can train larger models).
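For reference, the ZeRO stage is selected in the DeepSpeed config you pass to deepspeed.initialize(). A minimal sketch is below; the batch size and fp16 settings are placeholders for illustration, not FastDP defaults:

```python
import deepspeed

# Placeholder config; only the "stage" field matters for choosing ZeRO 1/2/3.
ds_config = {
    "train_micro_batch_size_per_gpu": 32,
    "zero_optimization": {
        "stage": 1,  # 1, 2, or 3: higher stages save more memory but add communication
    },
    "fp16": {"enabled": False},
}

# model is assumed to be defined elsewhere.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```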
Yes, both are differentially private and should give you similar results.
For all stages, you don't need .attach() or dp_step(). The modification is applied to the gradients, not to the optimizer, so the regular step() works directly.
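A minimal sketch of the resulting training loop, assuming the FastDP privacy engine has already registered backward hooks on the model so gradients are clipped and noised before the optimizer sees them (the constructor arguments shown here are placeholders and may differ from the library's actual API; see the repo's image classification examples for the exact setup):

```python
import torch
import deepspeed
from fastDP import PrivacyEngine  # import path assumed; check the repo's examples

# Assumed to exist: model, train_loader, ds_config (with the desired ZeRO stage).
# The privacy engine hooks the model's backward pass, so no .attach() or dp_step()
# is needed -- DeepSpeed's regular step() sees already-privatized gradients.
privacy_engine = PrivacyEngine(
    model,
    batch_size=32,        # placeholder values for illustration
    sample_size=50_000,
    epochs=3,
    target_epsilon=2.0,
)

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

for inputs, labels in train_loader:
    inputs = inputs.to(model_engine.device)
    labels = labels.to(model_engine.device)
    loss = torch.nn.functional.cross_entropy(model_engine(inputs), labels)
    model_engine.backward(loss)  # backward hooks modify the gradients here
    model_engine.step()          # regular step(); no dp_step() required
```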
I haven't tested on newer DeepSpeed versions. It should work on torch>=2.2.0.
Thank you!