baidu-research / tripmaster

Apache License 2.0

PaddleDDPMachine does not call DataParallel.forward but Machine.forward, causing error in DDP training #2

Open rudaoshi opened 1 year ago

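In SuperviseOperator, self.machine.forward_with_validation is called. However, when the machine is a PaddleDDPMachine, forward_with_validation is not defined on the wrapper itself, so according to the following code the lookup falls through to getattr(self.module, "forward_with_validation"). The returned method is bound to the original machine, so self inside it is the unwrapped machine, not the DataParallel-wrapped one. Any self.forward call it makes therefore reaches Machine.forward and bypasses DataParallel.forward.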

class PaddleDDPMachine(paddle.DataParallel):

    def __init__(self, machine, *args, **kwargs):

        super().__init__(machine, *args, **kwargs)

    def __getattr__(self, name):

        try:
            return paddle.DataParallel.__getattr__(self, name)
        except AttributeError as e:
            if hasattr(self.module, name):
                return getattr(self.module, name)
            else:
                print(f"what happens {self.__dict__}")
                raise e
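
The binding behaviour described above can be reproduced without Paddle. The sketch below is only an illustration: Machine, Wrapper, and RebindingWrapper are hypothetical stand-ins (not tripmaster or Paddle classes), and the string return values exist purely to show which forward was reached. It also shows one possible workaround, rebinding delegated methods to the wrapper, which is an assumption on my part and not necessarily how the project should fix it.

```python
import types


class Machine:
    """Hypothetical stand-in for the user's machine."""

    def forward(self, x):
        return f"Machine.forward({x})"

    def forward_with_validation(self, x):
        # `self` is whatever object this method was bound to.
        return self.forward(x)


class Wrapper:
    """Hypothetical stand-in for a DataParallel-style wrapper."""

    def __init__(self, module):
        self.module = module

    def forward(self, x):
        # A real DataParallel would do its distributed bookkeeping here.
        return f"Wrapper.forward -> {self.module.forward(x)}"

    def __getattr__(self, name):
        # Only invoked when normal lookup fails, as in the issue's fallback.
        return getattr(self.module, name)


class RebindingWrapper(Wrapper):
    """Possible workaround: rebind delegated methods to the wrapper."""

    def __getattr__(self, name):
        attr = getattr(self.module, name)
        func = getattr(attr, "__func__", None)
        if func is not None and getattr(attr, "__self__", None) is self.module:
            # Bind the underlying function to the wrapper, so that
            # self.forward inside it resolves to Wrapper.forward.
            return types.MethodType(func, self)
        return attr


plain = Wrapper(Machine())
rebound = RebindingWrapper(Machine())

# The delegated method is bound to the inner machine: the wrapper is bypassed.
print(plain.forward_with_validation(1))    # Machine.forward(1)
# After rebinding, the call goes through the wrapper's forward.
print(rebound.forward_with_validation(1))  # Wrapper.forward -> Machine.forward(1)
```

With rebinding, attribute reads inside the delegated method still reach the inner machine through the wrapper's __getattr__, but attribute writes would land on the wrapper, so this approach needs care in real code.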