NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in PyTorch
BSD 3-Clause "New" or "Revised" License

DDP Failed when using the parameters directly to calculate the loss. #436

Open beckybai opened 5 years ago

beckybai commented 5 years ago
 import argparse
 import os

 import torch
 from apex import amp
 from apex.parallel import DistributedDataParallel

 parser = argparse.ArgumentParser()
 # FOR DISTRIBUTED:  Parse for the --local_rank argument, which torch.distributed.launch
 # supplies automatically.
 parser.add_argument("--local_rank", default=0, type=int)
 args = parser.parse_args()

 # FOR DISTRIBUTED:  When running under torch.distributed.launch,
 # the 'WORLD_SIZE' environment variable will also be set automatically.
 args.distributed = False
 if 'WORLD_SIZE' in os.environ:
     args.distributed = int(os.environ['WORLD_SIZE']) > 1

 if args.distributed:
     # FOR DISTRIBUTED:  Set the device according to local_rank.
     torch.cuda.set_device(args.local_rank)

     # FOR DISTRIBUTED:  Initialize the backend.  torch.distributed.launch will provide
     # environment variables, and requires that you use init_method=`env://`.
     torch.distributed.init_process_group(backend='nccl',
                                          init_method='env://')

 torch.backends.cudnn.benchmark = True

 N, D_in, D_out = 64, 1024, 1

 # Each process receives its own batch of "fake input data" and "fake target data."
 # The "training loop" in each process just uses this fake batch over and over.
 # https://github.com/NVIDIA/apex/tree/master/examples/imagenet provides a more realistic
 # example of distributed data sampling for both training and validation.
 x = torch.randn(D_in, device='cuda')
 y = torch.randn(D_out, device='cuda')

 model = torch.nn.Linear(D_in, D_out).cuda()
 optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

 model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

 if args.distributed:
     # FOR DISTRIBUTED:  After amp.initialize, wrap the model with
     # apex.parallel.DistributedDataParallel.
     model = DistributedDataParallel(model, delay_allreduce=False)
     # torch.nn.parallel.DistributedDataParallel is also fine, with some added args:
     # model = torch.nn.parallel.DistributedDataParallel(model,
     #                                                   device_ids=[args.local_rank],
     #                                                   output_device=args.local_rank)

     # print(model.callback_queued)
 loss_fn = torch.nn.MSELoss()
 print(y)

 for t in range(500):
     optimizer.zero_grad()
     #y_pred = model(x)
     loss = loss_fn(model.module.weight, x.view(1,-1))  # loss built directly from the weight; no forward pass is run
     print(loss)
     with amp.scale_loss(loss, optimizer) as scaled_loss:
         scaled_loss.backward(retain_graph=False)
     optimizer.step()

 if args.local_rank == 0:
     print("final loss = ", loss)

This code leads to an AttributeError:

AttributeError: 'DistributedDataParallel' object has no attribute 'needs_refresh'

The main reason is that there is no explicit forward pass in this code, so DistributedDataParallel never gets a chance to collect the parameters.
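
A possible workaround sketch (not an official apex fix, just following the reasoning above): run at least one forward pass through the wrapped model each iteration so DistributedDataParallel gets a chance to collect the parameters, and keep that output attached to the loss. The `0.0 * y_pred.sum()` term below only ties the forward graph into the backward without changing the loss value; whether the gradient of the parameter-only term is still all-reduced correctly across ranks would need to be verified.

    for t in range(500):
        optimizer.zero_grad()
        y_pred = model(x)  # forward pass lets apex's DDP do its parameter bookkeeping
        loss = loss_fn(model.module.weight, x.view(1, -1)) + 0.0 * y_pred.sum()
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()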

noirmist commented 5 years ago

Did you solve the issue?

I'm working on implementing GradNorm with Apex, and I ran into the same issue: my training gets stuck when I access model.parameters().

Here is the detailed code:

    print("stuck")
    sys.stdout.flush()
    # Getting gradients of the first layers of each tower and calculate their l2-norm 
    param = list(model.parameters())
    print(param[0])
    sys.stdout.flush() # param[0] is not shown.

    G0R = torch.autograd.grad(l0, param[0], retain_graph=True, create_graph=True)
    G0 = torch.norm(G0R[0], 2)
    print("stuck0")
    sys.stdout.flush()

    G1R = torch.autograd.grad(l1, param[0], retain_graph=True, create_graph=True)
    G1 = torch.norm(G1R[0], 2)
    print("stuck1")
    sys.stdout.flush()

    G2R = torch.autograd.grad(l2, param[0], retain_graph=True, create_graph=True)
    G2 = torch.norm(G2R[0], 2)
    print("stuck2")
    sys.stdout.flush()
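
For context, here is a stripped-down, single-process version of what the snippet above computes (no DDP wrapper; the two-task toy model and losses are made up purely for illustration, with l0 and l1 standing in for the per-task losses):

    import torch

    # Toy shared trunk with two task heads (illustrative only).
    shared = torch.nn.Linear(16, 8)
    head0 = torch.nn.Linear(8, 1)
    head1 = torch.nn.Linear(8, 1)

    x = torch.randn(4, 16)
    h = shared(x)
    l0 = head0(h).pow(2).mean()  # stand-in for task-0 loss
    l1 = head1(h).pow(2).mean()  # stand-in for task-1 loss

    # Gradient of each task loss w.r.t. the shared first-layer weight, and its
    # L2 norm -- these are the per-task gradient norms that GradNorm balances.
    w = shared.weight
    G0R = torch.autograd.grad(l0, w, retain_graph=True, create_graph=True)
    G0 = torch.norm(G0R[0], 2)
    G1R = torch.autograd.grad(l1, w, retain_graph=True, create_graph=True)
    G1 = torch.norm(G1R[0], 2)
    print(G0.item(), G1.item())
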
xieydd commented 5 years ago

@noirmist Hey, did you solve the problem?

xieydd commented 5 years ago

All, I solved the problem; see the linked issue #457.