amdegroot / ssd.pytorch

A PyTorch Implementation of Single Shot MultiBox Detector
MIT License

Bug in train.py #174

Open bobo0810 opened 6 years ago

bobo0810 commented 6 years ago

At around line 165 of train.py:

    images, targets = next(batch_iterator)

Bug description: this call can only read through the dataset once. Once the iterator is exhausted (i.e. after one full pass over the data), next() raises StopIteration and training stops.

Solution: change the code to:

    # load train data
    try:
        images, targets = next(batch_iterator)
    except StopIteration:
        # iterator exhausted: start a new pass over the dataset
        batch_iterator = iter(data_loader)
        images, targets = next(batch_iterator)
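For context, a minimal sketch of where this fix sits in train.py's iteration-based loop (logging, LR scheduling, and the actual training step omitted):

    batch_iterator = iter(data_loader)
    for iteration in range(args.start_iter, cfg['max_iter']):
        # load train data
        try:
            images, targets = next(batch_iterator)
        except StopIteration:
            # exhausted after one pass over the dataset; restart
            batch_iterator = iter(data_loader)
            images, targets = next(batch_iterator)
        # ... forward pass, MultiBoxLoss, backward, optimizer.step() ...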

visor2020 commented 6 years ago

You should reload the data. What I mean is that you should copy `data_loader = data.DataLoader(dataset, args.batch_size, num_workers=args.num_workers, shuffle=True, collate_fn=detection_collate, pin_memory=True)` into the `except` block.
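A sketch of that variant, assuming `dataset`, `args`, and `detection_collate` are in scope as in train.py (whether re-creating the loader is actually necessary is questioned below):

    try:
        images, targets = next(batch_iterator)
    except StopIteration:
        # re-create the DataLoader (re-shuffling) and restart its iterator
        data_loader = data.DataLoader(dataset, args.batch_size,
                                      num_workers=args.num_workers,
                                      shuffle=True,
                                      collate_fn=detection_collate,
                                      pin_memory=True)
        batch_iterator = iter(data_loader)
        images, targets = next(batch_iterator)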

chi0tzp commented 6 years ago

As @bobo0810 mentioned, this bug occurs because the batch iterator eventually runs through the whole dataset. So, I guess this means it happens after exactly one epoch? Am I missing something here?

I just wonder why the developers did not implement an approach like `for epoch in range(num_epochs): ...` instead. Wouldn't that make more sense?
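A rough sketch of that restructuring (`num_epochs` is a hypothetical placeholder; the repo uses a fixed iteration count instead):

    # epoch-based loop: the DataLoader's iterator is recreated each epoch
    for epoch in range(num_epochs):
        for images, targets in data_loader:
            out = net(images)
            optimizer.zero_grad()
            loss_l, loss_c = criterion(out, targets)
            loss = loss_l + loss_c
            loss.backward()
            optimizer.step()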

bobo0810 commented 6 years ago

@visor2020 After testing, it turns out there is in fact no need to reload the dataset; restarting the iterator with `iter(data_loader)` is enough.

visor2020 commented 6 years ago

@bobo0810 However, have you actually trained and saved the weights yourself with that fix? Is that a fact?

shufanwu commented 6 years ago

You can see the code at line 122 of voc0712.py: `return len(self.ids)`. The iterator only iterates once, so there are two ways to solve this problem:

1. Replace the iterator with the epoch-loop form mentioned by @chi0tzp.
2. Modify `__len__()` in voc0712.py (see the caveat after this list):

        def __len__(self):
            return self.total_images

   where `self.total_images` is `(cfg['max_iter'] - args.start_iter) * args.batch_size`.
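One caveat with the second method: the DataLoader's sampler draws indices from `range(len(dataset))`, so with an inflated `__len__` the dataset's `__getitem__` would also need to wrap out-of-range indices, e.g. (a sketch against voc0712.py's existing method):

    def __getitem__(self, index):
        index = index % len(self.ids)  # wrap indices past the real dataset size
        im, gt, h, w = self.pull_item(index)
        return im, gt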

YingdiZhang commented 6 years ago

@bobo0810 You should change `batch_iterator = iter(data_loader)` to `batch_iterator = None` and then add this to the beginning of the for loop:

    if (not batch_iterator) or (iteration % epoch_size == 0):
        # create batch iterator
        batch_iterator = iter(data_loader)
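Putting it together (train.py computes `epoch_size = len(dataset) // args.batch_size`, so this re-creates the iterator once per pass; training step elided):

    epoch_size = len(dataset) // args.batch_size
    batch_iterator = None
    for iteration in range(args.start_iter, cfg['max_iter']):
        if (not batch_iterator) or (iteration % epoch_size == 0):
            # (re)create the batch iterator at each epoch boundary
            batch_iterator = iter(data_loader)
        images, targets = next(batch_iterator)
        # ... forward pass, loss, backward, optimizer step ...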
hust-kevin commented 6 years ago

@bobo0810 @visor2020 Hey! I have a question about train.py: why does it sometimes use `ssd_net` and sometimes `net`?

    ssd_net = build_ssd('train', cfg['min_dim'], cfg['num_classes'])
    net = ssd_net

if args.cuda:
    net = torch.nn.DataParallel(ssd_net)
    cudnn.benchmark = True

if args.resume:
    print('Resuming training, loading {}...'.format(args.resume))
    ssd_net.load_weights(args.resume)
else:
    vgg_weights = torch.load(args.save_folder + args.basenet)
    print('Loading base network...')
    ssd_net.vgg.load_state_dict(vgg_weights)  # load pretrained VGG base weights

if args.cuda:
    net = net.cuda()

if not args.resume:
    print('Initializing weights...')
    # initialize newly added layers' weights with xavier method
    ssd_net.extras.apply(weights_init)
    ssd_net.loc.apply(weights_init)
    ssd_net.conf.apply(weights_init)

optimizer = optim.SGD(net.parameters(), lr=args.lr, momentum=args.momentum,
                      weight_decay=args.weight_decay)
criterion = MultiBoxLoss(cfg['num_classes'], 0.5, True, 0, True, 3, 0.5,
                         False, args.cuda)

I can't understand this. Can you help me? Thank you.

bobo0810 commented 6 years ago

@hust-kevin `torch.nn.DataParallel` returns a new wrapper model for multi-GPU training. The forward pass and optimizer go through `net` (which may be that wrapper), while weight loading and submodule access still go through `ssd_net`, the underlying model.
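A minimal sketch of the pattern, with a toy module standing in for the SSD model:

    import torch
    import torch.nn as nn

    ssd_net = nn.Linear(4, 2)   # stands in for build_ssd(...)
    net = ssd_net               # second name for the same object

    if torch.cuda.is_available():
        net = nn.DataParallel(ssd_net)  # net is now a wrapper module;
                                        # ssd_net still refers to the model
        net = net.cuda()

    # forward/backward go through `net`; in the real train.py, attribute
    # access like ssd_net.vgg or ssd_net.load_weights(...) keeps working
    # because ssd_net is never rebound to the wrapper

Keeping both names means code like `ssd_net.vgg.load_state_dict(...)` works the same whether or not DataParallel is in use.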

kuaitoukid commented 5 years ago

> DataParallel

So why not just use `ssd_net = torch.nn.DataParallel(ssd_net)`?
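Presumably because the rest of train.py accesses submodules directly on `ssd_net` (e.g. `ssd_net.vgg`, `ssd_net.load_weights`); if that name were rebound to the wrapper, those accesses would have to go through `.module`. A quick illustration with a hypothetical toy module:

    import torch.nn as nn

    class Toy(nn.Module):
        def __init__(self):
            super().__init__()
            self.vgg = nn.Linear(4, 4)

    ssd_net = nn.DataParallel(Toy())
    # ssd_net.vgg              # AttributeError: the wrapper has no 'vgg'
    print(ssd_net.module.vgg)  # the original submodule now lives under .module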

zhaohao0404 commented 4 years ago

The comments here are quite wonderful and useful.