TreB1eN / InsightFace_Pytorch

PyTorch 0.4.1 code for InsightFace
MIT License

Code does not seem to support resuming training from a saved weight? #38

Open qq184861643 opened 5 years ago

qq184861643 commented 5 years ago

I was training a model on CASIA-WebFace and it stopped accidentally partway through the total epochs. So I added some lines to Learner.py and tried to resume training, but it failed. Here is my resuming code:

    def train(self, conf, epochs, resume=False, fixed_str=None):
        self.model.train()
        conf.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        # wrap model and head for multi-GPU training
        self.model = nn.DataParallel(self.model)
        self.head = nn.DataParallel(self.head)
        self.model.cuda()
        self.head.cuda()
        start_epoch = 0
        if resume:
            if not fixed_str:
                raise ValueError('must pass the fixed_str parameter when resuming!')
            self.load_state(conf, fixed_str)
            # recover the global step from the checkpoint filename,
            # then round it down to the first step of that epoch
            self.step = int(fixed_str.split('_')[-2].split(':')[1]) + 1
            start_epoch = self.step // len(self.loader)
            self.step = start_epoch * len(self.loader) + 1
            print('loading model at epoch {} done!'.format(start_epoch))
            print(self.optimizer)
        running_loss = 0.
        dc_loss = 0.
        bceloss_func = nn.BCELoss()
        for e in range(start_epoch, epochs):
            print('epoch {} started'.format(e))
            # learning-rate schedule: decay at each milestone epoch
            if e == self.milestones[0]:
                self.schedule_lr()
            if e == self.milestones[1]:
                self.schedule_lr()
            # nothing changed below

I changed nothing below that point. The weird thing is that whenever I load the weights of the model, head, and optimizer and then continue training, I get a very high CE loss. I tested it in an IPython notebook: when I randomly initialize a learner, I get a CE loss around 45, but when I load weights (which reach 93% accuracy on LFW) into the learner, I get a CE loss around 77. I think the problem lies in the logic of the Arcface class in Learner.py, but I am not sure. Could anyone help me figure out the issue?
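
For reference, here is a minimal sketch of what an ArcFace-style head computes. This is not the repo's exact Arcface class; the class name, defaults, and num_classes (matching CASIA-WebFace's 10,575 identities) are assumptions. The point is that the class-center matrix is a learned parameter: if the head's weights are not restored together with the backbone, the margin is measured against random centers and the CE loss stays high even for a well-trained backbone.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ArcMarginHead(nn.Module):
        """Sketch of an ArcFace-style head: logits = s * cos(theta + m) at the target class."""
        def __init__(self, embedding_size=512, num_classes=10575, s=64.0, m=0.5):
            super().__init__()
            # learned class centers -- these must be saved/loaded along with the backbone
            self.kernel = nn.Parameter(torch.randn(embedding_size, num_classes))
            nn.init.xavier_uniform_(self.kernel)
            self.s, self.m = s, m

        def forward(self, embeddings, labels):
            # cos(theta) between L2-normalised embeddings and class centers
            cos_theta = F.normalize(embeddings, dim=1) @ F.normalize(self.kernel, dim=0)
            cos_theta = cos_theta.clamp(-1.0 + 1e-7, 1.0 - 1e-7)
            theta = torch.acos(cos_theta)
            # add the angular margin m only at the ground-truth class
            one_hot = torch.zeros_like(cos_theta).scatter_(1, labels.view(-1, 1), 1.0)
            logits = torch.where(one_hot > 0, torch.cos(theta + self.m), cos_theta)
            return logits * self.s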

boomberung commented 5 years ago

Resuming training works well for me. I didn't change anything in the train method and I use the Arcface head.

qq184861643 commented 5 years ago

> Resuming training works well for me. I didn't change anything in the train method and I use the Arcface head.

@boomberung Thanks. Then maybe my problem lies in nn.DataParallel. I will try it later.
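
One thing worth checking along those lines: nn.DataParallel prefixes every parameter key with "module.", so a checkpoint saved from a bare model will not match a wrapped one (a strict load_state_dict raises a key error). Below is a small helper that normalises the prefix either way -- a sketch, assuming the checkpoints are plain state dicts saved with torch.save as in this repo:

    import torch

    def load_flexible(module, ckpt_path, device='cuda'):
        """Load a state dict into `module` whether or not the checkpoint was
        saved from an nn.DataParallel wrapper, by normalising the 'module.' prefix."""
        state = torch.load(ckpt_path, map_location=device)
        wrapped = isinstance(module, torch.nn.DataParallel)
        fixed = {}
        for k, v in state.items():
            if wrapped and not k.startswith('module.'):
                fixed['module.' + k] = v            # bare checkpoint -> wrapped model
            elif not wrapped and k.startswith('module.'):
                fixed[k[len('module.'):]] = v       # wrapped checkpoint -> bare model
            else:
                fixed[k] = v
        module.load_state_dict(fixed)

Calling this for both self.model and self.head would make the load independent of whether DataParallel is applied before or after restoring.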

boomberung commented 5 years ago

@qq184861643 When I change ArcFace to my own head I have the same issue as you: the random initial loss is ~45, but when I resume training from saved weights the loss starts from ~50 (and LFW accuracy is 94%).

DecentMakeover commented 5 years ago

Hi, maybe unrelated to your question, but I wanted to perform face verification, and the current ArcFace architecture does not perform well on my dataset. Is it possible to fine-tune the model with my custom dataset?

Thanks in advance

qq184861643 commented 5 years ago

@boomberung Hi! Have you figured out how to solve this? I've tried several methods but still can't fix it.

qq184861643 commented 5 years ago

@DecentMakeover If we can't solve the resuming issue, I don't think fine-tuning is possible either.

boomberung commented 5 years ago

@qq184861643 No, but I found that even though the loss curve looks off, the network is learning normally. And I think the problem is with this line: "loss_board = running_loss / self.board_loss_every"
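
If that line is the culprit, the logged number is only a window average that goes wrong around a resume: the accumulator restarts at zero while the step counter lands mid-window, so the first logged point divides a partial sum by the full board_loss_every. Here is a sketch of resume-safe bookkeeping with an explicit window counter (train_batch and writer are hypothetical stand-ins; only the averaging logic is the point):

    def log_loss_resume_safe(loader, train_batch, writer, board_loss_every, start_step=0):
        """Average the loss over the number of steps actually accumulated,
        instead of dividing by a fixed constant."""
        running_loss = 0.0
        steps_in_window = 0
        step = start_step
        for imgs, labels in loader:
            running_loss += train_batch(imgs, labels)   # stand-in: returns a float batch loss
            steps_in_window += 1
            step += 1
            if steps_in_window == board_loss_every:
                writer.add_scalar('train_loss', running_loss / steps_in_window, step)
                running_loss, steps_in_window = 0.0, 0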

LaviLiu commented 5 years ago

@qq184861643 @boomberung Have you solved the problem with resuming?

sangtv9 commented 3 years ago

> Hi, maybe unrelated to your question, but I wanted to perform face verification, and the current ArcFace architecture does not perform well on my dataset. Is it possible to fine-tune the model with my custom dataset?
>
> Thanks in advance

Yes, it is possible. But I cannot get high accuracy when training on my custom dataset. Do you have any idea how to solve it?
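
For what it's worth, here is a minimal fine-tuning sketch, with hypothetical names: build_backbone stands in for this repo's model constructor, ArcMarginHead is the sketch from earlier in this thread, and NUM_IDS and the checkpoint path are placeholders for your setup. The idea is to keep the pretrained backbone, swap in a fresh head sized to your identity count, and give the backbone a smaller learning rate than the randomly initialised head:

    import torch
    from torch import optim

    backbone = build_backbone(embedding_size=512)                  # hypothetical constructor
    backbone.load_state_dict(torch.load('model_pretrained.pth'))   # placeholder checkpoint path

    head = ArcMarginHead(embedding_size=512, num_classes=NUM_IDS)  # fresh head for your identities

    # smaller LR for the pretrained backbone, larger for the new head
    optimizer = optim.SGD([
        {'params': backbone.parameters(), 'lr': 1e-3},
        {'params': head.parameters(),     'lr': 1e-2},
    ], momentum=0.9, weight_decay=5e-4)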