Can we merge two checkpoints?

Holmeyoung / crnn-pytorch

Pytorch implementation of CRNN (CNN + RNN + CTCLoss) for all language OCR.

MIT License


Can we merge two checkpoints? #29

Closed mariembenslama closed 4 years ago

mariembenslama commented 5 years ago

Hello, I wanted to ask whether we can merge two .pth files from this project that were trained on different datasets.

Thanks.

Holmeyoung commented 5 years ago

Hi. Long time no reply, haha

Maybe we can take the average of the two models' weights.

mariembenslama commented 5 years ago

Thanks, and long time no see.

My question is:

Which layers (of both models) should we take the weights from?
Once we have the weights, where do we store them?
Lastly, how do we generate one .pth file from two .pth files?

Holmeyoung commented 5 years ago

Hi, different people train with different character classes, so the last LSTM layer, whose output size depends on the number of classes, needs to be modified. We don't need to store the weights separately; we just load the weights of both models as we normally do, and the only difference is that we then take the average of the two models' weights.
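A minimal sketch of the averaging idea, assuming both checkpoints are state dicts of the same architecture (same parameter names and shapes). The helper `average_state_dicts` and the tiny stand-in models are hypothetical and not part of this repository; with real checkpoints the two dicts would come from `torch.load` on each .pth file. Nothing here guarantees that the averaged model performs well.

```python
import torch
from torch import nn


def average_state_dicts(sd_a, sd_b):
    """Element-wise mean of two state dicts with identical keys and shapes."""
    if sd_a.keys() != sd_b.keys():
        raise ValueError("checkpoints have different parameter names")
    merged = {}
    for name, tensor_a in sd_a.items():
        tensor_b = sd_b[name]
        if tensor_a.shape != tensor_b.shape:
            raise ValueError(f"shape mismatch for {name}")
        if tensor_a.is_floating_point():
            merged[name] = (tensor_a + tensor_b) / 2
        else:
            # Integer buffers (e.g. BatchNorm's num_batches_tracked) are not
            # averaged; this sketch simply keeps the value from the first model.
            merged[name] = tensor_a.clone()
    return merged


def make_model():
    # Hypothetical stand-in for the real network; any nn.Module works the same way.
    return nn.Sequential(nn.Linear(4, 3), nn.BatchNorm1d(3))


torch.manual_seed(0)
sd_a = make_model().state_dict()  # stands in for torch.load("first.pth")
sd_b = make_model().state_dict()  # stands in for torch.load("second.pth")

merged = average_state_dicts(sd_a, sd_b)

# Load the averaged weights into a fresh model of the same architecture.
merged_model = make_model()
merged_model.load_state_dict(merged)
```

If the two models were trained with different alphabets, the shape check above fails on the last layer, which is the case where that layer has to be modified first.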

mariembenslama commented 5 years ago

I see. And where do we store the weights after that? In the .pth file, right? Also, will averaging hurt the performance? What do you think?

Holmeyoung commented 5 years ago

I don't know why you want to merge the weights of two models. I think it will hurt the performance. If you only want to continue training from the merged model, we don't need to save it, though we can. In fact, a .pth file is just the saved weights of the model (plus its structure, if the whole model object was saved). So you are right, we can save it as a .pth file.
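As a rough illustration of the saving step, assuming the checkpoint is stored as a state dict: the weights are written with `torch.save` and read back with `torch.load`, and the architecture has to be rebuilt in code before loading. The `nn.Linear` model and the file name `merged.pth` are placeholders, not names from this project.

```python
import os
import tempfile

import torch
from torch import nn

# Hypothetical stand-in for a model that already holds the merged weights.
model = nn.Linear(4, 3)

# Write only the weights (the state dict) to a .pth file.
path = os.path.join(tempfile.mkdtemp(), "merged.pth")
torch.save(model.state_dict(), path)

# Later, or in another session: rebuild the same architecture, then load the weights.
restored = nn.Linear(4, 3)
restored.load_state_dict(torch.load(path, map_location="cpu"))
```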

mariembenslama commented 5 years ago

I thought I could train on different Google Colab accounts and then merge the checkpoints to speed up the training.

But thanks for the explanation again and again ^_^ !

Holmeyoung commented 5 years ago

You have a good idea, but distributed training isn't as simple as merging the weights of two models. The result isn't 1+1>2; it is more likely to be 1+1<2. If the reason you want to train on two accounts is the time limit, you can train on part of the data first, and then train the final model on the rest of the data, starting from the first-stage pretrained model.
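The two-stage suggestion could look roughly like the sketch below. It uses a toy regression problem and a single linear layer instead of the real CRNN and data loader, and `train_one_stage`, the checkpoint layout, and the file name are all assumptions made for illustration. The data is split at random so that both parts follow a similar distribution.

```python
import os
import tempfile

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset, random_split


def train_one_stage(model, optimizer, loader, epochs=5):
    """Plain training loop over one part of the data."""
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()


# Toy data, split at random into two parts of equal size.
torch.manual_seed(0)
inputs = torch.randn(200, 4)
targets = inputs @ torch.randn(4, 1)
first_part, second_part = random_split(TensorDataset(inputs, targets), [100, 100])

ckpt_path = os.path.join(tempfile.mkdtemp(), "stage1.pth")

# Stage 1 (first session): train on the first part and save a checkpoint.
model = nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
train_one_stage(model, optimizer, DataLoader(first_part, batch_size=20, shuffle=True))
torch.save({"model": model.state_dict(), "optimizer": optimizer.state_dict()}, ckpt_path)

# Stage 2 (second session): rebuild, load the checkpoint, continue on the rest.
resumed = nn.Linear(4, 1)
resumed_optimizer = torch.optim.SGD(resumed.parameters(), lr=0.05)
checkpoint = torch.load(ckpt_path, map_location="cpu")
resumed.load_state_dict(checkpoint["model"])
resumed_optimizer.load_state_dict(checkpoint["optimizer"])
train_one_stage(resumed, resumed_optimizer, DataLoader(second_part, batch_size=20, shuffle=True))
```

Saving the optimizer state alongside the weights is optional here; it matters more for optimizers that keep running statistics, such as Adam.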

mariembenslama commented 5 years ago

I see, thanks. But doesn't training on only part of the data mean fitting an incomplete curve that captures only some of the features?

Holmeyoung commented 5 years ago

More data is used to prevent overfitting.