Holmeyoung / crnn-pytorch

Pytorch implementation of CRNN (CNN + RNN + CTCLoss) for all language OCR.
MIT License

Accuracy and epochs #26

Closed mariembenslama closed 5 years ago

mariembenslama commented 5 years ago

Hello, long time no see :D

I want to ask (I probably asked the same question before but forgot the answer, sorry ^^" lol):

I'm training the model on about 7,000+ Japanese and English characters, with 10M training samples and 1M test samples.

The accuracy gets high (about 50% while still in epoch 0, say after 5k images) and the loss is low (0.03 and still decreasing). However, when I give it a real-life image (of the same kind as the test samples), it makes wildly wrong guesses (lol).

What do you think the problem is? Should I kill the process, or wait for the epochs to finish?

Holmeyoung commented 5 years ago

Hi, I'm sorry to reply so late.

  1. Is 10M the number of training samples, or the size of the data on disk?
  2. What accuracy does your model finally reach?
  3. If the model can get 95%+ acc on the test data, try using it to predict a real-life image and tell me the result; I can analyse it for you~~~
Holmeyoung commented 5 years ago

I fixed a bug in this project over the past few days, so I didn't check GitHub. You can pull the latest code to test~ Thank you~

mariembenslama commented 5 years ago
  1. 10M is the number of training images.
  2. Because of some interruptions (shutting down the machine and so on), I have these checkpoints:
    • model_0_84000.pth
    • model_0_1000.pth
    • model_0_63000.pth
      Test loss: 0.000928, accuracy: 0.9207813.
  3. The screenshot (the result of one example) [the image is grayscale and the predicted text is 爺丶値巛・孃働纎緻]: Capture d’écran de 2019-09-02 10-21-57
Holmeyoung commented 5 years ago

Hi mariem, you can pull the latest code. In the original code the validation loop was capped at 100 iterations (it always evaluated just 100 batches of the val_dataloader), so it couldn't reflect the performance on the whole dataset.
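A minimal sketch of why a capped validation loop can mislead, in plain Python (the dummy loader and function names are illustrative, not the project's actual API). If the first batches happen to be easy, a 100-iteration cap reports a much rosier accuracy than the whole set:

```python
def validate(val_batches, max_iters=None):
    """Accumulate accuracy over batches, optionally capping the loop.

    val_batches: iterable of (n_correct, n_total) pairs, one per batch.
    max_iters:   if set, stop after this many batches (the old behaviour);
                 if None, walk the whole validation set (the fix).
    """
    correct = total = 0
    for i, (n_correct, n_total) in enumerate(val_batches):
        if max_iters is not None and i >= max_iters:
            break
        correct += n_correct
        total += n_total
    return correct / total

# A skewed dummy val set: the first 100 batches are "easy", the rest are not.
batches = [(10, 10)] * 100 + [(5, 10)] * 900

capped = validate(batches, max_iters=100)  # sees only the easy batches -> 1.0
full = validate(batches)                   # whole set -> (1000 + 4500) / 10000 = 0.55
```

The fixed code simply iterates over the entire val loader instead of breaking at a fixed count.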

mariembenslama commented 5 years ago

Hello, thanks for the answer,

Can you explain this detail more, please?

Holmeyoung commented 5 years ago

I just suspect that the Test loss: 0.000928, accuracy: 0.9207813 was not evaluated on the whole val dataset but only over 100 iterations~ So the value may not be so accurate~

mariembenslama commented 5 years ago

Ohhh, I see; yes, I had always noticed that but didn't mention it lol

However, hmm, I'm still wondering: even after changing that, what makes the model accurate? I mean, what will reliably boost accuracy on real-life images? Is it the data? The absence of rotation in the real-life images? I want to know the key factor needed to recognize all images accurately 😅

Holmeyoung commented 5 years ago

Ohhh, it's because you get 92% acc but the model does badly on real images. So we can change this to see the real performance on the whole val data. As for boosting accuracy: if there is no rotation in the real images, then rotation in the training data is useless (I think so). Anyway, more data is always the best way~~~
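The point about rotation generalizes: augmentations should mirror what deployment images actually contain. A toy sketch of that idea in plain Python (images as row-major grids; all function names here are made up for illustration, and a real pipeline would use an image library):

```python
import random

def add_noise(img, rng):
    """Perturb pixels slightly, mimicking camera/scanner noise."""
    return [[min(255, max(0, p + rng.randint(-8, 8))) for p in row] for row in img]

def rotate_90(img, rng):
    """Rotate a row-major grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def build_pipeline(expect_rotation):
    """Only include augmentations that match real deployment conditions."""
    augs = [add_noise]          # noise shows up in most real captures
    if expect_rotation:         # skip rotation if real-life text is always upright
        augs.append(rotate_90)
    return augs

def augment(img, augs, rng):
    for aug in augs:
        img = aug(img, rng)
    return img

rng = random.Random(0)
sample = [[128] * 4 for _ in range(2)]                       # 2 rows x 4 cols
augmented = augment(sample, build_pipeline(True), rng)       # rotated: 4 rows x 2 cols
```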

mariembenslama commented 5 years ago

Also, do we have to complete all the epochs, or can we stop as soon as we get good accuracy (now, with the new code)?

mariembenslama commented 5 years ago

Also, should I resume training from my checkpoint (the one with 92%), or restart training from scratch?

Holmeyoung commented 5 years ago

Also: the total number of epochs is 1000, which is faaaar too big. Just stop when the val loss stops decreasing and the val acc stops increasing. And we can just resume training from the checkpoints; the model structure didn't change.
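"Stop when the val loss stops decreasing" is classic early stopping. A minimal sketch of a helper one could wire into the training loop (the project has no such class; the name and interface here are invented for illustration):

```python
class EarlyStopper:
    """Signal a stop once val loss hasn't improved for `patience` checks."""

    def __init__(self, patience=5, min_delta=1e-4):
        self.patience = patience      # how many non-improving checks to tolerate
        self.min_delta = min_delta    # smallest change that counts as improvement
        self.best = float("inf")
        self.bad_checks = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss      # improved: reset the counter
            self.bad_checks = 0
        else:
            self.bad_checks += 1      # no improvement this check
        return self.bad_checks >= self.patience
```

Call `should_stop(...)` after each validation pass and break out of the epoch loop when it returns True.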

mariembenslama commented 5 years ago

Alright! thank you very much! :)

mariembenslama commented 5 years ago

Look, I just retried, and the first check (at iteration 100) gave: Test loss: 0.000512, accuracy: 0.907031

Holmeyoung commented 5 years ago

Did you pull the latest code and validate on the whole dataset?~

mariembenslama commented 5 years ago

Yes I did; I also switched it to my alphabet file and so on. The train root is the 10M set and the val root is the 1M set. The second check (at iteration 200) gave: Test loss: 0.000465, accuracy: 0.921875

Holmeyoung commented 5 years ago

If the real images are really the same as the val images, why is there so much difference?! For now, just wait for the val acc to reach 99%+ and then test again.

mariembenslama commented 5 years ago

You see, the training images are in .jpg format while the real-life ones are in .png. I just noticed this; would it affect the result?

Holmeyoung commented 5 years ago

Haha, it shouldn't affect the result. Are your real-life images really the same kind as your training images? I keep thinking they are not, because otherwise the model shouldn't do so badly on real-life images.
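The file format indeed shouldn't matter: once decoded, both PNG and JPEG become the same kind of pixel array (JPEG just adds small lossy artifacts). A sketch of CRNN-style preprocessing that erases the difference, assuming Pillow is available; the 100×32 size matches the usual imgW×imgH defaults, and the function name is illustrative:

```python
import io
from PIL import Image

def preprocess(path_or_file, width=100, height=32):
    """Decode any supported format, force grayscale, resize, scale to [0, 1]."""
    img = Image.open(path_or_file).convert("L")        # grayscale regardless of source format
    img = img.resize((width, height), Image.BILINEAR)
    return [p / 255.0 for p in img.getdata()]

# Example: an in-memory PNG decodes to 100 * 32 = 3200 normalized values.
buf = io.BytesIO()
Image.new("L", (50, 20), 128).save(buf, format="PNG")
buf.seek(0)
vec = preprocess(buf)
```

Saving the same picture as .jpg instead would yield an almost identical vector, which is why the model can't tell the formats apart.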

mariembenslama commented 5 years ago

They are as you saw them (I sent them to you before). Hmm, I guess I should re-read the code; probably something is wrong on my end.

mariembenslama commented 5 years ago

And I'll contact you again, thanks, but just wait xD

mariembenslama commented 5 years ago

I restarted the training and it seems stuck before the first 1000 iterations; it doesn't show the val test results (??). Is it because my PC is slow?

Holmeyoung commented 5 years ago

No! It's because it validates the model on the whole val data every 1000 iterations.

mariembenslama commented 5 years ago

You mean that after every 1000 training batches it validates on the whole val data? But does it change anything whether it tests the whole val data or just a small sample?

Holmeyoung commented 5 years ago
if __name__ == "__main__":
    for epoch in range(params.nepoch):
        train_iter = iter(train_loader)
        i = 0
        while i < len(train_loader):
            cost = train(crnn, criterion, optimizer, train_iter)
            loss_avg.add(cost)
            i += 1

            if i % params.displayInterval == 0:
                print('[%d/%d][%d/%d] Loss: %f' %
                      (epoch, params.nepoch, i, len(train_loader), loss_avg.val()))
                loss_avg.reset()

            if i % params.valInterval == 0:
                val(crnn, criterion)

            # do checkpointing
            if i % params.saveInterval == 0:
                torch.save(crnn.state_dict(), '{0}/netCRNN_{1}_{2}.pth'.format(params.expr_dir, epoch, i))

params.valInterval = 1000. The whole val data represents the real data better, and the accuracy on the whole val data reflects the model's performance better.