Bartzi / see

Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"
GNU General Public License v3.0

Unable to Train (SVHN and FSNS) #51

Open Devadattaprasad opened 5 years ago

Devadattaprasad commented 5 years ago

Hi Bartzi,

I want to run train_svhn.py and have prepared the dataset as per the instructions given in the README. As per #20, training was failing because of an alignment issue in the gt file, so I made the necessary changes to the gt file, as shown in the attached image.

I tried training with both train_svhn.py and train_fsns.py; both remain stuck at the 1st iteration (screenshot attached for reference).

I am not able to understand why training is not continuing.

Argument list attached as an image for reference.

Kindly help me to resolve the issue. Thanks in advance

Bartzi commented 5 years ago

I think I know what the problem is. You can try to change this and the following line to use `MultithreadIterator` or `SerialIterator`.

I also had this problem on some machines and datasets; something seems to be wrong with the inter-process communication.
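For reference, a minimal sketch of that swap, assuming a typical Chainer setup; `train_dataset` and `batch_size` here are placeholders, not the actual names from `train_svhn.py`:

```python
import chainer
import numpy as np

# Placeholder dataset standing in for the real SVHN/FSNS training data.
train_dataset = chainer.datasets.TupleDataset(
    np.random.rand(100, 3, 64, 64).astype(np.float32),
    np.random.randint(0, 10, size=100).astype(np.int32),
)
batch_size = 32

# MultiprocessIterator can hang at the first iteration on some machines,
# apparently due to inter-process communication problems:
# train_iterator = chainer.iterators.MultiprocessIterator(train_dataset, batch_size)

# Thread-based replacement (no separate worker processes involved):
train_iterator = chainer.iterators.MultithreadIterator(train_dataset, batch_size)

# Single-threaded fallback if threading also causes trouble (slower, but robust):
# train_iterator = chainer.iterators.SerialIterator(train_dataset, batch_size)
```

`SerialIterator` loads batches in the main process, so it is the safest option for ruling out iterator problems entirely.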

Devadattaprasad commented 5 years ago

@Bartzi, thank you. After changing `chainer.iterators.MultiprocessIterator` to `chainer.iterators.MultithreadIterator`, the above issue got resolved.

I want to prepare my own character map. Kindly let me know the process of creating my own `character_map`.

Thank you in advance. :-)

Bartzi commented 5 years ago

Creating a new character_map is actually quite easy. Have a look at this comment, where I explain the idea behind the char map. This should give you all the information you need to create a char map on your own =)
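As a rough illustration only (not the authoritative format; the exact conventions, including which index and codepoint to use for the blank label, are explained in the linked comment), a char map can be built as a JSON object mapping class indices to Unicode codepoints:

```python
import json

# Characters the new model should predict (illustrative alphabet).
alphabet = "0123456789abcdefghijklmnopqrstuvwxyz"

# Assumed format, mirroring the char maps shipped with the repo: a JSON
# object mapping each class index (as a string) to the Unicode codepoint
# of its character. One class is reserved for the blank/padding label.
char_map = {"0": ord(" ")}  # blank/padding label (placeholder codepoint)
for index, char in enumerate(alphabet, start=1):
    char_map[str(index)] = ord(char)

with open("my_char_map.json", "w") as handle:
    json.dump(char_map, handle, indent=4)
```

The network then predicts class indices, and the char map is used to turn them back into readable characters.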

Devadattaprasad commented 5 years ago

Thank you, Bartzi, for the quick reply.

I have a few doubts about running the evaluation for FSNS.

  1. What is the reason for defining timesteps? As per my understanding, the timesteps variable defines how many words need to be predicted, i.e. with timesteps = 1 it will predict 1 word, and so on. Please correct me if I am wrong. I ran the evaluation code with the parameters shown in the attached image.

Expected result: 4 bounding boxes and the predicted words (see attached image).

But I got 4 bounding boxes: 3 boxes drawn on words and a 4th box overlapping the 3rd one, with only 3 words predicted, even though there are more than 3 words in the image.

  2. What is the reason for limiting timesteps to between 1 and 6? If I set timesteps to a value greater than 6, it throws the error shown in the attached image.

  3. I want to understand whether localization is done over the complete image or only over a specific part of it. In the above result there are more than 3 words, but the model only predicted the 3 words in one specific part and ignored the remaining words.

  4. Assume a real-world scenario where we do not know how many words are in the scene. How would you handle such scenarios?

  5. What if the scene does not have any text at all?

Please correct me, if my understanding is wrong.

Bartzi commented 5 years ago

Alright,

  1. We need to define a number of time steps because we do not know how many text regions are in the image. If we set the number of time steps to 4, we can localize a maximum of 4 text regions; if there are fewer regions, we hope that the network has learned to recognize this and predicts nothing for the remaining time steps. Let's talk about the example you showed. The model is trained to recognize the text on street name signs. During training, the network learns to ignore text that is not close to the center of the image, because such text is just noise; this is why the model does not localize those text regions. It draws a fourth box because it was run for 4 time steps, but it found out that there are only three words on the sign (which you can also see in the prediction result), so it does not actually matter where the fourth box is: you could very easily prune such a result from the end result (see the sketch after this list).
  2. The reason for limiting the number of timesteps to 6 is twofold:
    • the FSNS dataset does not contain any images with more than 6 words, so it does not make sense to train a model on a task more difficult than it needs to be
    • it is quite difficult to train a model with more time steps, so managing 6 is already very challenging (we actually did not train a model with 6 time steps and simply accepted that the few images containing 6 words would always be predicted wrongly)
  3. Localization is done over the complete image, but the network decides by itself what to focus on. For the FSNS dataset this is learned behaviour: the model learns to ignore the surrounding noise.
  4. In a real-world scenario we have to set a reasonable maximum number of text regions and hope that this covers most cases. This is the way it is right now; further research is necessary to improve this!
  5. If the scene does not have any text at all, the model will predict bounding boxes at random locations and, in the best case, predict no text at all. By looking at the predicted text you know whether there is text or not.
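To illustrate the pruning mentioned in points 1 and 5, here is a minimal sketch; the data layout is hypothetical, and the real output format of the evaluation script will differ:

```python
# Hypothetical prediction output: one (bounding box, text) pair per time step.
predictions = [
    {"bbox": (10, 20, 80, 40), "text": "rue"},
    {"bbox": (10, 45, 90, 65), "text": "des"},
    {"bbox": (10, 70, 120, 90), "text": "fleurs"},
    {"bbox": (11, 71, 119, 89), "text": ""},  # unused 4th time step
]

# Keep only time steps where the model actually predicted text; if the
# filtered list is empty, the image most likely contains no text at all.
pruned = [p for p in predictions if p["text"].strip()]
```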