Bartzi / see

Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"
GNU General Public License v3.0

Training SEE on ICDAR Born Digital Dataset #25

Open rohit12 opened 6 years ago

rohit12 commented 6 years ago

I have a few queries while training SEE on the Born Digital dataset, which basically consists of flyers and digitally created advertisements.

  1. How do I verify whether training is going correctly? What method did you use for this?
  2. Since there are multiple GTs in a single image, how do I ensure that the network correctly associates each GT with the corresponding detection?
  3. No model file is generated in the logs/ folder. Do you have any idea why? I have 410 images for training and my batch size is 32.
  4. Related to the previous query, is a dataset of 410 images too small given the number of parameters of the network?
saq1410 commented 6 years ago

@rohit12 the reason your model file is not generated is that you may not have set the --snapshot-interval argument. By default it is set to 20000, so a snapshot will only be generated once your script reaches a total of 20000 iterations.
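For example (a hypothetical invocation; substitute your actual training script and arguments, only --snapshot-interval is the flag in question):

```sh
# snapshot every 500 iterations instead of the 20000 default
python3 train_text_recognition.py <your other arguments> --snapshot-interval 500
```

With 410 images and a batch size of 32, one epoch is only about 13 iterations, so the default interval would take a very long time to reach.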

Bartzi commented 6 years ago

Hi,

  1. You can verify that training is going correctly by having a look at the images located in the bboxes folder in your log_dir. Those images show the predictions of the network on a test image; if those predictions improve over time, training seems to work.
  2. The only way to ensure this right now is to order the GT in a consistent way for each image; otherwise, you will need to find a loss that can work with random alignment. (We always forced the GT to be ordered from left to right and top to bottom; see the sketch after this list.)
  3. @saq1410 you are right, that is the problem here
  4. I think 410 images is way too small. There are too many parameters that need to be optimized, and the task is not easy at all, so it will be more than difficult for the network to learn. Your network will also overfit heavily on such a limited dataset. Maybe you can find a way to generate similar-looking data...
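
A minimal sketch of the ordering from point 2 (not code from this repo; the (x, y, w, h) box format and the row_height tolerance are assumptions):

```python
# Sort GT word boxes top-to-bottom, then left-to-right, so that the order
# is consistent across all images. Each box is assumed to be an
# (x, y, w, h) tuple; row_height controls how much vertical jitter still
# counts as the same text line.
def sort_ground_truth(boxes, row_height=20):
    return sorted(boxes, key=lambda box: (box[1] // row_height, box[0]))
```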
saharudra commented 6 years ago

Hi Christian,

We are facing a few other problems with the Born Digital dataset.

1) While creating the video, we are facing the following error:

```
/src/datasets/BornDigital/logs_new/2018-04-17T02:31:56.662657_training/boxes$ python3 ../../../../../see/utils/create_video.py ./ ./video.mp4
loading images
sort and cut images
creating temp file
convert -quality 100 @/tmp/tmpp5m2rc0n /tmp/tmp65e4ijjz/1000.mpeg
Killed
Traceback (most recent call last):
  File "../../../../../see/utils/create_video.py", line 109, in <module>
    make_video(args.image_dir, args.dest_file, batch_size=args.batch_size, start=args.start, end=args.end, pattern=args.pattern)
  File "../../../../../see/utils/create_video.py", line 56, in make_video
    temp_file = create_video(i, temp_file, video_dir)
  File "../../../../../see/utils/create_video.py", line 92, in create_video
    subprocess.run(' '.join(process_args), shell=True, check=True)
  File "/usr/lib/python3.5/subprocess.py", line 708, in run
    output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command 'convert -quality 100 @/tmp/tmpp5m2rc0n /tmp/tmp65e4ijjz/1000.mpeg' returned non-zero exit status 137
```

2) How do we interpret the images stored in the boxes folder of the logs? For the Born Digital dataset, the following are a few examples from different points during training.

[attached: 1.png, 10.png, 100.png, 500.png, 1250.png — bbox visualization images from different points during training]

Are these images visualizations of which region of the input image the current layer is focusing on, with the first one being the focus of the output layer?

3) Do you have any suggestions for some other ground-truth format? We want to look into the Google 1000 dataset, but converting it to the format you have used in the code seems to be a rather time-consuming task.

Bartzi commented 6 years ago

alright:

  1. I think it's not working because the images could be too large (in width and/or height) to fit into a video container; exit status 137 usually means the convert process was killed for using too much memory. You could set the keyword argument render_extracted_rois to False in the part of the code that creates the BBOXPlotter object (in the train_.. file you are using); a sketch follows after this list. This will create smaller images. See the next bullet point for an explanation of what I mean by that.
  2. The images have to be interpreted in the following way:
    • the top-left image shows the input image with the predicted bboxes on it
    • all the other images in the top row show the individual region crops that have been extracted from the original input image at the locations of the predicted bboxes (once you set render_extracted_rois to False, these images will not be rendered anymore)
    • the bottom row shows the output of visual backprop for the corresponding image in the top row.
  3. You can choose the groundtruth format any way you like! You will just need to create a new dataset object for it and use it instead of the ones I created. In this object you can parse your groundtruth and supply it to the network as a numpy array; a minimal sketch follows after this list.
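
For point 1, roughly like this (the other constructor arguments are placeholders that depend on the train_.. script you use; render_extracted_rois is the relevant switch):

```python
# Illustrative only: argument names other than render_extracted_rois are
# placeholders for whatever your train_.. script already passes in.
bbox_plotter = BBOXPlotter(
    test_image,
    os.path.join(log_dir, 'boxes'),
    image_size,
    render_extracted_rois=False,  # skip the per-region crops -> much smaller images
)
```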
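For point 3, a minimal sketch of such a dataset object, assuming a simple JSON groundtruth format (the file layout and field names are just examples, not something the code prescribes):

```python
import json

import chainer
import numpy as np
from PIL import Image


class CustomGroundtruthDataset(chainer.dataset.DatasetMixin):
    """Parses a custom GT file and returns numpy arrays for the network."""

    def __init__(self, gt_file, image_size):
        # assumed format: [{"file": "img_1.png", "labels": [3, 14, ...]}, ...]
        with open(gt_file) as handle:
            self.entries = json.load(handle)
        self.image_size = image_size  # (width, height)

    def __len__(self):
        return len(self.entries)

    def get_example(self, i):
        entry = self.entries[i]
        image = Image.open(entry["file"]).convert("RGB").resize(self.image_size)
        # channel-first float image in [0, 1], as Chainer models expect
        image = np.asarray(image, dtype=np.float32).transpose(2, 0, 1) / 255.0
        labels = np.asarray(entry["labels"], dtype=np.int32)
        return image, labels
```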

The images you posted seem to show that your network is hardly learning anything right now. I'd advise you to take a curriculum approach: start with easy samples (samples with few words) first and then increase the difficulty, otherwise it might not converge.
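
As a rough sketch of what such a curriculum filter could look like (assuming each GT entry carries a list of its words; the field name is made up):

```python
# Hypothetical curriculum stage: keep only samples with at most
# `max_words` words, then raise the threshold in later training stages.
def curriculum_subset(entries, max_words):
    return [entry for entry in entries if len(entry["words"]) <= max_words]
```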