emedvedev / attention-ocr

A Tensorflow model for text recognition (CNN + seq2seq with visual attention) available as a Python package and compatible with Google Cloud ML Engine.
MIT License

ValueError: '\xe2' is not in list when I try to train with the --color parameter #110

Closed kulkarnivishal closed 5 years ago

kulkarnivishal commented 6 years ago

Hi @emedvedev

I get this error after the first few steps when I train with the --color parameter (channels set to 3). Could you please help? Please find the error log below: [screenshot attached]

emedvedev commented 6 years ago

This error means that your image labels have a character that's not in the model charmap. Check your labels, because by default the supported charset is numbers + uppercase only: https://github.com/emedvedev/attention-ocr/blob/7bb17af211de60e0fff5d56925f73d9018def744/aocr/util/data_gen.py#L23.
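A quick way to spot such labels before training is to diff each label against the supported charset. This is a standalone sketch, not aocr's code; the charset below mirrors the default mentioned above (digits + uppercase), so adjust it to match your actual CHARMAP:

```python
import string

# Assumed default charset: digits + uppercase ASCII only.
SUPPORTED = set(string.digits + string.ascii_uppercase)

def unsupported_chars(label):
    """Return the characters in `label` that the charmap would reject."""
    return sorted(set(label) - SUPPORTED)

print(unsupported_chars("CAFÉ 123"))  # the space and 'É' are not in the charset
```

Running this over all your label files will tell you exactly which characters trigger the ValueError.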

kulkarnivishal commented 6 years ago

Thank you for the prompt response. I used the --full-ascii and --no-force-uppercase flags as well; I assumed --full-ascii covers all characters. Am I missing something?

emedvedev commented 6 years ago

--full-ascii covers the ASCII range (uppercase, lowercase, symbols, etc.). There's no flag for covering the entire Unicode range, so you'd have to manually modify the CHARMAP (at the line I've mentioned before). Not sure about the performance in that case, though; you might have to consider modifying the dataset labels instead, unless Unicode plays a significant part in them.
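To illustrate why an out-of-charmap character raises exactly this ValueError, here is a hypothetical charmap in the same shape aocr uses (a list where a character's index is its label id; the padding entries and names here are illustrative, not aocr's actual internals). Extending it for Unicode amounts to appending the characters you need:

```python
# Hypothetical charmap: first entries reserved for special tokens,
# then the default digits + uppercase range.
CHARMAP = ['', '', ''] + list('0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ')
CHARMAP += list('éüñ')  # extra Unicode characters your labels need

def encode(label):
    # list.index raises the "'\xe2' is not in list" ValueError from
    # this issue for any character missing from the charmap.
    return [CHARMAP.index(c) for c in label]

print(encode('0A'))  # [3, 13]
```

With 'é' appended as above, `encode('é')` succeeds instead of raising; without it, you get the same ValueError as in the original report.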

kulkarnivishal commented 6 years ago

Thank you for the reply. I am able to train the model now. However, I'm seeing a very weird issue. I added synthetic images (generated using GANs) to the training data, plus a few cropped COCO images: about 1M synthetic images, Synth90k, and about 60k COCO images. The training loss is improving, but when I test the model for prediction it performs very poorly; it just prints "cccc" or "aaaaa", etc. Am I doing something wrong?

Here's the training log: [screenshot attached]

emedvedev commented 6 years ago

Everything looks correct, so I wouldn't know. Really depends on your dataset, separation of training/testing data, and a whole bunch of other factors.

This might, of course, be a fault in the code or the model itself. In that case, once you pin the issue, please submit a detailed report or a PR—that'd be much appreciated if the aocr code is indeed the problem.

aosetrov commented 6 years ago

Hi @kulkarnivishal, did you manage to solve this problem? If so, please share a hint.

kulkarnivishal commented 6 years ago

Not really. I continued training for 3 more weeks and the results look better, though still not great. The main issue I'm facing is predicting symbols: no matter how I train, inference always gets them wrong.

aosetrov commented 6 years ago

You can try to manually change the dictionary instead of using the flags. At the same time, I have no idea how you managed to add non-ASCII-labeled targets to the train-set .tfrecords file. When I tried to do this, it threw an encoding error every time.
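One common cause of such encoding errors is passing a Python str where the TFRecord feature expects bytes. A generic workaround (this is a sketch, not aocr's dataset-building code) is to encode each label as UTF-8 before writing it, and decode it the same way when reading:

```python
def label_to_bytes(label):
    # TFRecord features (e.g. tf.train.BytesList) carry bytes, not str;
    # UTF-8 round-trips any Unicode label safely.
    return label.encode("utf-8")

raw = label_to_bytes("naïve")
assert raw.decode("utf-8") == "naïve"
```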

kulkarnivishal commented 6 years ago

You mean manually adding symbols instead of using the --full-ascii flag? And I used Python's string.printable to filter out non-ASCII characters.
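The filtering approach mentioned above can be sketched like this: drop any label containing a character outside `string.printable`, so everything that reaches the .tfrecords file is plain ASCII. The label list here is just example data:

```python
import string

PRINTABLE = set(string.printable)

def is_ascii_printable(label):
    """True if every character of `label` is in string.printable."""
    return all(c in PRINTABLE for c in label)

labels = ["HELLO-123", "naïve", "50%"]
kept = [l for l in labels if is_ascii_printable(l)]
print(kept)  # ['HELLO-123', '50%']
```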