Bartzi / see

Code for the AAAI 2018 publication "SEE: Towards Semi-Supervised End-to-End Scene Text Recognition"
GNU General Public License v3.0

Is there any pre-trained model that can recognize both letters and numbers if they appear in a localized image? #2

Open arsalan993 opened 6 years ago

Bartzi commented 6 years ago

The FSNS model should technically be able to recognize digits and characters appearing in an image.

But what exactly do you mean by 'localized' image? Do you mean a line of text that has already been cropped from the image?

arsalan993 commented 6 years ago

Yes, exactly: "a line of text that has already been cropped from the image".

Bartzi commented 6 years ago

You won't find any of this in this repository, but the STN-OCR repository contains code for such a case and also a link to a model. You are welcome to check that code and the model.

With the code in this repository it should also be easy to train such a model. You should run the LSTM in the localization part for a given number of timesteps (let's say 23). The localization part will then produce bounding boxes for each character/digit individually and crop them from the input image. Each of these character images can then be recognized by the recognition part of the network. So the only thing you need to take care of is the ground truth file; other than that, you should be able to use the network as is.

It might be a good idea to use CTC as loss, but that would be up to you ;)
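To make the data flow a little more concrete, here is a rough, simplified sketch of that idea in Chainer. It is not the actual network code from this repository; the class name `CharByCharRecognizer`, the layer sizes, and the two `Linear` stand-ins for the convolutional feature extractors are hypothetical placeholders.

```python
import chainer
import chainer.functions as F
import chainer.links as L


class CharByCharRecognizer(chainer.Chain):
    """Sketch: localize up to `num_timesteps` characters, crop each one with a
    spatial transformer, and classify every crop individually."""

    def __init__(self, num_timesteps=23, num_classes=37, crop_size=(50, 50)):
        super().__init__()
        self.num_timesteps = num_timesteps            # max number of characters to find
        self.crop_size = crop_size
        with self.init_scope():
            self.loc_features = L.Linear(None, 256)   # stand-in for the localization CNN
            self.loc_lstm = L.LSTM(256, 256)          # one step per character region
            self.loc_transform = L.Linear(256, 6)     # predicts a 2x3 affine matrix
            self.rec_features = L.Linear(None, 256)   # stand-in for the recognition CNN
            self.classifier = L.Linear(256, num_classes)

    def __call__(self, images):
        self.loc_lstm.reset_state()
        features = F.relu(self.loc_features(images))
        char_logits = []
        for _ in range(self.num_timesteps):
            hidden = self.loc_lstm(features)
            theta = F.reshape(self.loc_transform(hidden), (-1, 2, 3))
            grid = F.spatial_transformer_grid(theta, self.crop_size)
            crop = F.spatial_transformer_sampler(images, grid)   # one character crop
            char_logits.append(self.classifier(F.relu(self.rec_features(crop))))
        return char_logits  # one logit vector per predicted character
```

The important part is the loop: one LSTM step per character, one predicted affine transform, one crop, one classification.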

arsalan993 commented 6 years ago

So if I input a cropped image that contains text, for example this sample image (ignore the wide space), will it be able to read the text from the image?

Bartzi commented 6 years ago

Yes, you can use this network architecture and code to train a model that can do so.

arsalan993 commented 6 years ago

OK, where is your demo code, in which I can use your pre-trained model and then perform end-to-end text recognition on a given image?

Bartzi commented 6 years ago

There is no such code in this repository, but the other repository contains such code.

You could also use the code in this repository to train such a model. But this will definitely take more time.

arsalan993 commented 6 years ago

I am not interested in training, Sir. I want to use your pre-trained model to read text from an image. So I am asking: is there any code in which I can use the FSNS pre-trained model you have provided here and use it to read the text in an image? I am not looking for training or evaluation files; I am looking for a demo or test file.

Bartzi commented 6 years ago

I don't have anything like that. But you can have a look at the evaluation files and create a python script that does something like this for you. Should not be too difficult :wink:.

arsalan993 commented 6 years ago

Yes, I will, and I will share it with you.

arsalan993 commented 6 years ago

What are the accuracy and precision of your pre-trained model?

Bartzi commented 6 years ago

Sequence accuracy is around 78% on the FSNS dataset, but this is hardly comparable to what you are trying to achieve.
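Just to clarify what that number means: sequence accuracy only counts a sample as correct if the whole predicted character sequence matches the ground truth exactly. A minimal illustration (a hypothetical helper, not code from this repository):

```python
# Minimal illustration of sequence accuracy: a sample only counts as correct
# if the entire predicted string equals the ground-truth string.
def sequence_accuracy(predictions, ground_truths):
    correct = sum(1 for p, g in zip(predictions, ground_truths) if p == g)
    return correct / len(ground_truths)

print(sequence_accuracy(["pen", "cat"], ["pen", "car"]))  # 0.5
```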

arsalan993 commented 6 years ago

Yes, I am trying to read non-standard license plates. I have developed a model to localize the license plate in an image and am able to perform character localization and segmentation on it. What's left is character recognition. Let's see.

arsalan993 commented 6 years ago

Hi there. I am facing an issue while implementing the demo file: since there are no comments for the code in the "evaluation/evaluation.py" file, it is becoming difficult for me to understand which part of the function takes which type of input and produces which type of output. I request that you implement and upload a demo file if you can.

Bartzi commented 6 years ago

Hmm, I think I could do this, but you'd have to wait until next week... I know that the evaluation code looks difficult, but it is actually quite easy. First, we re-establish the network definition. This is why we save the python files containing the network definition. Second, we load the model from the given npz file. Then stuff happens that is not necessary for just a demo.

But the function evaluate contains the most interesting part.

  1. We open an image and bring the labels into the correct format.
  2. We do a forward pass.
  3. We calculate the accuracy (also not of interest for you; only the decoding of network outputs to words might be interesting).

Maybe that makes it easier.
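Put together, a minimal demo script along those lines could look roughly like this. The class name `FSNSNetwork`, its constructor, the layout of the network output, and the char-map format are placeholders here, not the repository's actual API; the real names live in the saved network definition file (e.g. fsns.py) and the char map JSON shipped with the model.

```python
# Hedged sketch of a minimal demo script following the steps above.
import importlib.util
import json

import chainer
import numpy as np
from PIL import Image


def load_network(definition_file, snapshot_file):
    # 1. re-establish the network definition from the saved python file
    spec = importlib.util.spec_from_file_location("net_definition", definition_file)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    network = module.FSNSNetwork()                 # hypothetical class name
    # 2. load the trained weights from the given npz file
    chainer.serializers.load_npz(snapshot_file, network)
    return network


def recognize(network, image_file, char_map_file, input_size=(600, 150)):
    # 3. open the image and bring it into the expected format (600x150, CHW, float32)
    image = Image.open(image_file).convert("RGB").resize(input_size)
    array = np.asarray(image, dtype=np.float32).transpose(2, 0, 1) / 255.0
    # 4. forward pass
    with chainer.using_config("train", False):
        predictions = network(array[np.newaxis, ...])
    # 5. decode network outputs to characters via the char map
    with open(char_map_file) as handle:
        char_map = json.load(handle)               # assumed: class index -> character
    # assumed: the network returns one Variable of per-timestep class scores
    classes = np.argmax(predictions.array, axis=-1).ravel()
    return "".join(char_map.get(str(int(c)), "") for c in classes)
```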

arsalan993 commented 6 years ago

While I try it as well, I really want you to upload the demo code that reads the input image file and prints the text output and its bounding boxes. Thanks, bro.

Bartzi commented 6 years ago

Alright, I've created a small demo script for you. You can find it here.

I've also added usage information to the README here.

Hope it helps.

arsalan993 commented 6 years ago

First of all, I would like to thank you for taking the time to help me. I have an issue while running this demo file. I have set up this repository inside a virtualenv and I am using Python 3.5. I placed the model inside the "chainer/models/" folder, and when I run this command: `python3 fsns_demo.py models/model/ model_35000.npz test/test_aa.jpeg ../datasets/fsns/fsns_char_map.json`

I am getting this error:

```
Traceback (most recent call last):
  File "fsns_demo.py", line 153, in <module>
    predictions, crops, grids = network(image[xp.newaxis, ...])
  File "/home/test/Desktop/see/chainer/models/model/fsns.py", line 516, in __call__
    images = F.reshape(images, (batch_size, num_channels, height, 4, -1))
  File "/home/test/Desktop/see/SEE/lib/python3.5/site-packages/chainer/functions/array/reshape.py", line 98, in reshape
    y, = Reshape(shape).apply((x,))
  File "/home/test/Desktop/see/SEE/lib/python3.5/site-packages/chainer/function_node.py", line 230, in apply
    self._check_data_type_forward(in_data)
  File "/home/test/Desktop/see/SEE/lib/python3.5/site-packages/chainer/function_node.py", line 298, in _check_data_type_forward
    self.check_type_forward(in_type)
  File "/home/test/Desktop/see/SEE/lib/python3.5/site-packages/chainer/functions/array/reshape.py", line 40, in check_type_forward
    type_check.prod(x_type.shape) % size_var == 0)
  File "/home/test/Desktop/see/SEE/lib/python3.5/site-packages/chainer/utils/type_check.py", line 524, in expect
    expr.expect()
  File "/home/test/Desktop/see/SEE/lib/python3.5/site-packages/chainer/utils/type_check.py", line 482, in expect
    '{0} {1} {2}'.format(left, self.inv, right))
chainer.utils.type_check.InvalidType: Invalid operation is performed in: Reshape (Forward)

Expect: prod(in_types[0].shape) % known_size(=2004) == 0
Actual: 1503 != 0
```

The numbers in this line of the output, "Actual: 1503 != 0", change when I change the image file.

Tell me where I am getting this wrong. I also have one more request: could you kindly upload your own test files as well?

arsalan993 commented 6 years ago

These are some of the test images I am using: aaaa, logo2, test_10, test_aa.

Bartzi commented 6 years ago

Alright, I see your problem. You get this error because the input size of your images is not correct; it should be 600x150. But I suggest that you do not use this script anymore, because it won't do any good for your case (which is what I tried to say earlier, too).
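If you still want to try fsns_demo.py on your own images, resizing them to 600x150 beforehand should avoid that reshape error; a small sketch with Pillow (the demo script itself is not assumed to do this resizing for you):

```python
# Resize an arbitrary test image to the 600x150 pixels the FSNS model expects.
from PIL import Image

image = Image.open("test/test_aa.jpeg").convert("RGB")
image = image.resize((600, 150))   # (width, height) expected by the FSNS model
image.save("test/test_aa_600x150.jpeg")
```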

Luckily, I also did some experiments with this code on images like the ones you show here, and I decided to add that code to the repository as well, although it is not mentioned in the paper. So I've done the following things:

  1. I added training and evaluation code for models tailored to text recognition.
  2. I added a pre-trained model for text recognition to the model page (here).
  3. I also added a demo script, text_recognition_demo.py, that works exactly the same way as the fsns_demo.py script. You should use this script for experimenting with images like that.

Basically, the only things that changed are the way the input is handled and the model I trained. This should be what you are looking for.

Please note: this model is far from perfect and not very close to the state of the art, but it is a good starting point.

lmolhw5252 commented 6 years ago

@Bartzi Hi, I have a question about your paper: it does not say how to set the number N used in the STN. I have read the recurrent STN paper; that paper uses 3 characters in each image, so it sets N to 3. I hope you can resolve my problem. Thank you very much.

Bartzi commented 6 years ago

N specifies the maximum number of text regions you want to localise in your input image. If you take the FSNS dataset, for instance, an image contains at most 6 text regions, so it would be a reasonable choice to set N to 6.

If you want to do text recognition, as discussed in this thread, you will need a higher N (in all our experiments we used 23), because here we want the model to detect each individual character and then recognize each one individually.

But in the end it highly depends on your data.
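One hypothetical way to choose N from your own data is to take the maximum number of labelled regions over all training samples. The tab-separated ground-truth layout in this sketch is an assumption for illustration, not the repository's actual format:

```python
# Sketch: derive N from a (hypothetical) tab-separated ground-truth file whose
# first column is the image path and whose remaining columns are the labels.
def choose_n(groundtruth_file):
    max_regions = 0
    with open(groundtruth_file) as handle:
        for line in handle:
            labels = line.rstrip("\n").split("\t")[1:]
            max_regions = max(max_regions, len(labels))
    return max_regions

# choose_n("train.tsv") would give e.g. 6 for FSNS-like data,
# or up to 23 for character-level text recognition.
```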

abhinavg4 commented 6 years ago

Hey @Bartzi, I don't understand this point of yours: "Please note: this model is far from perfect and not very close to the state of the art, but it is a good starting point."

As per the STN-OCR paper, your text detection models were able to produce SOTA results. Is it a lack of training that is preventing your uploaded model from being SOTA, or is it the network architecture itself?

Bartzi commented 6 years ago

Yes, you are right! The model you can find for the STN-OCR code (MXNet) is able to produce SOTA results, but the model we trained with Chainer is not, yet. I think it needs more training and then it should be able to produce similar results.

abhinavg4 commented 6 years ago

But the architecture of both the MXNet model and the one you trained using Chainer is the same, right? So, in case I have a dataset of cropped (single-line) text images of my own, is it better to use MXNet or Chainer, given that I'll be doing the training from scratch myself?

Bartzi commented 6 years ago

Good question :sweat_smile:. The network architecture is the same. You should be able to get the same results with either codebase. So take what you like more.

ghost commented 6 years ago

Hello, I have an issue while running the demo file text_recognition_demo.py. I placed the model inside the "/home/touma/Téléchargements/text_recognition_model/model" folder, and when I run this command:

`python3 text_recognition_demo.py /home/touma/Téléchargements/text_recognition_model/model model_190000.npz /home/touma/Téléchargements/1014571803205.jpg ../datasets/textrec/ctc_char_map.json`

I get this output:

OrderedDict([('pen', [OrderedDict([('bottom_right', (80.65857696533203, 59.958003997802734)), ('top_left', (0.0, 14.809307098388672))]), OrderedDict([('bottom_right', (123.02896881103516, 57.95528793334961)), ('top_left', (33.24436569213867, 14.69124984741211))]), OrderedDict([('bottom_right', (167.39324951171875, 56.29482650756836)), ('top_left', (77.43285369873047, 13.069282531738281))]), OrderedDict([('bottom_right', (200.0, 56.786651611328125)), ('top_left', (114.5182113647461, 13.71943473815918))]), OrderedDict([('bottom_right', (200.0, 58.5263786315918)), ('top_left', (142.25265502929688, 15.275615692138672))]), OrderedDict([('bottom_right', (200.0, 58.94649124145508)), ('top_left', (164.47166442871094, 18.17572021484375))]), OrderedDict([('bottom_right', (200.0, 58.18182373046875)), ('top_left', (183.49403381347656, 21.094255447387695))]), OrderedDict([('bottom_right', (200.0, 60.36030578613281)), ('top_left', (197.76402282714844, 26.316858291625977))]), OrderedDict([('bottom_right', (200.0, 61.74567794799805)), ('top_left', (200.0, 29.776023864746094))]), OrderedDict([('bottom_right', (200.0, 62.451873779296875)), ('top_left', (200.0, 31.797035217285156))]), OrderedDict([('bottom_right', (200.0, 62.80547332763672)), ('top_left', (200.0, 32.96306228637695))]), OrderedDict([('bottom_right', (200.0, 63.00432205200195)), ('top_left', (200.0, 33.67278289794922))]), OrderedDict([('bottom_right', (200.0, 63.13374328613281)), ('top_left', (200.0, 34.12884521484375))]), OrderedDict([('bottom_right', (200.0, 63.23081970214844)), ('top_left', (200.0, 34.43822479248047))]), OrderedDict([('bottom_right', (200.0, 63.31119918823242)), ('top_left', (200.0, 34.65858459472656))]), OrderedDict([('bottom_right', (200.0, 63.38094711303711)), ('top_left', (200.0, 34.82159423828125))]), OrderedDict([('bottom_right', (200.0, 63.44255065917969)), ('top_left', (200.0, 34.945762634277344))]), OrderedDict([('bottom_right', (200.0, 63.49717712402344)), ('top_left', (200.0, 35.04248809814453))]), OrderedDict([('bottom_right', (200.0, 63.54545974731445)), ('top_left', (200.0, 35.119083404541016))]), OrderedDict([('bottom_right', (200.0, 63.58790588378906)), ('top_left', (200.0, 35.18044662475586))]), OrderedDict([('bottom_right', (200.0, 63.625)), ('top_left', (200.0, 35.23004913330078))]), OrderedDict([('bottom_right', (200.0, 63.657230377197266)), ('top_left', (200.0, 35.270381927490234))]), OrderedDict([('bottom_right', (200.0, 63.68511199951172)), ('top_left', (200.0, 35.30332946777344))])])])

Please tell me what's wrong with it, thank you ^^

Bartzi commented 6 years ago

There is nothing wrong :smile:. You get the expected output from this script. Have a closer look at it and you will see that the output is a dict of dicts. The keys of the outer dict are the predicted words, and the values that belong to each key are the predicted bboxes.
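A tiny sketch of walking over that structure (the boxes' corner tuples are assumed to be (x, y) pairs):

```python
# Iterate the demo output: the outer dict maps each predicted word to a list
# of per-character boxes with 'top_left' and 'bottom_right' corners.
def print_predictions(predictions):
    for word, boxes in predictions.items():
        print("predicted word:", word)
        for box in boxes:
            print("  character box:", box["top_left"], "to", box["bottom_right"])
```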

ghost commented 6 years ago

Hi, well, thanks, I understand; I will look into that. But I need the output of the text in the image; how do I get that? Which script do I need?

Bartzi commented 6 years ago

Hmm, it will be difficult for this kind of example. First, you did not use the right model for this: the text_recognition model only works on already cropped text lines. Second, we never trained a network on samples like that. The task of detecting and recognizing text from such samples is very difficult for the network, and we have not yet been able to solve this problem. Performing well on samples like this definitely needs more research.

ghost commented 6 years ago

aah ok, thank you 🙂

harshalcse commented 5 years ago

Hi @Bartzi, can we train the model on engraved or embossed metal plates containing VIN or chassis numbers?

Bartzi commented 5 years ago

Good question :sweat_smile:. It could work, if you have the corresponding data :wink: