how do I get 'lexicon.txt' ? I am trying to train the model from scratch

anidiatm41 commented 2 years ago

I am trying to train the model from scratch with custom images, need help with below:

how to prepare the input data step by step
how do I get 'lexicon.txt' ?

arxyzan commented 2 years ago

Hello anidiatm41, and thanks for reaching out. The lexicon.txt file comes with the Synth90k dataset. It's a text file that maps image paths to their corresponding text. If you're planning to train on Synth90k, just download the dataset and you're good to go. If you want to use your own dataset, you can either generate a lexicon.txt file in that format or, if you can't change your dataset structure, modify dataset/dataset.py so that it reads from your own files. I also have a private repo in which the image files are named after their corresponding text; you can use that method too.

Just to give you more insight, here's a brief explanation of what happens in dataset/dataset.py. Two properties, paths and texts, are initialized in __init__():

paths: a list of paths to all images in your dataset.
texts: a list of all texts corresponding to the paths in self.paths.

Both properties are computed in the _load_from_raw_files() method. You can do any processing on these values there, e.g. ignoring texts that contain invalid characters.
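As a rough illustration of the paths/texts idea above, here is a minimal sketch of a loader for a custom mapping file. The pipe-delimited "path | text" layout, the function name load_pairs, and the ALLOWED_CHARS alphabet are all assumptions made for illustration; none of them are taken from dataset/dataset.py.

```python
import string

# Hypothetical alphabet; use whatever characters your model is trained to output.
ALLOWED_CHARS = set(string.ascii_letters + string.digits)


def load_pairs(lines, allowed=ALLOWED_CHARS):
    """Build parallel `paths` and `texts` lists from 'path | text' lines.

    Lines without a separator and texts containing characters outside
    `allowed` are skipped.
    """
    paths, texts = [], []
    for line in lines:
        line = line.strip()
        if " | " not in line:
            continue  # blank or malformed line
        # Split at the first separator: left side is the path, right side the text.
        path, _, text = line.partition(" | ")
        if not text or any(ch not in allowed for ch in text):
            continue  # ignore texts with invalid characters
        paths.append(path)
        texts.append(text)
    return paths, texts


sample = [
    "images/0001.png | hello",
    "images/0002.png | w@rld",  # '@' is not in the alphabet, so this pair is dropped
    "images/0003.png | Time42",
]
paths, texts = load_pairs(sample)
print(paths)
print(texts)
```

The same filtering could live inside _load_from_raw_files(), as long as paths and texts stay the same length and in the same order.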

I hope this helps. Best regards.

anidiatm41 commented 2 years ago

Thanks Aryan, it was helpful.

Just to clarify my doubt, I have an excel of 10k crops containing two columns:

Column 1: Image_path (<path/abc.png>)

Column 2: Ground_Truth (<Exact text inside the image with space and special characters>)

path | gt

C:/Users/1234/crop/ABC 07 07 2020_page1.png | 8 05 75 824.46Cr
C:/Users/1234/crop/PQW 07 10 2020_page1.png | Time 11 42 23
C:/Users/1234/crop/XRE 08 10 2020_page1.png | Account No. 200000592
C:/Users/1234/crop/JKL 07 10 2020_page1.png | 1 00 00 00 000.00

Now I need to use this input for training. Should I feed this into dataset.py first and then start training?

Or could you help me with the steps?

Sorry for these silly questions as I am just learning this stuff.

arxyzan commented 2 years ago

Great. Your dataset structure is exactly the same as Synth90k. One important note, though: it seems that your texts contain spaces, and as far as I've tested, this implementation cannot handle spaces very well (see https://github.com/meijieru/crnn.pytorch/issues/99#issuecomment-373639504). The solution is to remove the spaces when each text is read (in _load_from_raw_files()) and to make sure that your detection model (e.g. CRAFT, EAST) detects text boxes so that each box contains a single word.

There is another possible approach at inference time, though. Suppose we are predicting an image containing the word "hello". The model outputs something like hh--eee--ll--lll--ooo-, and a CTC decoding algorithm then collapses the repeated characters and drops the blanks. My guess is that if your text contains spaces, there would be a longer run of blank characters in the model output at that position, like so: hh--eee--ll--lll--ooo------ww--ooo-rr--l-dd-. You might be able to work it out based on this hypothesis. Hope this helps. Let me know if you have any other issues. Best regards.
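To make the decoding step concrete, here is a minimal sketch of greedy CTC collapsing applied to the two example strings above, plus a helper that strips spaces from a label. It works on strings with "-" as the blank symbol purely for illustration; the function names are hypothetical and this is not the decoder used in the repository.

```python
from itertools import groupby

BLANK = "-"  # blank symbol as written in the example strings above


def ctc_greedy_collapse(raw, blank=BLANK):
    """Collapse consecutive repeated symbols, then drop blank symbols."""
    collapsed = (symbol for symbol, _ in groupby(raw))
    return "".join(symbol for symbol in collapsed if symbol != blank)


def strip_spaces(text):
    """Remove all whitespace from a ground-truth label before training."""
    return "".join(text.split())


print(ctc_greedy_collapse("hh--eee--ll--lll--ooo-"))
print(ctc_greedy_collapse("hh--eee--ll--lll--ooo------ww--ooo-rr--l-dd-"))
print(strip_spaces("Account No. 200000592"))
```

Note that plain collapsing turns the second string into "helloworld": the long run of blanks is discarded like any other. Recovering a space from it would need an extra rule, such as treating a run of blanks above some length as a word boundary, and that remains an untested hypothesis.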

anidiatm41 commented 2 years ago

Great! The CTC part is understood.

One more request: could you please provide a sample of the annotation_trainin.txt and lexicon.txt files? -anidiatm41


arxyzan commented 2 years ago

Actually, don't limit yourself to this implementation, because it involves preparing different lexicons for train/test/validation. All you need is an annotation reference that maps paths to texts. In this implementation there is a lexicon.txt with all mappings, plus three more annotation files for train, validation and test. What you should do is fill self.texts and self.paths, then split the dataset into train and validation portions using PyTorch's torch.utils.data.random_split() function, like below:

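from torch.utils.data import random_split
dataset = CustomDataset()
train_len = int(len(dataset) * 0.8)  # 80% for training; the remainder goes to validation so the lengths sum to len(dataset)
train_dataset, valid_dataset = random_split(dataset, lengths=[train_len, len(dataset) - train_len])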

The above code creates a dataset, then passes it to random_split with the desired lengths (80% for train and the rest for validation) to get two dataset instances, one for training and one for validation. If train and validation need different structures or attributes, e.g. different transforms, you have to change it a little; refer to this answer in PyTorch's forum. Overall, I would really recommend writing your own Dataset object from scratch, as I have done in my Persian text recognition project (private repo). I can temporarily add you as a collaborator on that repo if you want. It's a much better/cleaner implementation than this one. Let me know.
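For the case where train and validation need different transforms, one common pattern is to wrap each split in a thin class that applies its own transform on access. The sketch below is an assumption on my part and not necessarily what the linked forum answer proposes; TransformedSubset is a hypothetical name, and plain Python lists stand in for the splits so the example runs without PyTorch.

```python
class TransformedSubset:
    """Wrap any indexable dataset (or subset) and apply a transform on access."""

    def __init__(self, subset, transform=None):
        self.subset = subset
        self.transform = transform

    def __len__(self):
        return len(self.subset)

    def __getitem__(self, index):
        image, text = self.subset[index]
        if self.transform is not None:
            image = self.transform(image)
        return image, text


# Toy stand-ins: a list of (image, text) pairs and a trivial "transform".
samples = [([1, 2, 3], "abc"), ([4, 5, 6], "def")]
train_view = TransformedSubset(samples, transform=lambda img: img[::-1])  # stands in for augmentation
valid_view = TransformedSubset(samples)  # no transform applied

print(train_view[0])
print(valid_view[0])
```

Because the wrapper only relies on len() and indexing, it should also accept the Subset objects returned by random_split, provided the underlying dataset returns (image, text) pairs.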

anidiatm41 commented 2 years ago

Thanks Aryan. I'd be interested in taking a look at your Persian text recognition project, if you can give me temporary access.

arxyzan commented 2 years ago

Cool. Check your inbox.

anidiatm41 commented 2 years ago

Got it. Thanks a lot!

arxyzan commented 2 years ago

You're welcome. Let me just point out a few things about that implementation:

The dataset structure is based on the file names: INDEX_TEXT.jpg. The files are read from a directory and their texts are extracted from the filenames. I recommend creating your dataset in this format so that everything stays together and the code just works.
Persian words are written right to left, so there is a Mirror transform for images (remove this for English).
Persian numbers are written left to right, so there is a check for whether the word is all digits, in which case the text is reversed (remove this for English).
The CRNN model cannot output more than 24 characters, so words longer than that are excluded from the dataset.
There is a Normalize transform in transforms whose values are calculated in dataset.py; just run python dataset.py.

I hope that implementation is not too complicated and that you have fun using it!
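The filename-based layout and the length limit described above can be sketched roughly as follows. The helper names and the sample filenames are hypothetical, and the private repo's actual code may differ; the sketch only assumes that everything after the first underscore in the file stem is the label.

```python
from pathlib import Path

MAX_TEXT_LENGTH = 24  # output limit mentioned above


def text_from_filename(path):
    """Return the TEXT part of a file named INDEX_TEXT.ext."""
    stem = Path(path).stem
    _, _, text = stem.partition("_")  # split at the first underscore only
    return text


def build_samples(filenames, max_length=MAX_TEXT_LENGTH):
    """Collect parallel paths/texts lists, skipping unusable files."""
    paths, texts = [], []
    for name in filenames:
        text = text_from_filename(name)
        if not text or len(text) > max_length:
            continue  # no label in the name, or too long for the model
        paths.append(name)
        texts.append(text)
    return paths, texts


filenames = [
    "0_hello.jpg",
    "1_world.jpg",
    "2_" + "x" * 25 + ".jpg",  # 25 characters: exceeds the limit, excluded
    "noindex.jpg",             # no underscore, so no label can be extracted
]
paths, texts = build_samples(filenames)
print(paths)
print(texts)
```

In practice the filenames would come from listing the dataset directory, for example with Path(directory).glob("*.jpg").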

anidiatm41 commented 2 years ago

Surely! Cheers!!

