PSanni closed this issue 2 years ago
Hello @PSanni,
For your first problem, namely retaining the original form of words, I do not know how to address it.
However, for your second question, I was able to use another dataset of my own (currently being trained). Here is the solution I came up with; I hope it can be applied to your use case.
This project uses the datasets from another project, https://github.com/ku21fan/STR-Fewer-Labels, as mentioned in Datasets.md, with a few workarounds. If you look into that project, you will find a section in its Readme.md named "When you need to train on your own dataset or Non-Latin language datasets." I bet the name is explicit enough. They provide code in create_lmdb_dataset.py, as well as the input format for this file, to generate a dataset properly formatted for use by the algorithm and, a fortiori, by parseq as well.
I thoroughly followed the instructions and was able to start training parseq on my own dataset.
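For reference, the ground-truth file consumed by create_lmdb_dataset.py is a plain text file with one tab-separated image path and label per line (the deep-text-recognition-benchmark convention used by STR-Fewer-Labels). The paths below are illustrative, and the flag names should be double-checked against the script in your checkout:

```shell
# gt.txt: one "<image path relative to --inputPath><TAB><label>" pair per line
printf 'images/word_001.png\tsunflower oil\n' >  gt.txt
printf 'images/word_002.png\tXFUND\n'         >> gt.txt

# Then build the LMDB with the script from STR-Fewer-Labels, e.g.:
# python create_lmdb_dataset.py --inputPath . --gtFile gt.txt --outputPath data/train/real
```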
Edit: the training terminates, but the test shows really inconsistent results. Maybe the .mdb file is still problematic. I am exploring this issue.
@PSanni for now, you can just directly edit and comment out https://github.com/baudm/parseq/blob/98959c9a43dff9f44898f10a4e8541beb2961150/strhub/data/dataset.py#L85
Note that some preprocessed datasets have had the spaces within labels removed. For the datasets which I preprocessed (COCO, OpenVINO, TextOCR), the spaces within the labels should be intact.
For fine-tuning on other datasets, you have two options:
1. Create your own Dataset subclass which follows the same public interface as LmdbDataset.
2. Use the scripts in tools/ to write your own preprocessing script, then use create_lmdb_dataset.py to create the actual LMDB files.
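The Dataset-subclass option mentioned above might look like the following minimal sketch. FolderDataset, the gt.txt layout, and all paths are hypothetical, not part of parseq; the only assumption is that the loader expects (image, label) pairs like LmdbDataset yields:

```python
from pathlib import Path

from PIL import Image
from torch.utils.data import Dataset


class FolderDataset(Dataset):
    """Hypothetical example: reads <root>/gt.txt with one
    '<image path><TAB><label>' pair per line and yields (image, label)
    pairs, mirroring LmdbDataset's public interface."""

    def __init__(self, root, transform=None):
        self.root = Path(root)
        self.transform = transform
        lines = (self.root / 'gt.txt').read_text(encoding='utf-8').splitlines()
        # Split each line into [image path, label]; labels may contain tabs-free spaces.
        self.samples = [line.split('\t', 1) for line in lines if line]

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        img = Image.open(self.root / path).convert('RGB')
        if self.transform is not None:
            img = self.transform(img)
        return img, label
```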
I think it's a good idea to include annotation samples and the required input format for the model.
The LMDB format used is unchanged from prior work. create_lmdb_dataset.py expects a text file with one image path and label per line. The actual format is described in the README for the TextOCR and OpenVINO archives. The conversion from text labels to token IDs is handled by Tokenizer.encode() (in strhub/data/utils.py).
@PSanni since commit e8ea463, you can now disable whitespace removal and/or Unicode normalization like so:
./train.py data.remove_whitespace=false data.normalize_unicode=false
In addition to disabling whitespace (space, tab, newline, etc.) removal, make sure you add the space character ' ' to charset_train and charset_test so it won't get removed by CharsetAdapter.
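Putting the two pieces together, a space-preserving training invocation might look like the sketch below. The exact Hydra key names for the charsets (shown here as model.charset_train / model.charset_test) are an assumption on my part; check the files under configs/ in your checkout for the real paths:

```shell
# Assumed key names; note the trailing space inside the quotes,
# which adds ' ' to the charset.
./train.py data.remove_whitespace=false data.normalize_unicode=false \
    "model.charset_train=0123456789abcdefghijklmnopqrstuvwxyz " \
    "model.charset_test=0123456789abcdefghijklmnopqrstuvwxyz "
```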
Closing this now since all issues have been addressed already. Feel free to reopen if I missed anything.
The model currently cannot retain the original form of words. For example, if the words in an image are "sunflower oil", it returns "sunfloweroil" without the space. Is there any workaround to address this?
Also, is it possible to fine-tune this model on other datasets, such as XFUND (https://github.com/doc-analysis/XFUND)?