baudm / parseq

Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)
https://huggingface.co/spaces/baudm/PARSeq-OCR
Apache License 2.0

Recommendations for training on a new language? #9

Open PSanni opened 2 years ago

PSanni commented 2 years ago

Any recommendations for training or fine-tuning the model on a new language?

  1. Will training for a new language (e.g. Arabic) work on the existing pre-trained models, or does it have to be from scratch?
  2. What is the recommended amount of data for a new language?
baudm commented 2 years ago
  1. Fine-tuning should work for any language based on the Latin alphabet. If the language uses a different set of characters, you should define a new training charset configuration, e.g. configs/charset/arabic.yaml and use it during training. You should also update charset_test with the same set of characters used for training. The training command should look something like ./train.py charset=arabic, where configs/charset/arabic.yaml contains:

    # @package _global_
    model:
      charset_train: "..."
      charset_test: "..."
  2. I don't have a definite answer for this since it would depend on the quality of your training data, and how similar its distribution is to the test data. In our experiments with real training data, PARSeq starts to perform well after 40k iterations (batch size = 384).

baudm commented 2 years ago
[Figure: validation word accuracy vs. training iteration, TextOCR arbitrary vs. horizontal]

This is an old result for PARSeq. I was comparing the validation word accuracy of models trained exclusively on TextOCR (arbitrary) and on its pose-corrected version (horizontal). DDP was used with 2 GPUs, so the effective iteration count is the number shown on the x-axis multiplied by 2.

phamkhactu commented 2 years ago
  1. Fine-tuning should work for any language based on the Latin alphabet. If the language uses a different set of characters, you should define a new training charset configuration, e.g. configs/charset/arabic.yaml and use it during training. You should also update charset_test with the same set of characters used for training. The training command should look something like ./train.py charset=arabic model.charset_test=<Arabic characters>
  2. I don't have a definite answer for this since it would depend on the quality of your training data, and how similar its distribution is to the test data. In our experiments with real training data, PARSeq starts to perform well after 40k iterations (batch size = 384).

@baudm Thanks for your great repo. I want to fine-tune it for the Vietnamese language. Can it be trained for that? And if it can, how should I prepare the dataset for training? Many thanks for your response.

baudm commented 2 years ago

@baudm Thanks for your great repo. I want to fine-tune it for the Vietnamese language. Can it be trained for that? And if it can, how should I prepare the dataset for training? Many thanks for your response.

@phamkhactu Re: finetuning, please refer to my first comment.

For dataset preparation, please refer to clovaai/deep-text-recognition-benchmark on how to convert your image-text pairs into LMDB databases.
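
For reference, writing such a database is straightforward; here is a minimal sketch following the deep-text-recognition-benchmark key layout (the sample paths and output directory are hypothetical):

import lmdb

# Keys follow the deep-text-recognition-benchmark convention:
# 'image-%09d' holds the raw image bytes, 'label-%09d' the UTF-8 text,
# and 'num-samples' the total count. Indices start at 1.
samples = [('img1.jpg', 'hello'), ('img2.jpg', 'world')]  # hypothetical pairs

env = lmdb.open('data/train/real/my_dataset', map_size=1 << 30)
with env.begin(write=True) as txn:
    for i, (path, label) in enumerate(samples, start=1):
        with open(path, 'rb') as f:
            txn.put(f'image-{i:09d}'.encode(), f.read())
        txn.put(f'label-{i:09d}'.encode(), label.encode())
    txn.put(b'num-samples', str(len(samples)).encode())
env.close()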

PSanni commented 2 years ago
  1. Fine-tuning should work for any language based on the Latin alphabet. If the language uses a different set of characters, you should define a new training charset configuration, e.g. configs/charset/arabic.yaml and use it during training. You should also update charset_test with the same set of characters used for training. The training command should look something like ./train.py charset=arabic model.charset_test=<Arabic characters>
  2. I don't have a definite answer for this since it would depend on the quality of your training data, and how similar its distribution is to the test data. In our experiments with real training data, PARSeq starts to perform well after 40k iterations (batch size = 384).

When you say quality, do you mean the quality of the images, or the term coverage?

siddagra commented 2 years ago

I am also having an issue with this. When training and validating, I set all character sets in train and test to Chinese characters + Latin alphanumerics, and even created a separate YAML file.

When I print out model.config while training, it seems to show the charset properly, but after training, when I use the checkpoints to recognise images with read.py, it does not output any Chinese characters.

Not sure if this is an issue in read.py, train.py, test.py, or the LMDB dataset itself, as the val accuracy is 99.93%. Please guide/help if possible.

siddagra commented 2 years ago

For now I have used a dirty hack: using the 94_full charset's symbols to represent Chinese characters, mapping Chinese characters to symbols in the LMDB dataset and then back from symbols to Chinese characters during/after inference.
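
For illustration, the hack might look something like this (the character-to-symbol table is hypothetical):

# Map Chinese characters to unused 94_full symbols before building the LMDB
# dataset, and map the model output back after inference.
chinese = '皖沪津渝冀'   # the characters actually needed
symbols = '!"#$%'       # unused 94_full symbols, one per character
to_symbol = str.maketrans(chinese, symbols)
to_chinese = str.maketrans(symbols, chinese)

label = '皖A12345'
stored = label.translate(to_symbol)      # written to the LMDB dataset
restored = stored.translate(to_chinese)  # applied to read.py output
assert restored == label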

baudm commented 2 years ago

@PSanni

When you say quality, do you mean the quality of the images, or the term coverage?

By dataset quality, I mean dataset size, diversity of samples, accuracy of labels, etc., not quality of images per se.

baudm commented 2 years ago

I am also having an issue with this. When training and validating, I set all character sets in train and test to Chinese characters + Latin alphanumerics, and even created a separate YAML file.

When I print out model.config while training, it seems to show the charset properly, but after training, when I use the checkpoints to recognise images with read.py, it does not output any Chinese characters.

Not sure if this is an issue in read.py, train.py, test.py, or the LMDB dataset itself, as the val accuracy is 99.93%. Please guide/help if possible.

@siddagra unless you have a very small and easy val set, val accuracy of 99.93% likely indicates a problem with your training setup.

  1. First, you have to make sure that your training dataset is correctly prepared. Open the lmdb archives, query an image and its corresponding label, and check if the image and label are correct and intact.
  2. Disable Unicode normalization: data.normalize_unicode=false
  3. Probe the SceneTextDataModule instance. You can do it in any script (train.py, read.py, and test.py). You can check the labels returned by LmdbDataset using the train_dataset or val_dataset property of the data module instance. Make sure it returns the expected labels.
  4. Check if CharsetAdapter works. Create an instance using your charset, e.g. adapter = CharsetAdapter(charset), then test if it returns the correct output given Chinese text: adapter(some_text).
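
A minimal sketch of steps 1 and 4 above, assuming the deep-text-recognition-benchmark key layout and that CharsetAdapter lives in strhub.data.dataset (verify against your checkout):

import io
import lmdb
from PIL import Image

# Step 1: open the LMDB archive and inspect the first sample.
env = lmdb.open('data/train/real/my_dataset', readonly=True, lock=False)
with env.begin() as txn:
    num_samples = int(txn.get(b'num-samples').decode())
    label = txn.get(b'label-000000001').decode()
    image = Image.open(io.BytesIO(txn.get(b'image-000000001')))
print(num_samples, repr(label), image.size)

# Step 4: check that CharsetAdapter preserves your characters.
from strhub.data.dataset import CharsetAdapter

adapter = CharsetAdapter('0123456789苏皖沪')  # hypothetical charset
print(adapter('苏A12345'))  # out-of-charset characters are filtered out
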
siddagra commented 2 years ago

Thanks a lot for your help!

I printed out labels at several places: base.py, dataset.py, during LMDB encoding and decoding, etc.

base.py

        pred = self.charset_adapter(pred)
        print(pred)

dataset.py

                label = charset_adapter(label)
                print(label)

It seems to be working fine everywhere. The only issue seems to be read.py itself, perhaps. The argmax of the variable p (the output logits) is not producing any index higher than 23, even though the sequence should include character indices up to 96.

tensor([[ 6,  9, 20,  9,  9,  1,  9,  3,  4,  2,  5, 12, 23,  8,  8, 12, 23, 12,
         23,  8, 21, 23, 13, 19, 13, 23,  8, 12, 12,  8, 12, 23, 23, 23,  8, 21,
          8,  0,  8,  8,  8, 20, 13,  9, 22, 21, 21, 20,  8, 18, 25, 21,  8, 23,
         22, 14, 16, 22, 19, 13, 14, 21, 17, 22, 23, 24, 13, 22, 25, 13, 23, 21,
         12,  9, 23, 23, 13, 23, 22, 23, 23, 22, 13,  9,  8, 23, 23, 23, 24, 23,
         13, 23, 23, 23, 21]], device='cuda:0')

I wanted to get results to report, but read.py was ignoring Chinese characters. It somehow started working now; I think it's because I disabled Unicode normalisation. Thanks!

test.py is giving this error:

dataset.py", line 72, in __del__
    self.env.close()
AttributeError: 'LmdbDataset' object has no attribute 'env'

Also, is it possible to train only the LM? The dataset I am training on covers a limited language of a specific format, but I do not want the model to overfit to this format and perform poorly otherwise. I was wondering whether it is possible to train it on text data/character sequences alone, instead of images + labels. It may be useful to be able to train the LM on larger (non-image) text datasets for other languages with limited image data.

baudm commented 2 years ago

@siddagra

It somehow started working now; I think it's because I disabled Unicode normalisation. Thanks!

test.py is giving this error:

dataset.py", line 72, in __del__
    self.env.close()
AttributeError: 'LmdbDataset' object has no attribute 'env'

You're using old code. Pull the latest and update your dependencies.

Also, is it possible to train only the LM? The dataset I am training on covers a limited language of a specific format, but I do not want the model to overfit to this format and perform poorly otherwise. I was wondering whether it is possible to train it on text data/character sequences alone, instead of images + labels. It may be useful to be able to train the LM on larger (non-image) text datasets for other languages with limited image data.

Sorry, this is not possible with PARSeq since its LM is internal. You can do this with ABINet, but honestly in my opinion, training on raw text has limited utility for STR since it is still primarily a visual recognition problem.

To alleviate the issue with your limited training data, I would suggest using a more extreme augmentation on the images: rotate them by 90, 180, or 270 degrees. You can do this by modifying augment.py or by directly adding the rotations inside the image transforms in module.py.
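
For instance, a transform along these lines could be added to the training pipeline (a sketch, assuming PIL images as in the existing transforms):

import random
from PIL import Image

class RandomRightAngleRotation:
    """Extreme augmentation: rotate the crop by a random multiple of 90 degrees."""
    def __call__(self, img: Image.Image) -> Image.Image:
        return img.rotate(random.choice([0, 90, 180, 270]), expand=True)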

You may also lower the batch size in order to increase the variance and lessen the bias of each mini-batch. You could also play around with the value of K (the number of permutations).

One STR-specific augmentation would be to form new training data by concatenating existing samples. I have implemented a simple version of this and it works (though more experimental validation is needed). The algorithm is something like this (a sketch follows the list):

  1. Choose a pair of samples.
  2. Allocate the image width proportional to the label length. That is, W = 128 * len(A) / (len(A) + len(B)) would be the allocation in pixels for image A.
  3. Resize the images to W x 32 and (128 - W) x 32 pixels, then concatenate them side by side.
  4. Concatenate the labels.
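
A minimal sketch of this procedure, assuming PIL images and the 128x32 input size used by the models in this repo:

from PIL import Image

def concat_samples(img_a: Image.Image, label_a: str,
                   img_b: Image.Image, label_b: str,
                   width: int = 128, height: int = 32):
    # Step 2: allocate width proportional to label length.
    w_a = round(width * len(label_a) / (len(label_a) + len(label_b)))
    # Step 3: resize both crops and paste them side by side.
    canvas = Image.new('RGB', (width, height))
    canvas.paste(img_a.resize((w_a, height)), (0, 0))
    canvas.paste(img_b.resize((width - w_a, height)), (w_a, 0))
    # Step 4: concatenate the labels.
    return canvas, label_a + label_b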

Lastly, you may try adding the augmentations implemented in straug.

PSanni commented 2 years ago

Is it possible to continue training from checkpoints? If so, are there any pre-trained weights available for fine-tuning? It would be great if you could write a short note on it.

baudm commented 2 years ago

Is it possible to continue training from checkpoints? If so, are there any pre-trained weights available for fine-tuning? It would be great if you could write a short note on it.

https://github.com/baudm/parseq/issues/7#issuecomment-1198845845

baudm commented 2 years ago

@PSanni @siddagra @bmusq As of commit b290950dad5a3dceb574cbc2d902765e1496ace2, finetuning is now officially supported. The checkpoint parameter of test.py and read.py has been changed accordingly.

Now you can do:

# Finetuning
./train.py pretrained=parseq  # parseq-tiny, etc. See released weights
# Resume from PL checkpoint
./train.py ckpt_path=outputs/parseq/.../last.ckpt

# Use pretrained weights for testing
./test.py pretrained=parseq  # same with read.py
# Or your own trained weights
./test.py outputs/parseq/.../last.ckpt  # same with read.py
siddagra commented 2 years ago

I am also having an issue with this. When training and validating, I set all character sets in train and test to Chinese characters + Latin alphanumerics, and even created a separate YAML file. When I print out model.config while training, it seems to show the charset properly, but after training, when I use the checkpoints to recognise images with read.py, it does not output any Chinese characters. Not sure if this is an issue in read.py, train.py, test.py, or the LMDB dataset itself, as the val accuracy is 99.93%. Please guide/help if possible.

@siddagra unless you have a very small and easy val set, val accuracy of 99.93% likely indicates a problem with your training setup.

  1. First, you have to make sure that your training dataset is correctly prepared. Open the lmdb archives, query an image and its corresponding label, and check if the image and label are correct and intact.
  2. Disable Unicode normalization: data.normalize_unicode=false
  3. Probe the SceneTextDataModule instance. You can do it in any script (train.py, read.py, and test.py). You can check the labels returned by LmdbDataset using the train_dataset or val_dataset property of the data module instance. Make sure it returns the expected labels.
  4. Check if CharsetAdapter works. Create an instance using your charset, e.g. adapter = CharsetAdapter(charset), then test if it returns the correct output given Chinese text: adapter(some_text).

I am running a synthetic data generator + imgaug to generate augmentations/distortions so that I can incorporate my own formats/language requirements. Is there a way to have it dynamically load images during data loading, instead of having to specify an LMDB dataset? Or do you think that would make training too slow?

airogachev commented 2 years ago

Now you can do:

# Finetuning
./train.py pretrained=parseq  # parseq-tiny, etc. See released weights

How should the finetuning data be properly passed to train.py at this point?

bmusq commented 2 years ago

Now you can do:

# Finetuning
./train.py pretrained=parseq  # parseq-tiny, etc. See released weights

How should the finetuning data be properly passed to train.py at this point?

As far as I know, you should have your data formatted as LMDB. Make use of the create_lmdb_dataset.py script provided in the tools folder. First make sure that your data can actually be fed into this script; this will probably require you to write your own converter. In the same folder, check the other Python files, which are converters themselves.

Once that's done, you have to put data.mdb and lock.mdb in the data/train/real folder. Thoroughly follow the folder structure described in the README of the data section.

Now, if like me you have downloaded all the datasets used in the paper, your data folder should already be well populated. Something you can do is create a new folder, let's say _customdata, follow the same structure, and put your .mdb files there (see the sketch below).
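
For example, the training part of the custom root might look like this (mirror the val/test folders the same way, per the README):

_customdata
└── train
    └── real
        ├── data.mdb
        └── lock.mdb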

Finally, in configs, open main.yaml and change root_dir to _customdata.

I have done some finetuning myself and it works like a charm.

airogachev commented 2 years ago

@bmusq have you changed any parameters, like the number of epochs or the learning rate? Or did you just run train.py as is? And one more: did you use only your own data for tuning, or did you add it to the initial data from the paper?

bmusq commented 2 years ago

@bmusq have you changed any parameters, like the number of epochs or the learning rate? Or did you just run train.py as is? And one more: did you use only your own data for tuning, or did you add it to the initial data from the paper?

I used the pretrained weights and only my own data for tuning.

As for other parameters:

airogachev commented 2 years ago

@bmusq so it is possible to finetune the model even when changing the charset, right?

bmusq commented 2 years ago

@bmusq so it is possible to finetune the model even when changing the charset, right?

I believe it is, yes. I think that is what @siddagra has done. Please see the top of this thread.

bmusq commented 2 years ago

@bmusq have you changed any parameters, like the number of epochs or the learning rate? Or did you just run train.py as is? And one more: did you use only your own data for tuning, or did you add it to the initial data from the paper?

One more thing: if the amount of data you have is small, as in my case, you might also want to change val_check_interval in configs/main.yaml. By default it is set to 1000, which means validation runs every 1000 batches. If you don't have enough data per epoch, you will never trigger validation, because you never actually accumulate that many batches, especially with batches of size 384.

Something you can do is set val_check_interval to a value between 0 and 1; validation will then trigger after that fraction of each training epoch. If you want to explore the Trainer's parameters further, see: https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html
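
For example, assuming val_check_interval sits under the trainer config group as in main.yaml, the override can be passed on the command line:

./train.py pretrained=parseq trainer.val_check_interval=0.5  # validate twice per epoch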

PSanni commented 2 years ago

@bmusq so it is possible to finetune the model even when changing the charset, right?

I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me or others as well, but amending the train charset with Arabic characters triggered a dimension-mismatch error. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.

bmusq commented 2 years ago

@bmusq so it is possible to finetune the model even when changing the charset, right?

I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me or others as well, but amending the train charset with Arabic characters triggered a dimension-mismatch error. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.

Have you tried disabling Unicode normalization?

PSanni commented 2 years ago

@bmusq so it is possible to finetune the model even when changing the charset, right?

I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me or others as well, but amending the train charset with Arabic characters triggered a dimension-mismatch error. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.

Have you tried disabling Unicode normalization?

Yes, I did. But the problem is the size of the embedding layer, which is 95 (the characters in 94_full.yaml). Therefore, including any additional characters causes a mismatch error. So I think the only options left are retraining the model with multiple languages, or augmenting the input layer with additional entries and freezing some weights.

baudm commented 2 years ago

@PSanni

Yes, I did. But the problem is the size of the embedding layer, which is 95 (the characters in 94_full.yaml). Therefore, including any additional characters causes a mismatch error. So I think the only options left are retraining the model with multiple languages, or augmenting the input layer with additional entries and freezing some weights.

If you don't change img_size and patch_size, you could still use the pretrained weights to initialize the encoder and the decoder. You need to do it manually though; refer to PyTorch docs on how to do partial loading of state dict. The character and position embeddings have to be trained from scratch if you change charset or model.max_label_length.
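
A rough sketch of that partial loading, using the repo's torch.hub entry point. The module names to skip ('head', 'text_embed', 'pos_queries') match the current PARSeq code but should be verified against your version, and the second hub call merely stands in for however you build your new-charset model:

import torch

# Pretrained Latin-charset model (via the repo's torch.hub integration).
pretrained = torch.hub.load('baudm/parseq', 'parseq', pretrained=True)

# Stand-in for the model built with your new charset (e.g. by train.py).
new_model = torch.hub.load('baudm/parseq', 'parseq', pretrained=False)

# Drop the parameters whose shapes depend on the charset / max_label_length,
# then load the rest; strict=False leaves the dropped ones randomly initialized.
state = {k: v for k, v in pretrained.state_dict().items()
         if not k.startswith(('head.', 'text_embed.', 'pos_queries'))}
missing, unexpected = new_model.load_state_dict(state, strict=False)
print('trained from scratch:', missing)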

siddagra commented 2 years ago

Now you can do:

# Finetuning
./train.py pretrained=parseq  # parseq-tiny, etc. See released weights

How should the finetuning data be properly passed to train.py at this point?

I set up the data using the process I mentioned in: https://github.com/baudm/parseq/issues/7

siddagra commented 2 years ago

@bmusq so it is possible to finetune the model even when changing the charset, right?

I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me or others as well, but amending the train charset with Arabic characters triggered a dimension-mismatch error. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.

Have you tried disabling Unicode normalization?

Yes, I did. But the problem is the size of the embedding layer, which is 95 (the characters in 94_full.yaml). Therefore, including any additional characters causes a mismatch error. So I think the only options left are retraining the model with multiple languages, or augmenting the input layer with additional entries and freezing some weights.

Can you not just add dummy characters to charset_train and remove them from charset_test? Unless you need more than 94 characters, this should work. This is essentially what I did.

PSanni commented 2 years ago

@bmusq so it is possible to finetune the model even when changing the charset, right?

I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me or others as well, but amending the train charset with Arabic characters triggered a dimension-mismatch error. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.

Have you tried disabling Unicode normalization?

Yes, I did. But the problem is the size of the embedding layer, which is 95 (the characters in 94_full.yaml). Therefore, including any additional characters causes a mismatch error. So I think the only options left are retraining the model with multiple languages, or augmenting the input layer with additional entries and freezing some weights.

Can you not just add dummy characters to charset_train and remove them from charset_test? Unless you need more than 94 characters, this should work. This is essentially what I did.

Agreed, but I am trying to train it for multilingual use, so I have to use all the characters.

phamkhactu commented 2 years ago

@baudm I have tried to recognize an image containing a full sentence as input. I know that your model currently works at the word level. My question is: can the model be trained on sentence-level input images?

[image: example of a sentence-level input]

airogachev commented 2 years ago

Should the training examples have some particular size, or would it be better to vary the resolution? Is it better to add images that only contain cropped text?

airogachev commented 2 years ago

Is there a better way to handle multiple languages with the same literals? Let's say I want to fit a model with English and Greek words in the training data. Should I use the same symbol for "o" in English and Greek words, or should I add one more "o"? That is, should the charset contain only characters that differ in visual representation? Does it affect the fitting procedure somehow?

baudm commented 2 years ago

@baudm I have tried to recognize an image containing a full sentence as input. I know that your model currently works at the word level. My question is: can the model be trained on sentence-level input images?

@phamkhactu yes, the model can be modified to train on long sentences. Off the top of my head, possible approaches are:

  1. Single input: very wide image. Need to adjust img_size and patch_size to accommodate the expected wide images. Possible issue: quadratic increase in compute requirements since MHA is O(n^2).
  2. Multiple inputs: use a sliding window approach. Apply model to non-overlapping crops of the input. Possible issues: characters at crop boundary will be cut off, repeated character detections.
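
A minimal sketch of the sliding-window idea in item 2, assuming PIL inputs (boundary and overlap handling omitted):

from PIL import Image

def windows(img: Image.Image, win_w: int = 128, height: int = 32):
    # Scale to the model's input height, then slice non-overlapping crops.
    new_w = max(1, round(img.width * height / img.height))
    img = img.resize((new_w, height))
    return [img.crop((x, 0, min(x + win_w, img.width), height))
            for x in range(0, img.width, win_w)]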

Should the training examples have some particular size, or would it be better to vary the resolution? Is it better to add images that only contain cropped text?

@airogachev STR operates on cropped image inputs. Models in this repo were trained on 128x32 px images.

Is there a better way to handle multiple languages with the same literals? Let's say I want to fit a model with English and Greek words in the training data. Should I use the same symbol for "o" in English and Greek words, or should I add one more "o"? That is, should the charset contain only characters that differ in visual representation? Does it affect the fitting procedure somehow?

PARSeq and the other models here are all character-based methods. If the shapes of the characters are roughly the same, e.g. o and ó, it's better to use the same literal for both. In fact, this is the default behavior: Unicode characters are normalized (data.normalize_unicode=True) such that accented characters are converted to their base (ASCII) form (accents and formatting are discarded).
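
That normalization is essentially NFKD folding; a quick illustration of the behavior (the repo's dataset code does something very similar):

import unicodedata

def fold(label: str) -> str:
    # Decompose accented characters, then drop the non-ASCII combining marks.
    return unicodedata.normalize('NFKD', label).encode('ascii', 'ignore').decode()

print(fold('crème brûlée'))  # -> 'creme brulee'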

siddagra commented 2 years ago

Multiple inputs: use a sliding window approach. Apply model to non-overlapping crops of the input. Possible issues: characters at crop boundary will be cut off, repeated character detections.

Perhaps one could use a text detection model first to get word-by-word crops, then run PARSeq on them as a batch. This is typically how such cases are handled in STR, AFAIK.

PSanni commented 2 years ago

Multiple inputs: use a sliding window approach. Apply model to non-overlapping crops of the input. Possible issues: characters at crop boundary will be cut off, repeated character detections.

Perhaps one could use a text detection model first to get word-by-word crops, then run PARSeq on them as a batch. This is typically how such cases are handled in STR, AFAIK.

Yes, in my experiments I found that word-level text detection followed by PARSeq performed best for English and 4 other languages. However, it was not good with non-Latin languages when there are more than 2 words.

airogachev commented 2 years ago

Models in this repo were trained on 128x32 px images.

So you mean the ViT, don't you? All the images are cropped, all the crops are processed, and you aggregate the embeddings just the way ViT does, right? At this point it looks like the initial image shape doesn't matter. I just noticed that the images in the datasets you used for training have different shapes in different sets, so I wanted to figure out whether some particular pool of shapes exists.

siddagra commented 2 years ago

Loading the pretrained model using the pretrained argument in main.yaml causes a KeyError.

Traceback (most recent call last):
  File "/.../parseq/train.py", line 83, in main
    trainer.fit(model, datamodule=datamodule)
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
    self._call_and_handle_interrupt(
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
    results = self._run_stage()
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
    return self._run_train()
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
    self._run_sanity_check()
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
    val_loop.run()
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
    self.advance(*args, **kwargs)
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 128, in advance
    output = self._evaluation_step(**kwargs)
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 226, in _evaluation_step
    output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 344, in validation_step
    return self.model.validation_step(*args, **kwargs)
  File "/.../parseq/strhub/models/base.py", line 153, in validation_step
    return self._eval_step(batch, True)
  File "/.../parseq/strhub/models/base.py", line 107, in _eval_step
    logits, loss, loss_numel = self.forward_logits_loss(images, labels)
  File "/.../parseq/strhub/models/base.py", line 177, in forward_logits_loss
    targets = self.tokenizer.encode(labels, self.device)
  File "/.../parseq/strhub/data/utils.py", line 115, in encode
    batch = [torch.as_tensor([self.bos_id] + self._tok2ids(y) + [self.eos_id], dtype=torch.long, device=device)
  File "/.../parseq/strhub/data/utils.py", line 115, in <listcomp>
    batch = [torch.as_tensor([self.bos_id] + self._tok2ids(y) + [self.eos_id], dtype=torch.long, device=device)
  File "/.../parseq/strhub/data/utils.py", line 56, in _tok2ids
    return [self._stoi[s] for s in tokens]
  File "/.../parseq/strhub/data/utils.py", line 56, in <listcomp>
    return [self._stoi[s] for s in tokens]
KeyError: '苏'

It seems that the BaseTokenizer class uses the default charset and is not overridden, which then causes the KeyError.

Charset in BaseTokenizer:

0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~

Overriding the charset in __init__ seems to fix the issue:

class BaseTokenizer(ABC):

    def __init__(self, charset: str, specials_first: tuple = (), specials_last: tuple = ()) -> None:
        charset = "皖P沪津渝冀晋蒙辽吉黑苏浙京闽赣鲁豫鄂川贵云藏陕甘青宁新警学"  # pad this with arbitrary characters to length 94 to stay compatible with the pretrained weights
        self._itos = specials_first + tuple(charset) + specials_last
        self._stoi = {s: i for i, s in enumerate(self._itos)}
baudm commented 2 years ago

@siddagra

Loading the pretrained model using the pretrained argument in main.yaml causes a KeyError.

If you finetune a pretrained model specified by the pretrained argument, you're restricted to the 94-character Latin vocabulary. If you want to use a different character set, you need to train from scratch, but you have the option of initializing parts of the model with the pretrained weights (by modifying train.py and manually using load_state_dict()).

phamkhactu commented 2 years ago

@baudm I have tried to recognize an image containing a full sentence as input. I know that your model currently works at the word level. My question is: can the model be trained on sentence-level input images?

@phamkhactu yes, the model can be modified to train on long sentences. Off the top of my head, possible approaches are:

  1. Single input: very wide image. Need to adjust img_size and patch_size to accommodate the expected wide images. Possible issue: quadratic increase in compute requirements since MHA is O(n^2).
  2. Multiple inputs: use a sliding window approach. Apply model to non-overlapping crops of the input. Possible issues: characters at crop boundary will be cut off, repeated character detections.


@baudm I changed the configs: input image size to 32x150, max_label_length to 300, normalize_unicode=True. Training reaches a high val accuracy of 90.65846, but at test time the model doesn't work well on this image: [image]

Result is:

With thie rize

Thanks for your help

tommyjiang commented 2 years ago

Loading the pretrained model using the pretrained argument in main.yaml causes a KeyError.

[full traceback and BaseTokenizer workaround quoted verbatim from siddagra's comment above, ending in KeyError: '苏']

I also trained this model to recognize Chinese characters, but I trained it from scratch. You can change charset_train and charset_test in the YAML config file.

Sydeboy commented 1 year ago

[screenshot of the error] When I use the 62-character charset to finetune PARSeq, it causes this error. Can the model only be finetuned with the 94-character charset?

Dordor333 commented 4 months ago

Now you can do:

# Finetuning
./train.py pretrained=parseq  # parseq-tiny, etc. See released weights

How should the finetuning data be properly passed to train.py at this point?

I set up the data using the process I mentioned in: #7

What is the simplest way to load pretrained weights into the tune script in order to fine-tune them?

anikde commented 3 months ago

I have trained PARSeq on synthetic data, and now I want to use this checkpoint to further train the model on real data. I have a checkpoint "epoch=412-step=114259-val_accuracy=96.8273-val_NED=97.6637.ckpt". I believe that to finetune now, I have to use the ./train.py ckpt_path=outputs/parseq/.../...NED=97.6637.ckpt command (refer to the comment above). But I am getting the following error.

$ ./train.py ckpt_path=checkpoint/checkpoints/epoch=412-step=114259-val_accuracy=96.8273-val_NED=97.6637.ckpt
mismatched input '=' expecting <EOF>
See https://hydra.cc/docs/1.2/advanced/override_grammar/basic for details

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.