Open PSanni opened 2 years ago
Fine-tuning should work for any language based on the Latin alphabet. If the language uses a different set of characters, you should define a new training charset configuration, e.g. configs/charset/arabic.yaml, and use it during training. You should also update charset_test with the same set of characters used for training. The training command should look something like ./train.py charset=arabic, where configs/charset/arabic.yaml contains:
# @package _global_
model:
  charset_train: "..."
  charset_test: "..."
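As an illustration only (not part of the repo), the charset string for such a config could be derived from the training labels. The helper names `build_charset` and `charset_yaml` are hypothetical, and the sketch assumes the same characters are used for both charset_train and charset_test, as suggested above.

```python
def build_charset(labels):
    """Return the unique characters found in the labels, in first-seen order."""
    seen = {}
    for label in labels:
        for ch in label:
            seen.setdefault(ch, None)
    return "".join(seen)


def charset_yaml(charset):
    """Render a charset config in the layout quoted above (hypothetical helper)."""
    # Escape backslashes and double quotes so the YAML double-quoted string stays valid.
    escaped = charset.replace("\\", "\\\\").replace('"', '\\"')
    return (
        "# @package _global_\n"
        "model:\n"
        f'  charset_train: "{escaped}"\n'
        f'  charset_test: "{escaped}"\n'
    )
```

The returned text could then be saved as UTF-8 to a file such as configs/charset/arabic.yaml.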
I don't have a definite answer for this since it would depend on the quality of your training data, and how similar its distribution is to the test data. In our experiments with real training data, PARSeq starts to perform well after 40k iterations (batch size = 384).
This is an old result for PARSeq. I was comparing the validation word accuracy for models trained exclusively on TextOCR (arbitrary) and its pose-corrected version (horizontal). DDP was used with 2 GPUs, so the effective iteration count is the number shown on the x-axis multiplied by 2.
@baudm Thanks for your great repo. I want to fine-tune for the Vietnamese language. Can the model be trained for it? If so, how should I prepare the dataset for training? Many thanks for your response.
@phamkhactu Re: finetuning, please refer to my first comment.
For dataset preparation, please refer to clovaai/deep-text-recognition-benchmark on how to convert your image-text pairs into LMDB databases.
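A rough sketch of what such a conversion might look like, assuming the key layout commonly used by that benchmark's conversion script (`image-%09d`, `label-%09d`, 1-based, plus `num-samples`). The function names are hypothetical, and the layout should be checked against the actual script before relying on it.

```python
def build_lmdb_entries(samples):
    """Map a list of (image_bytes, label) pairs to LMDB key/value pairs.

    Assumed key layout: image-%09d and label-%09d (1-based), plus num-samples.
    """
    entries = {}
    for index, (image_bytes, label) in enumerate(samples, start=1):
        entries[f"image-{index:09d}".encode()] = image_bytes
        entries[f"label-{index:09d}".encode()] = label.encode("utf-8")
    entries[b"num-samples"] = str(len(samples)).encode()
    return entries


def write_lmdb(output_dir, samples, map_size=1 << 30):
    """Write the entries to an LMDB environment (requires the lmdb package)."""
    import lmdb  # imported lazily so build_lmdb_entries works without it

    env = lmdb.open(output_dir, map_size=map_size)
    with env.begin(write=True) as txn:
        for key, value in build_lmdb_entries(samples).items():
            txn.put(key, value)
    env.close()
```

Here `image_bytes` is meant to be the raw encoded file content (e.g. the bytes of a JPEG or PNG), not a decoded array.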
- Fine-tuning should work for any language based on the Latin alphabet. If the language uses a different set of characters, you should define a new training charset configuration, e.g. configs/charset/arabic.yaml, and use it during training. You should also update charset_test with the same set of characters used for training. The training command should look something like ./train.py charset=arabic model.charset_test=<Arabic characters>
- I don't have a definite answer for this since it would depend on the quality of your training data, and how similar its distribution is to the test data. In our experiments with real training data, PARSeq starts to perform well after 40k iterations (batch size = 384).
When you say quality, do you mean the quality of the images, or the term coverage?
I am also having an issue with this. For training and validation I have set the train and test character sets to Chinese characters + Latin alphanumerics, and even created a separate YAML file for it.
When I print out model.config while training, it seems to show the charset properly, but after training, when I use the checkpoints to recognise images using read.py, it does not output any Chinese characters.
Not sure if this is an issue in read.py, train.py, test.py or the LMDB dataset itself, as the val accuracy is 99.93%. Please guide/help if possible.
For now I have used a dirty hack and used the 94_full charset's symbols to represent Chinese characters: mapping Chinese characters to symbols in the LMDB dataset, and then back from symbols to Chinese characters during/after inference.
@PSanni
When you say quality, do you mean the quality of the images, or the term coverage?
By dataset quality, I mean dataset size, diversity of samples, accuracy of labels, etc., not quality of images per se.
I am also having an issue with this. When training and validating I have set all character sets in train and test to Chinese characters + Latin alphanumerics and even created a separate file for yaml.
When I print out model.config while training, it seems to show the charset properly, but then after training when I use the checkpoints to recognise images using read.py it does not output any Chinese characters.
Not sure if this is an issue in read.py, train.py, test.py or lmdb dataset itself as the val accuracy is 99.93%. Please guide/help if possible.
@siddagra unless you have a very small and easy val set, val accuracy of 99.93% likely indicates a problem with your training setup.
- First, you have to make sure that your training dataset is correctly prepared. Open the lmdb archives, query an image and its corresponding label, and check if the image and label are correct and intact.
- Disable Unicode normalization: data.normalize_unicode=false
- Probe the SceneTextDataModule instance. You can do it in any script (train.py, read.py, and test.py). You can check the labels returned by LmdbDataset using the train_dataset or val_dataset property of the data module instance. Make sure it returns the expected labels.
- Check if CharsetAdapter works. Create an instance using your charset, e.g. adapter = CharsetAdapter(charset), then test if it returns the correct output given Chinese text: adapter(some_text).
Thanks a lot for your help!
I printed out several labels at several places: base.py, dataset.py, during LMDB encoding and decoding, etc.
base.py
pred = self.charset_adapter(pred)
print(pred)
dataset.py
label = charset_adapter(label)
print(label)
It seems to be working fine everywhere. The only issue seems to be read.py itself, perhaps.
The argmax of the variable p (the output logits) is not producing any index higher than 23, even though the sequence should include character indices up to 96.
tensor([[ 6, 9, 20, 9, 9, 1, 9, 3, 4, 2, 5, 12, 23, 8, 8, 12, 23, 12,
23, 8, 21, 23, 13, 19, 13, 23, 8, 12, 12, 8, 12, 23, 23, 23, 8, 21,
8, 0, 8, 8, 8, 20, 13, 9, 22, 21, 21, 20, 8, 18, 25, 21, 8, 23,
22, 14, 16, 22, 19, 13, 14, 21, 17, 22, 23, 24, 13, 22, 25, 13, 23, 21,
12, 9, 23, 23, 13, 23, 22, 23, 23, 22, 13, 9, 8, 23, 23, 23, 24, 23,
13, 23, 23, 23, 21]], device='cuda:0')
I wanted to get results to report, but read.py is ignoring Chinese characters and
It somehow started working now. I think because I disabled Unicode normalisation. Thanks!
test.py is giving this error:
dataset.py", line 72, in __del__
self.env.close()
AttributeError: 'LmdbDataset' object has no attribute 'env'
Also, is it possible to train only the LM? The dataset I am training on contains limited language in a specific format, and I do not want the model to overfit to this format and perform poorly otherwise. I was wondering if it is possible to train it only on text data/character sequences instead of images + labels. It may be useful to be able to train the LM on larger (non-image) language datasets for other languages with limited image data.
@siddagra
It somehow started working now. I think because I disabled unicode normalisation. Thanks!
test.py is giving this error:
dataset.py", line 72, in __del__
self.env.close()
AttributeError: 'LmdbDataset' object has no attribute 'env'
You're using old code. Pull the latest and update your dependencies.
Also, is it possible to train only the LM model? The dataset I am training on contains limited language of a specific format, but I do not want it to overfit to this format and have poor results otherwise, I was wondering if it was possible to only train it on text data/character sequences itself, instead of images+labels. It may be useful to be able to train LM on larger language (non-image) datasets for other languages with limited image data.
Sorry, this is not possible with PARSeq since its LM is internal. You can do this with ABINet, but honestly in my opinion, training on raw text has limited utility for STR since it is still primarily a visual recognition problem.
To alleviate the issue with your limited training data, I would suggest using a more extreme augmentation on the images: rotate them by 90, 180, or 270 degrees. You can do this by modifying augment.py or by directly adding the rotations inside the image transforms in module.py.
You may also lower the batch size in order to increase the variance and lessen the bias for each mini-batch. You could also play around with the value of K.
One STR-specific augmentation would be to form new training data by concatenating existing data. I have implemented a simple version of this and it works (but more experimental validation is needed). The algorithm is something like this:
Lastly, you may try adding the augmentations implemented in straug.
Is it possible to continue training from checkpoints? If so, are there any pre-trained weights available for fine-tuning? It would be great if you could write a short note on it.
https://github.com/baudm/parseq/issues/7#issuecomment-1198845845
@PSanni @siddagra @bmusq
As of commit b290950dad5a3dceb574cbc2d902765e1496ace2, finetuning is now officially supported. The checkpoint parameter of test.py and read.py has been changed accordingly.
Now you can do:
# Finetuning
./train.py pretrained=parseq # parseq-tiny, etc. See released weights
# Resume from PL checkpoint
./train.py ckpt_path=outputs/parseq/.../last.ckpt
# Use pretrained weights for testing
./test.py pretrained=parseq # same with read.py
# Or your own trained weights
./test.py outputs/parseq/.../last.ckpt # same with read.py
I am also having an issue with this. When training and validating I have set all character sets in train and test to Chinese characters + Latin alphanumerics and even created a separate file for yaml.
When I print out model.config while training, it seems to show the charset properly, but then after training when I use the checkpoints to recognise images using read.py it does not output any Chinese characters.
Not sure if this is an issue in read.py, train.py, test.py or lmdb dataset itself as the val accuracy is 99.93%. Please guide/help if possible.
@siddagra unless you have a very small and easy val set, val accuracy of 99.93% likely indicates a problem with your training setup.
- First, you have to make sure that your training dataset is correctly prepared. Open the lmdb archives, query an image and its corresponding label, and check if the image and label are correct and intact.
- Disable Unicode normalization: data.normalize_unicode=false
- Probe the SceneTextDataModule instance. You can do it in any script (train.py, read.py, and test.py). You can check the labels returned by LmdbDataset using the train_dataset or val_dataset property of the data module instance. Make sure it returns the expected labels.
- Check if CharsetAdapter works. Create an instance using your charset, e.g. adapter = CharsetAdapter(charset), then test if it returns the correct output given Chinese text: adapter(some_text).
I am running a synthetic data generator + imgaug to generate augmentations/distortions so that I can incorporate my own formats/language requirements. Is there any way to load images dynamically during data loading, instead of having to specify an LMDB dataset? Or do you think that would make training too slow?
Now you can do:
# Finetuning
./train.py pretrained=parseq # parseq-tiny, etc. See released weights
How should the finetuning data be properly passed to train.py at this point?
As far as I know, you should have your data in the LMDB format. Make use of the create_lmdb_dataset.py script provided in the tools folder. First make sure that your data can actually be fed into this script; this probably requires you to write your own converter. In the same folder, check the other Python files, which are converters themselves.
Once that's done, you have to put data.mdb and lock.mdb in the data\train\real folder. Thoroughly follow the folder structure described in the data section of the Readme.
Now if, like me, you have downloaded all the datasets used in the paper, your data folder should already be well populated. Something you can do is create a new folder, let's say _customdata, follow the same structure, and put your .mdb files there.
Finally, in configs, open main.yaml and change root_dir to _customdata.
I have done some finetuning myself and it works like a charm.
@bmusq have you changed any parameters like the number of epochs or the learning rate? Or did you just run train.py as is? And one more thing: did you use only your data for tuning, or did you add it to the initial data from the paper?
I used the pretrained weights and only my own data for tuning. Using read.py, I was able to visually confirm the improvement.
As for other parameters:
@bmusq so, it is possible to finetune the model even when changing the charset, right?
I believe it is, yes. I think that is what @siddagra has done. Please see the top of this thread
One more thing: if the amount of data you have is small, as it was in my case, you might also want to change val_check_interval in configs\main.yaml. By default it is set to 1000, which means validation runs every 1000 batches. If you do not have enough data per epoch, you will never trigger validation because you do not actually have that many batches, especially with batches of size 384.
Something you can do is set val_check_interval to a value between 0 and 1, and it will trigger validation after the given fraction of the training epoch. If you want to explore more about very specific parameters of the Trainer please follow this link: https://pytorch-lightning.readthedocs.io/en/stable/common/trainer.html
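To make the arithmetic concrete, here is a small sketch (the helper names are hypothetical, not Lightning API). It assumes an integer val_check_interval counts training batches and a float is a fraction of the epoch, as described above. How the Trainer reacts to an unreachable integer interval may differ between Lightning versions.

```python
import math


def batches_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of training batches in one epoch."""
    if drop_last:
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)


def validation_is_reachable(num_samples, batch_size, val_check_interval):
    """Return True if validation can run at least once per epoch.

    An int interval counts training batches; a float in (0, 1] is a fraction of the epoch.
    """
    if isinstance(val_check_interval, float):
        return 0.0 < val_check_interval <= 1.0
    return val_check_interval <= batches_per_epoch(num_samples, batch_size)
```

For example, 10,000 training images with a batch size of 384 give only 27 batches per epoch, so an interval of 1000 batches is never reached within an epoch, whereas a fractional value such as 0.5 is.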
@bmusq so, it is possible to finetune the model even when changing the charset, right?
I tried to fine-tune the pre-trained "parseq" with Arabic data. I don't know if it's just me, but amending the train charset with Arabic characters triggered a dimension mismatch error. I am still looking for a workaround, but it would be great if anyone who has succeeded could advise on it.
Have you tried disabling Unicode normalization?
Yes, I did. But the problem is with the size of the embedding layer, which is 95 (the characters in 94_full.yaml). Therefore, including any additional characters causes a mismatch error. So I think the only option left is retraining the model with multiple languages, or extending the input layer with an additional layer and freezing some weights.
@PSanni
Yes, i did. But the problem is with size of embedding layer, which is 95 (char in 94_full.yaml). Therefore, including any additional characters cause mismatch error. So, i think only option left is retraining model with multiple languages or permuting input layer with additional layer and freezing some weights.
If you don't change img_size and patch_size, you could still use the pretrained weights to initialize the encoder and the decoder. You need to do it manually though; refer to the PyTorch docs on how to do partial loading of a state dict. The character and position embeddings have to be trained from scratch if you change charset or model.max_label_length.
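One way the partial loading could be sketched is to keep only the pretrained entries whose name and shape match the new model, leaving the rest (e.g. resized embeddings or output layers) at their fresh initialization. `filter_compatible` is a hypothetical helper, not something provided by the repo. It only relies on each value having a `.shape`, so it works on PyTorch state dicts.

```python
def filter_compatible(pretrained_state, target_state):
    """Keep pretrained entries whose name and shape match the target model."""
    kept, skipped = {}, []
    for name, tensor in pretrained_state.items():
        target = target_state.get(name)
        if target is not None and tuple(target.shape) == tuple(tensor.shape):
            kept[name] = tensor
        else:
            skipped.append(name)
    return kept, skipped


# Usage with PyTorch (pretrained_model and model are assumed to exist already):
#   kept, skipped = filter_compatible(pretrained_model.state_dict(), model.state_dict())
#   model.load_state_dict(kept, strict=False)  # strict=False tolerates the missing keys
#   print("trained from scratch:", skipped)
```

Printing the skipped names is a cheap sanity check that only the charset- and length-dependent parameters were left out.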
Now you can do:
# Finetuning
./train.py pretrained=parseq # parseq-tiny, etc. See released weights
How should the finetuning data be properly passed to the train.py at this point?
I set up the data using the process I mentioned in: https://github.com/baudm/parseq/issues/7
Can you not just add dummy characters to charset_train and remove them from charset_test? Unless you need more than 94 chars, this should work. This is essentially what I did.
Agreed, but I am trying to train it for multilingual use, so I have to use all the characters.
@baudm I have tried to recognize images that contain a whole sentence. I know that your model currently works at the word level. My question is: can the model be trained on sentence-level input images?
Should the training examples have a particular size, or should I vary the resolution? Is it better to add images that only contain cropped text?
Is there a better way to handle multiple languages that share the same literals? Let's say that I want to fit a model with English and Greek words in the training data. Should I use the same symbol for "o" in English and Greek words, or should I add one more "o"? Should the charset contain only characters that are visually distinct? Does it affect the fitting procedure somehow?
@baudm I have tried to recognize images that contain a whole sentence. I know that your model currently works at the word level. My question is: can the model be trained on sentence-level input images?
@phamkhactu yes, the model can be modified to train on long sentences. Off the top of my head, possible approaches are:
- Single input: very wide image. Need to adjust img_size and patch_size to accommodate the expected wide images. Possible issue: quadratic increase in compute requirements since MHA is O(n^2).
- Multiple inputs: use a sliding window approach. Apply model to non-overlapping crops of the input. Possible issues: characters at crop boundary will be cut off, repeated character detections.
Should the training examples have a particular size, or should I vary the resolution? Is it better to add images that only contain cropped text?
@rogachevai STR operates on cropped image inputs. Models in this repo were trained on 128x32 px images.
Is there a better way to handle multiple languages that share the same literals? Let's say that I want to fit a model with English and Greek words in the training data. Should I use the same symbol for "o" in English and Greek words, or should I add one more "o"? Should the charset contain only characters that are visually distinct? Does it affect the fitting procedure somehow?
PARSeq and the other models here are all character-based methods. If the shapes of the characters are roughly the same, e.g. o and ó, it's better to use the same literal for both. In fact, this is the default behavior: Unicode characters are normalized (data.normalize_unicode=True) such that accented characters are converted to their base (ASCII) form (accents and formatting are discarded).
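To illustrate what this kind of normalization does to non-Latin text, here is a minimal sketch. It assumes the normalization amounts to an NFKD decomposition followed by dropping every non-ASCII character. That matches the description above, but the exact implementation in the repo should be checked.

```python
import unicodedata


def normalize_label(label):
    """Decompose characters (NFKD), then drop everything that is not ASCII."""
    return unicodedata.normalize("NFKD", label).encode("ascii", "ignore").decode()


print(normalize_label("café"))      # accent is stripped
print(normalize_label("苏A123"))    # the Chinese character is removed entirely
print(normalize_label("\u03bf"))    # Greek omicron is removed, not mapped to "o"
```

Under this assumption, labels in Chinese, Arabic or Greek lose their non-ASCII characters unless data.normalize_unicode=false is set, which is consistent with the symptoms reported earlier in the thread.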
Multiple inputs: use a sliding window approach. Apply model to non-overlapping crops of the input. Possible issues: characters at crop boundary will be cut off, repeated character detections.
Perhaps one can use a text detector model to first get word-by-word crops, then run PARSeq on them in a batch. This is typically how such a case is handled in STR, AFAIK.
Yes, in my experiments I found that word-level text detection followed by PARSeq performed best for English and 4 other languages. However, it was not good with non-Latin languages when there were more than 2 words.
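For the sliding-window idea quoted above, the crop positions could be computed roughly as follows. This is only a sketch with a hypothetical helper name. Merging the per-crop predictions, which is the hard part, is not covered.

```python
def sliding_windows(image_width, window_width, stride):
    """Return (left, right) pixel ranges covering the full image width."""
    if window_width <= 0 or stride <= 0:
        raise ValueError("window_width and stride must be positive")
    if image_width <= window_width:
        return [(0, image_width)]
    lefts = list(range(0, image_width - window_width + 1, stride))
    # Add a final window flush with the right edge if the stride left a gap.
    if lefts[-1] + window_width < image_width:
        lefts.append(image_width - window_width)
    return [(left, left + window_width) for left in lefts]
```

A stride equal to the window width gives the non-overlapping crops mentioned in the quote. A smaller stride produces overlapping crops, which may reduce characters being cut at the boundaries at the cost of more repeated detections to reconcile.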
Models in this repo were trained on 128x32 px images.
So, you mean ViT, don't you? All the images are cropped, all the crops are processed, and you aggregate embeddings just the way ViT does, right? At this point it looks like the initial image shape doesn't matter. I just noticed that images in the datasets you used for training have different shapes in different sets, so I wanted to figure out whether some particular pool of shapes exists.
Loading a pretrained model using the pretrained argument in main.yaml causes a KeyError.
Traceback (most recent call last):
File "/.../parseq/train.py", line 83, in main
trainer.fit(model, datamodule=datamodule)
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 770, in fit
self._call_and_handle_interrupt(
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 723, in _call_and_handle_interrupt
return trainer_fn(*args, **kwargs)
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 811, in _fit_impl
results = self._run(model, ckpt_path=self.ckpt_path)
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1236, in _run
results = self._run_stage()
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1323, in _run_stage
return self._run_train()
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1345, in _run_train
self._run_sanity_check()
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1413, in _run_sanity_check
val_loop.run()
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/loops/base.py", line 204, in run
self.advance(*args, **kwargs)
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 128, in advance
output = self._evaluation_step(**kwargs)
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 226, in _evaluation_step
output = self.trainer._call_strategy_hook("validation_step", *kwargs.values())
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 1765, in _call_strategy_hook
output = fn(*args, **kwargs)
File "/home/user/.local/lib/python3.9/site-packages/pytorch_lightning/strategies/strategy.py", line 344, in validation_step
return self.model.validation_step(*args, **kwargs)
File "/.../parseq/strhub/models/base.py", line 153, in validation_step
return self._eval_step(batch, True)
File "/.../parseq/strhub/models/base.py", line 107, in _eval_step
logits, loss, loss_numel = self.forward_logits_loss(images, labels)
File "/.../parseq/strhub/models/base.py", line 177, in forward_logits_loss
targets = self.tokenizer.encode(labels, self.device)
File "/.../parseq/strhub/data/utils.py", line 115, in encode
batch = [torch.as_tensor([self.bos_id] + self._tok2ids(y) + [self.eos_id], dtype=torch.long, device=device)
File "/.../parseq/strhub/data/utils.py", line 115, in <listcomp>
batch = [torch.as_tensor([self.bos_id] + self._tok2ids(y) + [self.eos_id], dtype=torch.long, device=device)
File "/.../parseq/strhub/data/utils.py", line 56, in _tok2ids
return [self._stoi[s] for s in tokens]
File "/.../parseq/strhub/data/utils.py", line 56, in <listcomp>
return [self._stoi[s] for s in tokens]
KeyError: '苏'
Seems that the BaseTokenizer class is using the default charset, which is not overwritten, and this causes the KeyError.
Charset in BaseTokenizer:
0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
Overwriting the charset in __init__ seems to fix this issue:
from abc import ABC

class BaseTokenizer(ABC):
    def __init__(self, charset: str, specials_first: tuple = (), specials_last: tuple = ()) -> None:
        # Hard-coded override of the charset argument.
        # Pad this with any characters to a length of 94 to be compatible with the pretrained weights.
        charset = "皖P沪津渝冀晋蒙辽吉黑苏浙京闽赣鲁豫鄂川贵云藏陕甘青宁新警学"
        self._itos = specials_first + tuple(charset) + specials_last
        self._stoi = {s: i for i, s in enumerate(self._itos)}
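The padding mentioned in the comment could be done along these lines. This is a sketch of the workaround, not an officially supported path: `pad_charset` is a hypothetical helper, and the choice of Private Use Area code points as fillers is an assumption (any characters that never occur in the labels would do).

```python
def pad_charset(charset, target_len=94):
    """Pad a charset with unused filler characters up to target_len."""
    unique = "".join(dict.fromkeys(charset))  # drop duplicates, keep order
    if len(unique) > target_len:
        raise ValueError(f"charset has {len(unique)} characters, more than {target_len}")
    # Fillers come from the Unicode Private Use Area, assumed absent from real labels.
    fillers = (chr(code) for code in range(0xE000, 0xF900))
    padded = unique
    for filler in fillers:
        if len(padded) == target_len:
            break
        if filler not in padded:
            padded += filler
    return padded
```

The padded string would go into charset_train, while charset_test keeps only the real characters, which matches the dummy-character trick described elsewhere in this thread.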
@siddagra
Loading pretrained model using pretrained argument in main.yaml causes KeyError.
If you finetune a pretrained model specified by the pretrained argument, you're restricted to the 94-character Latin vocabulary. If you want to use a different character set, you need to train from scratch, but you have the option of initializing parts of the model using the pretrained weights (by modifying train.py and manually using load_state_dict()).
@baudm I have try to recognize Image(contain the sentence input). I know that your model now use for word level. My question is: Does model can train input image(sentence)?
@phamkhactu yes, the model can be modified to train on long sentences. Off the top of my head, possible approaches are:
- Single input: very wide image. Need to adjust img_size and patch_size to accommodate the expected wide images. Possible issue: quadratic increase in compute requirements since MHA is O(n^2).
- Multiple inputs: use a sliding window approach. Apply model to non-overlapping crops of the input. Possible issues: characters at crop boundary will be cut off, repeated character detections.
Should the training examples have some particular size or whether I'd better try to vary the resolution? Is it better to add images that only contain cropped text?
@rogachevai STR operates on cropped image inputs. Models in this repo were trained on 128x32 px images.
Is there a better way to process cases of multiple languages with same literals? Let's say that I want to fit a model with English and Greek words in the training data. Should I use the same symbol for "o" in English and Greek words or should I add one more "o"? So does the charset should contain only different characters in term of visual representation? Does it affect the fitting procedure somehow?
PARSeq and the other models here are all character-based methods. If the shapes of the characters are roughly the same, e.g. o and ó, it's better to use the same literal for both. In fact, this is the default behavior: Unicode characters are normalized (data.normalize_unicode=True) such that accented characters are converted to their base (ASCII) form (accents and formatting are discarded).
@baudm I changed the configs: input image size to 32x150, max_label_length to 300, normalize_unicode=True. Training reaches a high validation accuracy (90.65846), but when I test, the model doesn't work well:
Result is:
With thie rize
Thanks for your help
I also trained this model to recognize Chinese characters, but I trained it from scratch. You can change charset_train and charset_test in the YAML config file.
When I use the 62-character charset to finetune PARSeq, it causes an error. Can the model only be finetuned with the 94-character charset?
What is the simplest way to load pretrained weights into the tune script in order to fine-tune them?
I have trained PARSeq on synthetic data and now I want to use this checkpoint to further train the model with real data. I have a checkpoint named "epoch=412-step=114259-val_accuracy=96.8273-val_NED=97.6637.ckpt". I believe that to finetune now, I have to use the ./train.py ckpt_path=outputs/parseq/.../...NED=97.6637.ckpt command (refer to the comment above). But I am getting the following error.
$ ./train.py ckpt_path=checkpoint/checkpoints/epoch=412-step=114259-val_accuracy=96.8273-val_NED=97.6637.ckpt
mismatched input '=' expecting <EOF>
See https://hydra.cc/docs/1.2/advanced/override_grammar/basic for details
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
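The error appears to come from the extra = characters in the checkpoint filename, which Hydra's override grammar does not accept in an unquoted value. A likely fix is to quote the value inside the override. The sketch below builds such an argument, with `hydra_override` being a hypothetical helper and the fix itself untested against this repo.

```python
def hydra_override(key, value):
    """Build a Hydra override with the value wrapped in double quotes."""
    # Escape backslashes and double quotes so the quoted value stays well-formed.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{key}="{escaped}"'


ckpt = "checkpoint/checkpoints/epoch=412-step=114259-val_accuracy=96.8273-val_NED=97.6637.ckpt"
command = ["./train.py", hydra_override("ckpt_path", ckpt)]
# subprocess.run(command, check=True) would pass the argument without shell quoting issues.
```

From a shell, the equivalent would be to wrap the whole override in single quotes so that the inner double quotes reach Hydra intact. Renaming the checkpoint file to remove the = characters is another option.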
Any recommendations on how to train or fine-tune the model on a new language?