ankanbhunia / Handwriting-Transformers

Handwriting-Transformers (ICCV21)
MIT License

Model results #27

Open ewan-m opened 9 months ago

ewan-m commented 9 months ago

Hi!

I've been playing around with this model locally following the instructions in the README, and my results don't seem to be nearly as good as yours. I'm following your instructions here https://github.com/ankanbhunia/Handwriting-Transformers/issues/11#issuecomment-1081556415 and then running prepare.py in my fork.

For instance, even with different style prompts the model seems to generate very similar results for me. Real on the left, generated on the right:

IAM style 1 image-9-IAM

IAM style 2 image-6_1

Secondly, the CVL and IAM models give very different results from each other, yet each model gives quite consistent results across different styles.

CVL style 1 image-9-CVL

CVL style2 image-6

Is there something stupid I'm missing, or do I need to train it with these writers included in the dataset to get better results? Does the Google Drive contain the fully trained models that were used to generate the results in the paper?

Very cool project though - congrats!!

ankanbhunia commented 9 months ago

Thanks for sharing the results and your fork. I assume these examples are custom handwriting, not from the IAM/CVL datasets.

Well, it seems from the results that the model does perform worse in in-the-wild scenarios. Can you share a zip file of the style examples used above, so that I can test it on my machine and confirm whether anything is missing?

ewan-m commented 9 months ago

Of course! Here's a zip of 30 example 32x192 pixel word PNGs in style1 and style2.

styles.zip

really appreciate the help btw!

I've got the start of a web app where you take a picture of a page of your writing, and it then uses OCR to cut and scale everything into these PNG files ready to feed into the model. My plan is to export the model as ONNX and use ONNX Runtime Web to do the generation in the browser itself... if I can get some cool results locally first! 😁
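Roughly what I have in mind for the export step is sketched below. The class and tensor shapes are placeholders I made up for illustration, not the repo's real interface; in practice the trained generator checkpoint would be loaded instead.

```python
# Minimal sketch of an ONNX export for browser inference with ONNX Runtime Web.
# "HWTGenerator" is a stand-in placeholder so the sketch runs, not the repo's class.
import torch
import torch.nn as nn

class HWTGenerator(nn.Module):                 # placeholder module
    def forward(self, style_images, text_tokens):
        b, t = text_tokens.shape
        return torch.zeros(b, 1, 32, 16 * t)   # fake [B, 1, H, W] output

generator = HWTGenerator().eval()

dummy_style = torch.randn(1, 15, 1, 32, 192)   # assumed: 15 style word crops of 32x192
dummy_text = torch.randint(0, 80, (1, 20))     # assumed: encoded target string

torch.onnx.export(
    generator,
    (dummy_style, dummy_text),
    "hwt_generator.onnx",
    input_names=["style_images", "text_tokens"],
    output_names=["generated_words"],
    opset_version=14,
)
```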

ewan-m commented 9 months ago

So I've been playing about with it a bunch more, and what I've found suggests that the model is very sensitive to the exact resolution/scaling/thresholding of the original dataset, and doesn't handle anything being upscaled or downscaled differently from exactly how the source data was prepared.

Do you reckon it's worth training it further with a bunch of slightly differently rotated/scaled data, or do you reckon something else is going wrong for me entirely? 😁

ankanbhunia commented 9 months ago

Sorry for the late reply.

I suppose you are correct, but I am unsure whether training with differently rotated/scaled data would be beneficial. Also, doing so might make the training unstable.

I haven't been able to test the results of your examples yet. It's been a busy week. I will give it a try over the weekend.

ankanbhunia commented 9 months ago
Screenshot 2024-02-16 at 17 31 27

@ewan-m,

I tried the model with your style1 samples. The results I got don't look bad.

You can have a look at how I preprocess the style examples in the load_itw_samples() function. It applies a minimum bounding-area crop followed by a resize/padding operation.

https://github.com/ankanbhunia/Handwriting-Transformers/blob/f79913e3e1a536356761297cb55f9f4c1f99fcc8/data/dataset.py#L45
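In rough, self-contained form, the crop-then-resize/pad idea looks like the sketch below; the real logic lives in load_itw_samples() in data/dataset.py, and the ink threshold and canvas size here are illustrative assumptions.

```python
import numpy as np
from PIL import Image

def crop_resize_pad(path, target_h=32, target_w=192, ink_thresh=200):
    img = np.array(Image.open(path).convert("L"))

    # Minimum bounding box around the ink (pixels darker than the threshold).
    ys, xs = np.where(img < ink_thresh)
    if len(xs) > 0:
        img = img[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

    # Resize to the target height while keeping the aspect ratio.
    h, w = img.shape
    new_w = max(1, round(w * target_h / h))
    img = np.array(Image.fromarray(img).resize((new_w, target_h), Image.BILINEAR))

    # Pad (or crop) the width to a fixed canvas; background stays white.
    canvas = np.full((target_h, target_w), 255, dtype=np.uint8)
    canvas[:, :min(new_w, target_w)] = img[:, :target_w]
    return canvas
```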

Also, I added a notebook file demo_custom_handwriting.ipynb. There you just need to input the cropped word images of the custom handwriting; the images do NOT need to be scaled, resized, or padded, as load_itw_samples() takes care of that.

I tried to find out why your results are poor. I think the preprocessing, especially the minimum-area cropping, differs in your case. I also found that model.eval() was not called in the previous demo.ipynb file; that might have caused issues when feeding in images outside the training corpus.
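A tiny self-contained illustration of why the missing model.eval() matters (toy network, not the HWT model): with dropout left in training mode, the same input produces different outputs on every call.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))
x = torch.ones(1, 8)

net.train()
print(net(x))   # changes run to run: dropout is still sampling masks
print(net(x))

net.eval()
print(net(x))   # deterministic: dropout (and batch-norm statistics) are frozen
print(net(x))
```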

ewan-m commented 9 months ago

Thanks so much for this! I think adding model.eval() makes a big difference, and not having to care about resizing and padding is great too! I can confirm I reproduce the results you've shared above 😁

It's still quite hit-and-miss with other samples and seems sensitive to how the image is preprocessed, but I think that's to be expected.

I'm experimenting with the IAM vs. CVL model and with different preprocessing of the images to find what gives the optimal results. Preliminarily, some thresholding improves things greatly, but going for full binary black-or-white thresholding makes things worse again, and leaving in all the noise of the white page background is worst. I can share some images if you'd be interested!
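For reference, the kind of comparison I mean looks roughly like this; the threshold value and filenames are just illustrative, not tuned settings.

```python
import numpy as np
from PIL import Image

img = np.array(Image.open("word.png").convert("L"))   # example word crop

# Soft clean-up: flatten the light page background to pure white,
# but keep the grayscale values of the pen strokes (anti-aliased edges survive).
soft = img.copy()
soft[soft > 180] = 255

# Hard binarisation: every pixel becomes pure black or pure white,
# which throws away stroke-edge information and seems to hurt the results.
hard = np.where(img > 180, 255, 0).astype(np.uint8)

Image.fromarray(soft).save("word_soft.png")
Image.fromarray(hard).save("word_hard.png")
```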

ankanbhunia commented 9 months ago

Nice!

During training, we maintain a fixed receptive field of 16 pixels per character. So, to get optimal results, try to resize the style images to [16*len(string) x 32]. For example, a 4-character word 'door' should have dimensions [64 x 32]. This can reduce the domain gap further. A minimal sketch of that sizing rule is below.
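The filename in the sketch is just an example; only the width-per-character rule matters.

```python
# Sizing rule from above: width = 16 px per character, height = 32 px.
from PIL import Image

def resize_for_hwt(path, word, px_per_char=16, target_h=32):
    target_w = px_per_char * len(word)        # e.g. 'door' -> 64 x 32
    img = Image.open(path).convert("L")
    return img.resize((target_w, target_h), Image.BILINEAR)

style_word = resize_for_hwt("door.png", "door")   # example usage
```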

shuangzhen361 commented 7 months ago

Hi, why can't I produce good results despite trying many methods?

1656efcb184ad66b08348d2842724d3
shuangzhen361 commented 7 months ago

Here is my zip file, thank you! image.zip