YuzheZhang-1999 / DiffTSR

[CVPR2024] Diffusion-based Blind Text Image Super-Resolution (Official)

Bad Result #6

Open uhSuiL opened 3 weeks ago

uhSuiL commented 3 weeks ago

I used your model on my task, but the results don't seem that good. I resized my image to (512, 128) to match your input size. The original input image is the first below, followed by the result. test task2_new Is there anything wrong? Here is my code:

from PIL import Image
from omegaconf import OmegaConf

from model.IDM.utils.util import instantiate_from_config

if __name__ == '__main__':
    DiffTSR_yaml_config = './model/DiffTSR_config.yaml'
    DiffTSR_ckpt_config = './ckpt/DiffTSR.ckpt'
    DiffTSR_config = OmegaConf.load(DiffTSR_yaml_config)
    DiffTSR_model = instantiate_from_config(DiffTSR_config.model)
    DiffTSR_model.load_model(DiffTSR_ckpt_config)
    print("Model Loaded")

    lq_image_pil = Image.open('./test.png').convert('RGB')
    lq_image_pil = lq_image_pil.resize((512, 128))

    # Start sampling!
    sr_output = DiffTSR_model.DiffTSR_sample(lq_image_pil)
    # Save sr image!
    sr_image_pil = Image.fromarray(sr_output, 'RGB')
    sr_image_pil.save('./task2_new.png')
uhSuiL commented 3 weeks ago

I wondered whether the input image should contain only a single line of text, so I ran three more tests; the results still fall short of expectations:
(Test below: I masked the second line) test2 test2_new
(Test below: I cut out the second line and simply resized the image to (512, 128)) test3 test3_new
(Test below: I cut out the second line and the white margin in the first line, then resized to (512, 128) so the text is less deformed) test4 test4_new
I don't think my images are that hard for a human to read.

uhSuiL commented 3 weeks ago

I ran another test: I padded the left, right, top, and bottom to reach (512, 128), keeping the text centered and undistorted. This is the result: test5 test5_new

YuzheZhang-1999 commented 3 weeks ago

Thank you for your interest in this work. There are a few key points that require clarification.

First, this project currently applies only to single-line text images: the input is limited to 128x512 patches containing no more than 24 characters. For other images containing text, you should first detect each text line in the original image with a text detection method such as PaddleOCR, then crop and resize the patches to 128x512 before feeding them to the DiffTSR model.
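A naive crop-then-resize of a detected box will distort text whose aspect ratio differs from 4:1. As a rough illustration of the preprocessing described above (a sketch, not code from this repo; `expand_to_aspect` and its 512x128 defaults are assumptions based on the sizes quoted here), one can symmetrically expand the detected box to the model's aspect ratio before resizing:

```python
def expand_to_aspect(box, target_w=512, target_h=128):
    """Expand an axis-aligned text box (left, top, right, bottom) so its
    aspect ratio matches target_w:target_h; the subsequent resize to
    (target_w, target_h) then adds no distortion.
    Hypothetical helper, not part of the DiffTSR repo."""
    l, t, r, b = box
    w, h = r - l, b - t
    target_ratio = target_w / target_h  # 4.0 for 512x128
    if w / h < target_ratio:
        # Box is too narrow: widen it symmetrically.
        pad = (h * target_ratio - w) / 2
        l, r = l - pad, r + pad
    else:
        # Box is too wide: grow its height symmetrically.
        pad = (w / target_ratio - h) / 2
        t, b = t - pad, b + pad
    return (l, t, r, b)
```

Because the expansion is symmetric, the text also stays centered in the resulting patch. Coordinates may fall outside the image and would need clamping or padding in practice.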

Second, the text area in the cropped image should occupy the center. Usually, the text patches detected by the text detection model meet this condition. Additionally, the DiffTSR model is robust to text deformation.
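One way to satisfy this centering condition without distorting the text is to scale the detected patch to fit inside the 512x128 canvas and split the leftover space evenly. A minimal sketch (the helper `center_placement` is hypothetical, not part of DiffTSR):

```python
def center_placement(crop_w, crop_h, canvas_w=512, canvas_h=128):
    """Scale a (crop_w, crop_h) patch to fit inside the canvas while
    preserving its aspect ratio, and return (new_w, new_h, left, top):
    the resized dimensions and the paste offsets that center it.
    Hypothetical sketch, not from the DiffTSR repo."""
    scale = min(canvas_w / crop_w, canvas_h / crop_h)
    new_w, new_h = round(crop_w * scale), round(crop_h * scale)
    left = (canvas_w - new_w) // 2
    top = (canvas_h - new_h) // 2
    return new_w, new_h, left, top
```

The returned offsets could then be used with, e.g., `Image.paste` to place the resized crop on a blank 512x128 canvas before calling `DiffTSR_sample`.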

Third, the DiffTSR model focuses on scene text images. We have not fully tested its performance in other scenarios, but it can easily be adapted to them with fine-tuning.

For more details, please refer to the main manuscript and the supplementary materials. Thanks for your interest, and we are also working on developing methods that are more adaptable.

uhSuiL commented 3 weeks ago

Thanks for your reply, and I appreciate your work.

I'm going to read your paper again and follow your suggestions to see whether the model works in my case. Please keep this issue open; I'll post my feedback here.