clovaai / donut

Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
https://arxiv.org/abs/2111.15664
MIT License

How to extract the entire text from a given image using a Donut pretrained model? #62

Open. alenma05 opened this issue 2 years ago

satheeshkatipomu commented 2 years ago

I think it is straightforward to get text output using the donut-base model. Load naver-clova-ix/donut-base from Hugging Face and use <s_synthdog> as the prompt.

from donut import DonutModel
import torch
from PIL import Image

# load the pre-trained (not fine-tuned) base model from the Hugging Face Hub
pretrained_model = DonutModel.from_pretrained("naver-clova-ix/donut-base")
if torch.cuda.is_available():
    # on GPU, run in half precision for speed and lower memory use
    pretrained_model.half()
    device = torch.device("cuda")
    pretrained_model.to(device)
else:
    # on CPU, cast the encoder to bfloat16, as in this repo's test script
    pretrained_model.encoder.to(torch.bfloat16)
pretrained_model.eval()

# donut-base was pre-trained on SynthDoG data, so use the synthdog task prompt
task_name = "synthdog"
task_prompt = f"<s_{task_name}>"

input_img = Image.open("./test_images/example1.png")
output = pretrained_model.inference(image=input_img, prompt=task_prompt)["predictions"][0]
print(output)
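
If I remember correctly, the prediction comes back as a JSON-like dict; for the synthdog prompt the recognized text should appear under a "text_sequence" key (that is the ground-truth format SynthDoG generates, so I expect the base model to follow it).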
alenma05 commented 2 years ago

Thank you for your help.

yagcaglar commented 1 year ago

Hi, thanks for the repo and the code. When I use this code on my dataset, I get dummy text from some of the documents, such as RT @ RT @ RT @ RT @ RT @, 0 0 0 0 0 0 0 0 0 0, ( ) ( ) ( ) ( ), or | | | | | | | | | |, where each output comes from a different image. Might it be related to the resolution of the input? Can I change something to prevent this?

gamingflexer commented 1 year ago
I am getting this issue on CPU:

[screenshot of the error traceback]

llStringll commented 1 year ago

Your model is running in fp16 (half-precision floating point), but the image you passed is a full-precision float. Either run the model in full precision or convert the image to a half-precision floating-point array before passing it in.
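
For example, a minimal sketch of both options, assuming the DonutModel API from the snippet above (in this repo, inference() also accepts a pre-computed image_tensors argument and the encoder exposes prepare_input()):

import torch
from PIL import Image
from donut import DonutModel

pretrained_model = DonutModel.from_pretrained("naver-clova-ix/donut-base")
input_img = Image.open("./test_images/example1.png")

# Option A: run everything in full precision on CPU
pretrained_model.to(torch.float32)
pretrained_model.eval()
output = pretrained_model.inference(image=input_img, prompt="<s_synthdog>")["predictions"][0]

# Option B: if the model stays in reduced precision, prepare the image
# tensor yourself and cast it to whatever dtype the model is using
image_tensor = pretrained_model.encoder.prepare_input(input_img).unsqueeze(0)
image_tensor = image_tensor.to(next(pretrained_model.parameters()).dtype)
output = pretrained_model.inference(image_tensors=image_tensor, prompt="<s_synthdog>")["predictions"][0]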

swarnalatapatra commented 1 year ago

I am getting an AttributeError. Can anyone please suggest a fix?

[screenshot of the AttributeError traceback]
llStringll commented 1 year ago

Can you please share the initialisation of the pretrained_model variable? Also, are you using this repo's implementation on top of SwinTransformer, or Hugging Face's? And please share the version of huggingface/transformers that you are using.

Your issue arises from this line: pos_drop was an attribute in earlier versions of timm's SwinTransformer, but it doesn't exist in current timm versions, so you can find ways around it fairly easily. I would recommend using the Hugging Face implementation purely; this repo also has some peculiar differences from the research paper, which cause unwanted issues.
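
For reference, a rough sketch of the pure Hugging Face route, following the transformers documentation for Donut. Using <s_synthdog> as the prompt for the base checkpoint is carried over from earlier in this thread, and I am assuming naver-clova-ix/donut-base ships the processor files that transformers needs:

import torch
from PIL import Image
from transformers import DonutProcessor, VisionEncoderDecoderModel

processor = DonutProcessor.from_pretrained("naver-clova-ix/donut-base")
model = VisionEncoderDecoderModel.from_pretrained("naver-clova-ix/donut-base")
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
model.eval()

image = Image.open("./test_images/example1.png").convert("RGB")
pixel_values = processor(image, return_tensors="pt").pixel_values.to(device)
decoder_input_ids = processor.tokenizer(
    "<s_synthdog>", add_special_tokens=False, return_tensors="pt"
).input_ids.to(device)

outputs = model.generate(
    pixel_values,
    decoder_input_ids=decoder_input_ids,
    max_length=model.decoder.config.max_position_embeddings,
    pad_token_id=processor.tokenizer.pad_token_id,
    eos_token_id=processor.tokenizer.eos_token_id,
    use_cache=True,
    bad_words_ids=[[processor.tokenizer.unk_token_id]],
)
# the decoded sequence still contains the task and special tokens
print(processor.batch_decode(outputs)[0])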

swarnalatapatra commented 1 year ago

I am using the below for the pretrained model, where task_name is docvqa:

parser = argparse.ArgumentParser()
parser.add_argument("--task", type=str, default="docvqa")
parser.add_argument("--pretrained_path", type=str, default="naver-clova-ix/donut-base")
args, left_argv = parser.parse_known_args()

task_name = args.task
if "docvqa" == task_name:
    task_prompt = "<s_docvqa><s_question>{user_input}</s_question><s_answer>"
else:  # rvlcdip, cord, ...
    task_prompt = f"<s_{task_name}>"

pretrained_model = DonutModel.from_pretrained(args.pretrained_path, ignore_mismatched_sizes=True)

if torch.cuda.is_available():
    pretrained_model.half()
    device = torch.device("cuda")
    pretrained_model.to(device)
else:
    pretrained_model.encoder.to(torch.float32)

pretrained_model.eval()

llStringll commented 1 year ago

What version of timm are you using? It is inconsistent with this repo; specifically, the SwinTransformer class in the timm module no longer has a pos_drop attribute (a dropout for position embeddings). Please try a timm version consistent with this repo's README. Ensure at least this setup for timm and its immediate dependencies:

torch == 1.11.0+cu113
torchvision == 0.12.0+cu113
pytorch-lightning == 1.6.4
transformers == 4.11.3
timm == 0.5.4
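
A quick way to check what is actually installed in your environment:

import torch
import transformers
import timm
print(torch.__version__, transformers.__version__, timm.__version__)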

llStringll commented 1 year ago

If you look here, you will see that just last month they refactored these attributes, including the pos_drop attribute, and your timm version is very likely quite recent. Downgrade it to something like 0.5.4 and it should work fine.
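
For example, with pip: pip install timm==0.5.4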

aindilis commented 1 year ago

Your model is running in fp16 (half-precision floating point), but the image you passed is a full-precision float. Either run the model in full precision or convert the image to a half-precision floating-point array before passing it in.

I am new to Python, but after much tinkering I was unable to figure out how to convert the image to a half-precision floating-point array. I would greatly appreciate it if someone could post the code for the conversion here.