Veason-silverbullet / ViTLP

[NAACL 2024] Visually Guided Generative Text-Layout Pre-training for Document Intelligence
MIT License

Can we finetune or train this on our own data? #1

Open sumairrasi opened 5 months ago

sumairrasi commented 5 months ago

It's really great work. I have a question: I tested the model on some complex document images, and it returns all of the text. What if I only want the text at specific positions in my image?

Veason-silverbullet commented 5 months ago

@sumairrasi That is an unexpected result; bboxes should come with the text. Could you provide some of your test images? I want to test them on my server and try to figure out the issue. Also, I am very busy these days and may not be able to reply until June 17th.

Veason-silverbullet commented 5 months ago

@sumairrasi Of course, you can finetune ViTLP with text-bbox sequences in the correct format.

Veason-silverbullet commented 4 months ago

@sumairrasi May I have one of your test images so I can check the model prediction? Many thanks for your help.

sumairrasi commented 4 months ago

Sure, here it is. My question: I have a collection of images of this type. How can I detect only the customer name and the total balance?

[Image: receipt_3_]

Veason-silverbullet commented 4 months ago

@sumairrasi I tested the image and the result is as expected. Below are the test scripts:

pip install -r requirements.txt
mkdir -p ckpts/ViTLP-medium
git clone https://huggingface.co/veason/ViTLP-medium ckpts/ViTLP-medium
python ocr.py

The decoded results and visualization are as follows.

{
    {"text": "Example", "bbox": [54, 64, 116, 84]},
    {"text": "Business", "bbox": [120, 64, 185, 81]},
    {"text": "Name", "bbox": [190, 64, 232, 81]},
    {"text": "or", "bbox": [237, 68, 252, 81]},
    {"text": "Business", "bbox": [256, 64, 321, 81]},
    {"text": "Owner", "bbox": [325, 64, 373, 81]},
    {"text": "INVOICE", "bbox": [726, 66, 926, 100]},
    {"text": "123", "bbox": [54, 85, 80, 101]},
    {"text": "Example", "bbox": [85, 84, 147, 104]},
    {"text": "Business", "bbox": [152, 84, 217, 101]},
    {"text": "Address", "bbox": [221, 84, 280, 101]},
    {"text": "Boston,", "bbox": [54, 103, 108, 122]},
    {"text": "MA", "bbox": [112, 103, 137, 120]},
    {"text": "02135", "bbox": [140, 103, 186, 120]},
    {"text": "Example", "bbox": [54, 276, 116, 296]},
    {"text": "Customer.", "bbox": [120, 276, 192, 293]},
    {"text": "Name", "bbox": [195, 276, 238, 292]},
    {"text": "Invoice", "bbox": [727, 276, 782, 293]},
    {"text": "#", "bbox": [786, 276, 796, 293]},
    {"text": "123456", "bbox": [875, 278, 928, 295]},
    {"text": "100", "bbox": [56, 302, 80, 314]},
    {"text": "Exampre", "bbox": [85, 301, 147, 316]},
    {"text": "ouestoffner", "bbox": [153, 300, 221, 313]},
    {"text": "Address", "bbox": [225, 299, 284, 313]},
    {"text": "Boston,", "bbox": [54, 315, 108, 334]},
    {"text": "MA", "bbox": [112, 315, 137, 332]},
    {"text": "02135", "bbox": [141, 315, 186, 332]},
    {"text": "Invoice", "bbox": [700, 322, 755, 339]},
    {"text": "Date", "bbox": [760, 322, 794, 339]},
    {"text": "08/19/2020", "bbox": [847, 323, 928, 339]},
    {"text": "Due", "bbox": [725, 367, 755, 384]},
    {"text": "Date", "bbox": [760, 367, 794, 384]},
    {"text": "09/19/2020", "bbox": [847, 367, 928, 384]},
    {"text": "Item", "bbox": [69, 463, 104, 481]},
    {"text": "Description", "bbox": [170, 463, 264, 485]},
    {"text": "Unit", "bbox": [602, 463, 636, 481]},
    {"text": "Price", "bbox": [641, 463, 683, 481]},
    {"text": "Quantity", "bbox": [730, 463, 800, 485]},
    {"text": "Amount", "bbox": [853, 463, 919, 481]},
    {"text": "Service", "bbox": [66, 516, 115, 532]},
    {"text": "Example", "bbox": [170, 516, 227, 535]},
    {"text": "of", "bbox": [231, 516, 244, 532]},
    {"text": "service", "bbox": [248, 516, 294, 532]},
    {"text": "in", "bbox": [298, 516, 310, 532]},
    {"text": "industry", "bbox": [314, 516, 366, 535]},
    {"text": "25.00", "bbox": [644, 516, 682, 532]},
    {"text": "4.00", "bbox": [770, 516, 799, 532]},
    {"text": "100.00", "bbox": [876, 516, 921, 532]},
    {"text": "Product", "bbox": [66, 562, 117, 578]},
    {"text": "Example", "bbox": [170, 562, 227, 581]},
    {"text": "of", "bbox": [231, 562, 244, 578]},
    {"text": "product", "bbox": [248, 562, 298, 581]},
    {"text": "in", "bbox": [301, 563, 312, 578]},
    {"text": "industry", "bbox": [316, 562, 369, 581]},
    {"text": "500.00", "bbox": [636, 563, 682, 578]},
    {"text": "1.00", "bbox": [770, 563, 799, 578]},
    {"text": "500.00", "bbox": [875, 563, 921, 578]},
    {"text": "Discount", "bbox": [66, 607, 124, 623]},
    {"text": "Example", "bbox": [170, 607, 227, 627]},
    {"text": "of", "bbox": [231, 607, 244, 623]},
    {"text": "discount", "bbox": [247, 607, 303, 623]},
    {"text": "in", "bbox": [307, 608, 318, 623]},
    {"text": "industry", "bbox": [323, 607, 375, 626]},
    {"text": "-100.00", "bbox": [631, 608, 682, 623]},
    {"text": "1.00", "bbox": [770, 608, 799, 623]},
    {"text": "-100.00", "bbox": [870, 608, 921, 623]},
    {"text": "NOTES:", "bbox": [66, 745, 125, 762]},
    {"text": "Provide", "bbox": [131, 745, 184, 762]},
    {"text": "a", "bbox": [189, 749, 198, 762]},
    {"text": "concise,", "bbox": [202, 746, 260, 765]},
    {"text": "professional", "bbox": [266, 746, 352, 765]},
    {"text": "description", "bbox": [356, 746, 434, 765]},
    {"text": "of", "bbox": [438, 745, 452, 762]},
    {"text": "the", "bbox": [456, 746, 478, 762]},
    {"text": "services,", "bbox": [483, 746, 545, 764]},
    {"text": "product,", "bbox": [551, 746, 608, 765]},
    {"text": "and", "bbox": [613, 746, 639, 762]},
    {"text": "discount", "bbox": [644, 746, 704, 762]},
    {"text": "listed", "bbox": [708, 745, 746, 762]},
    {"text": "above.", "bbox": [750, 746, 797, 762]},
    {"text": "Subtotal", "bbox": [613, 821, 678, 838]},
    {"text": "600.00", "bbox": [872, 821, 921, 838]},
    {"text": "Total", "bbox": [613, 860, 652, 877]},
    {"text": "500.00", "bbox": [872, 861, 921, 877]},
    {"text": "Amount", "bbox": [613, 893, 674, 910]},
    {"text": "Paid", "bbox": [679, 893, 712, 910]},
    {"text": "0.00", "bbox": [889, 894, 921, 910]},
    {"text": "Balance", "bbox": [613, 936, 675, 953]},
    {"text": "Due", "bbox": [680, 936, 710, 953]},
    {"text": "$500.00", "bbox": [862, 936, 921, 954]}
}
[Image: Screenshot 2024-06-20 at 4 17 28 PM]

It seems that everything is fine. Could you please follow the test script above and try again?
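On the original question (keeping only fields such as the customer name and the balance), one option is to post-process the word-level output rather than change the model. The following is a minimal sketch under stated assumptions, not part of ViTLP: the function names, the `y_tol` tolerance, and the example region are all hypothetical, and the sample words are copied from the decoded result above. It groups words into lines by their top edge, returns whatever follows a label on the same line, and separately selects words whose center falls inside a fixed region (useful for fields like the customer name that have no label).

```python
from typing import Any, Dict, List, Optional, Sequence

Word = Dict[str, Any]  # {"text": str, "bbox": [x1, y1, x2, y2]}


def group_into_lines(words: List[Word], y_tol: int = 8) -> List[List[Word]]:
    """Group words whose top edges lie within y_tol pixels of the line's first word."""
    lines: List[List[Word]] = []
    for w in sorted(words, key=lambda w: (w["bbox"][1], w["bbox"][0])):
        if lines and abs(w["bbox"][1] - lines[-1][0]["bbox"][1]) <= y_tol:
            lines[-1].append(w)
        else:
            lines.append([w])
    for line in lines:
        line.sort(key=lambda w: w["bbox"][0])  # left-to-right within a line
    return lines


def value_after_label(words: List[Word], label: str) -> Optional[str]:
    """Return the text that follows `label` on the same line, or None."""
    label_tokens = label.lower().split()
    n = len(label_tokens)
    for line in group_into_lines(words):
        texts = [str(w["text"]).lower() for w in line]
        for i in range(len(texts) - n + 1):
            if texts[i:i + n] == label_tokens:
                rest = [str(w["text"]) for w in line[i + n:]]
                return " ".join(rest) if rest else None
    return None


def text_in_region(words: List[Word], region: Sequence[int]) -> str:
    """Join the words whose bbox center lies inside region = (x1, y1, x2, y2)."""
    rx1, ry1, rx2, ry2 = region
    hits = []
    for w in words:
        x1, y1, x2, y2 = w["bbox"]
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        if rx1 <= cx <= rx2 and ry1 <= cy <= ry2:
            hits.append(w)
    hits.sort(key=lambda w: (w["bbox"][1], w["bbox"][0]))
    return " ".join(str(w["text"]) for w in hits)


# A few entries copied from the decoded result above.
words = [
    {"text": "Example", "bbox": [54, 276, 116, 296]},
    {"text": "Customer.", "bbox": [120, 276, 192, 293]},
    {"text": "Name", "bbox": [195, 276, 238, 292]},
    {"text": "Subtotal", "bbox": [613, 821, 678, 838]},
    {"text": "600.00", "bbox": [872, 821, 921, 838]},
    {"text": "Total", "bbox": [613, 860, 652, 877]},
    {"text": "500.00", "bbox": [872, 861, 921, 877]},
    {"text": "Balance", "bbox": [613, 936, 675, 953]},
    {"text": "Due", "bbox": [680, 936, 710, 953]},
    {"text": "$500.00", "bbox": [862, 936, 921, 954]},
]

balance_due = value_after_label(words, "Balance Due")
total = value_after_label(words, "Total")
# The region below is a hand-picked, hypothetical template for this layout.
customer = text_in_region(words, (50, 270, 400, 298))
```

Label matching only works when the field has a printed label; for a fixed template, the region-based lookup is usually simpler, but the region has to be chosen per layout.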

qutrino commented 4 months ago

I also encountered a similar issue and would like to make some modifications and retrain the model from scratch. Inspired by this model, omniparser, and UNITS, I am thinking of replacing the "loc" token with a center point token and pretraining the model to control the ROI.

cxn, cyn, minx, miny, maxx, maxy ~ from <0> to <1000>,

- For the full region: 
<bos> <0>, <0>, <1000>, <1000>, <start> cx1, cy1, w11, w12, cx2, cy2, w2, ...

- For the full region and continued generation:
<bos> <0>, <0>, <1000>, <1000>, <continued> cxk, cyk, wk, (<-prompt , output-> ) ...

- For ROI: 
<bos> minx, miny, maxx, maxy <start> cxj, cyj, wj, ...

- For ROI and continued generation: 
<bos> minx, miny, maxx, maxy <continued> cxj+n, cyj+n, wj+n, ...
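
To make the proposed layouts concrete, here is a rough sketch of how such a sequence could be assembled from word-level annotations. It is only an illustration under several assumptions: `quantize` and `build_sequence` are hypothetical names, each word is kept as a single string rather than split into subword tokens, and the `continued` flag only switches the marker token (it does not model the prompt/output split). None of this is ViTLP code.

```python
from typing import List, Optional, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def quantize(value: float, size: float, bins: int = 1000) -> int:
    """Map a pixel coordinate to an integer bin in [0, bins]."""
    return max(0, min(bins, round(value * bins / size)))


def build_sequence(
    words: Sequence[Tuple[str, Box]],
    image_size: Tuple[int, int],
    roi: Optional[Box] = None,
    continued: bool = False,
) -> List[str]:
    """Build <bos> minx miny maxx maxy <start|continued> cx cy word ... as a token list."""
    width, height = image_size
    if roi is None:
        roi = (0, 0, width, height)  # full region
    minx, miny, maxx, maxy = roi
    tokens = ["<bos>"]
    tokens += [
        f"<{quantize(minx, width)}>",
        f"<{quantize(miny, height)}>",
        f"<{quantize(maxx, width)}>",
        f"<{quantize(maxy, height)}>",
    ]
    tokens.append("<continued>" if continued else "<start>")
    for text, (x1, y1, x2, y2) in words:
        cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
        # Keep only words whose center point falls inside the ROI.
        if not (minx <= cx <= maxx and miny <= cy <= maxy):
            continue
        tokens += [f"<{quantize(cx, width)}>", f"<{quantize(cy, height)}>", text]
    return tokens


# Hypothetical annotations on a 1000x1000 image.
sample = [("Total", (610, 860, 650, 880)), ("Name", (190, 60, 230, 80))]
full_seq = build_sequence(sample, (1000, 1000))
roi_seq = build_sequence(sample, (1000, 1000), roi=(600, 800, 1000, 1000))
```

With the ROI set to the lower-right area, only "Total" survives the center-point filter, which is the kind of region control the prefix is meant to give.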

Veason-silverbullet commented 4 months ago

Yes, this idea could work. ViTLP currently focuses only on grounding capability, whereas the referring capability you describe should also be developed. Full localization should include both, i.e., localization = grounding + referring.

sbernabel commented 3 months ago

Thanks for your great work! I'm trying to finetune your model on my dataset to do OCR and localization. I have bounding boxes and texts, but what else is needed, and what steps should I follow, to finetune ViTLP?

Veason-silverbullet commented 3 months ago

Hi @sbernabel, thanks for your interest.

I am busy for the next two weeks, so I plan to prepare dataset samples (and maybe the trainer code) over the next two weekends.

Veason-silverbullet commented 3 months ago

Hi @sumairrasi, I've prepared the finetuning code at https://github.com/Veason-silverbullet/ViTLP/tree/main/finetuning. Please check it out.