sumairrasi opened 5 months ago
@sumairrasi That is an unexpected result, bboxes should come with text. Could you provide some of your test images? I want to test them on my server and try to figure out the issue. Also, I am very busy these days and I may not be able to reply to you until June 17th.
@sumairrasi Of course, you can finetune ViTLP with text-bbox sequences of correct format.
@sumairrasi May I have one of your test images so I can check the model prediction? Many thanks for your help.
Sure, here it is. My question: I have a collection of these types of images. How can I detect only the customer name and total balance?
@sumairrasi I tested the image and the result is as expected. Below are the test scripts:
```shell
pip install -r requirements.txt
mkdir -p ckpts/ViTLP-medium
git clone https://huggingface.co/veason/ViTLP-medium ckpts/ViTLP-medium
python ocr.py
```
The decoded results and visualization are as follows.
```json
[{"text": "Example", "bbox": [54, 64, 116, 84]}, {"text": "Business", "bbox": [120, 64, 185, 81]}, {"text": "Name", "bbox": [190, 64, 232, 81]}, {"text": "or", "bbox": [237, 68, 252, 81]}, {"text": "Business", "bbox": [256, 64, 321, 81]}, {"text": "Owner", "bbox": [325, 64, 373, 81]}, {"text": "INVOICE", "bbox": [726, 66, 926, 100]},
 {"text": "123", "bbox": [54, 85, 80, 101]}, {"text": "Example", "bbox": [85, 84, 147, 104]}, {"text": "Business", "bbox": [152, 84, 217, 101]}, {"text": "Address", "bbox": [221, 84, 280, 101]},
 {"text": "Boston,", "bbox": [54, 103, 108, 122]}, {"text": "MA", "bbox": [112, 103, 137, 120]}, {"text": "02135", "bbox": [140, 103, 186, 120]},
 {"text": "Example", "bbox": [54, 276, 116, 296]}, {"text": "Customer.", "bbox": [120, 276, 192, 293]}, {"text": "Name", "bbox": [195, 276, 238, 292]}, {"text": "Invoice", "bbox": [727, 276, 782, 293]}, {"text": "#", "bbox": [786, 276, 796, 293]}, {"text": "123456", "bbox": [875, 278, 928, 295]},
 {"text": "100", "bbox": [56, 302, 80, 314]}, {"text": "Exampre", "bbox": [85, 301, 147, 316]}, {"text": "ouestoffner", "bbox": [153, 300, 221, 313]}, {"text": "Address", "bbox": [225, 299, 284, 313]},
 {"text": "Boston,", "bbox": [54, 315, 108, 334]}, {"text": "MA", "bbox": [112, 315, 137, 332]}, {"text": "02135", "bbox": [141, 315, 186, 332]}, {"text": "Invoice", "bbox": [700, 322, 755, 339]}, {"text": "Date", "bbox": [760, 322, 794, 339]}, {"text": "08/19/2020", "bbox": [847, 323, 928, 339]},
 {"text": "Due", "bbox": [725, 367, 755, 384]}, {"text": "Date", "bbox": [760, 367, 794, 384]}, {"text": "09/19/2020", "bbox": [847, 367, 928, 384]},
 {"text": "Item", "bbox": [69, 463, 104, 481]}, {"text": "Description", "bbox": [170, 463, 264, 485]}, {"text": "Unit", "bbox": [602, 463, 636, 481]}, {"text": "Price", "bbox": [641, 463, 683, 481]}, {"text": "Quantity", "bbox": [730, 463, 800, 485]}, {"text": "Amount", "bbox": [853, 463, 919, 481]},
 {"text": "Service", "bbox": [66, 516, 115, 532]}, {"text": "Example", "bbox": [170, 516, 227, 535]}, {"text": "of", "bbox": [231, 516, 244, 532]}, {"text": "service", "bbox": [248, 516, 294, 532]}, {"text": "in", "bbox": [298, 516, 310, 532]}, {"text": "industry", "bbox": [314, 516, 366, 535]}, {"text": "25.00", "bbox": [644, 516, 682, 532]}, {"text": "4.00", "bbox": [770, 516, 799, 532]}, {"text": "100.00", "bbox": [876, 516, 921, 532]},
 {"text": "Product", "bbox": [66, 562, 117, 578]}, {"text": "Example", "bbox": [170, 562, 227, 581]}, {"text": "of", "bbox": [231, 562, 244, 578]}, {"text": "product", "bbox": [248, 562, 298, 581]}, {"text": "in", "bbox": [301, 563, 312, 578]}, {"text": "industry", "bbox": [316, 562, 369, 581]}, {"text": "500.00", "bbox": [636, 563, 682, 578]}, {"text": "1.00", "bbox": [770, 563, 799, 578]}, {"text": "500.00", "bbox": [875, 563, 921, 578]},
 {"text": "Discount", "bbox": [66, 607, 124, 623]}, {"text": "Example", "bbox": [170, 607, 227, 627]}, {"text": "of", "bbox": [231, 607, 244, 623]}, {"text": "discount", "bbox": [247, 607, 303, 623]}, {"text": "in", "bbox": [307, 608, 318, 623]}, {"text": "industry", "bbox": [323, 607, 375, 626]}, {"text": "-100.00", "bbox": [631, 608, 682, 623]}, {"text": "1.00", "bbox": [770, 608, 799, 623]}, {"text": "-100.00", "bbox": [870, 608, 921, 623]},
 {"text": "NOTES:", "bbox": [66, 745, 125, 762]}, {"text": "Provide", "bbox": [131, 745, 184, 762]}, {"text": "a", "bbox": [189, 749, 198, 762]}, {"text": "concise,", "bbox": [202, 746, 260, 765]}, {"text": "professional", "bbox": [266, 746, 352, 765]}, {"text": "description", "bbox": [356, 746, 434, 765]}, {"text": "of", "bbox": [438, 745, 452, 762]}, {"text": "the", "bbox": [456, 746, 478, 762]}, {"text": "services,", "bbox": [483, 746, 545, 764]}, {"text": "product,", "bbox": [551, 746, 608, 765]}, {"text": "and", "bbox": [613, 746, 639, 762]}, {"text": "discount", "bbox": [644, 746, 704, 762]}, {"text": "listed", "bbox": [708, 745, 746, 762]}, {"text": "above.", "bbox": [750, 746, 797, 762]},
 {"text": "Subtotal", "bbox": [613, 821, 678, 838]}, {"text": "600.00", "bbox": [872, 821, 921, 838]},
 {"text": "Total", "bbox": [613, 860, 652, 877]}, {"text": "500.00", "bbox": [872, 861, 921, 877]},
 {"text": "Amount", "bbox": [613, 893, 674, 910]}, {"text": "Paid", "bbox": [679, 893, 712, 910]}, {"text": "0.00", "bbox": [889, 894, 921, 910]},
 {"text": "Balance", "bbox": [613, 936, 675, 953]}, {"text": "Due", "bbox": [680, 936, 710, 953]}, {"text": "$500.00", "bbox": [862, 936, 921, 954]}]
```
It seems everything is fine. Could you please follow the above test script and try again?
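If you only need specific fields (e.g., the customer name or the balance due), one option is to post-process the decoded word list rather than change the model. Below is a minimal sketch, assuming the decoded output is a Python list of `{"text", "bbox"}` dicts like the one above; the `find_field_value` helper and the keyword-anchoring heuristic are illustrative, not part of ViTLP:

```python
# Filter ViTLP decoded words for a specific field by anchoring on a keyword
# and collecting words on roughly the same line to its right.
# `words` is an illustrative subset of the decoded output shown above.
words = [
    {"text": "Example", "bbox": [54, 276, 116, 296]},
    {"text": "Customer.", "bbox": [120, 276, 192, 293]},
    {"text": "Name", "bbox": [195, 276, 238, 292]},
    {"text": "Balance", "bbox": [613, 936, 675, 953]},
    {"text": "Due", "bbox": [680, 936, 710, 953]},
    {"text": "$500.00", "bbox": [862, 936, 921, 954]},
]

def find_field_value(words, anchor_tokens, max_dy=10):
    """Find the last word matching the final anchor token, then return
    the words to its right whose vertical centers are within max_dy."""
    anchor = None
    for w in words:
        if w["text"].strip(".:").lower() == anchor_tokens[-1].lower():
            anchor = w
    if anchor is None:
        return []
    ay = (anchor["bbox"][1] + anchor["bbox"][3]) / 2
    return [w for w in words
            if w["bbox"][0] > anchor["bbox"][2]
            and abs((w["bbox"][1] + w["bbox"][3]) / 2 - ay) <= max_dy]

balance = find_field_value(words, ["Balance", "Due"])
print([w["text"] for w in balance])  # ['$500.00']
```

This keyword-plus-geometry heuristic works when field labels are consistent across the invoice template; for varied layouts, a key-value extraction step would be more robust.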
I also encountered a similar issue and would like to make some modifications and retrain the model from scratch. Inspired by this model, omniparser, and UNITS, I am thinking of replacing the "loc" token with a center-point token and pretraining the model to control the ROI.
cxn, cyn, minx, miny, maxx, maxy range from <0> to <1000>:
- For the full region:
<bos> <0>, <0>, <1000>, <1000>, <start> cx1, cy1, w1, cx2, cy2, w2, ...
- For the full region and continued generation:
<bos> <0>, <0>, <1000>, <1000>, <continued> cxk, cyk, wk, (<-prompt , output-> ) ...
- For ROI:
<bos> minx, miny, maxx, maxy <start> cxj, cyj, wj, ...
- For ROI and continued generation:
<bos> minx, miny, maxx, maxy <continued> cxj+n, cyj+n, wj+n, ...
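To make the proposed scheme concrete, here is a rough sketch of how such sequences could be assembled; the 1000-bin quantization and the token spellings (`<bos>`, `<start>`, `<continued>`) follow the description above, while the helper names and exact vocabulary are assumptions for illustration:

```python
# Sketch: quantize pixel coordinates into <0>..<1000> bins and build the
# proposed center-point token sequence for a full page or an ROI.
def quantize(v, size, bins=1000):
    """Map a pixel coordinate to an integer bin in [0, bins]."""
    return min(bins, max(0, round(v * bins / size)))

def build_sequence(words, img_w, img_h, roi=None, continued=False):
    """words: list of (word, (x0, y0, x1, y1)) in pixel coordinates."""
    if roi is None:
        region = ["<0>", "<0>", "<1000>", "<1000>"]  # full page
    else:
        region = [f"<{quantize(c, s)}>"
                  for c, s in zip(roi, (img_w, img_h, img_w, img_h))]
    tokens = ["<bos>"] + region + ["<continued>" if continued else "<start>"]
    for word, (x0, y0, x1, y1) in words:
        cx = quantize((x0 + x1) / 2, img_w)  # center-point x bin
        cy = quantize((y0 + y1) / 2, img_h)  # center-point y bin
        tokens += [f"<{cx}>", f"<{cy}>", word]
    return tokens

seq = build_sequence([("INVOICE", (726, 66, 926, 100))], img_w=1000, img_h=1000)
print(seq)
```

For the continued-generation case, the prefix up to and including `<continued>` would serve as the prompt, and the model would be trained to emit the remaining center-point/word triples as output.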
Yes, this idea could work. ViTLP focuses only on the grounding capability, while the referring capability mentioned above should also be developed. Full localization capabilities should include both, i.e., localization = grounding + referring.
Thanks for your great work! I'm trying to finetune your model on my dataset for OCR and localization. I have bounding boxes and texts, but what else is needed to finetune ViTLP?
Hi, @sbernabel . Thanks for your attention.
For datasets, I recommend referring to https://github.com/Veason-silverbullet/ViTLP/blob/main/dataset/pretrain.py to arrange the fine-tuning dataset.
For the trainer, I recommend the settings listed in our paper.
Since I am busy these two weeks, I plan to arrange dataset samples (and maybe trainer codes) in the next two weekends.
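In the meantime, a fine-tuning example is essentially one record per page pairing an image with its word-bbox sequence. The sketch below is a hedged guess at a reasonable layout; the field names, JSON layout, and normalization to a 0-1000 grid are assumptions for illustration, so consult `dataset/pretrain.py` in the repo for the authoritative format:

```python
# Sketch of arranging one document page as a fine-tuning record.
# Field names and the 0-1000 coordinate normalization are illustrative
# assumptions, not the confirmed ViTLP data format.
import json

def make_record(image_path, words, img_w, img_h, bins=1000):
    """words: list of (text, (x0, y0, x1, y1)) in pixel coordinates."""
    norm = lambda v, s: min(bins, max(0, round(v * bins / s)))
    return {
        "image": image_path,
        "words": [
            {"text": t,
             "bbox": [norm(x0, img_w), norm(y0, img_h),
                      norm(x1, img_w), norm(y1, img_h)]}
            for t, (x0, y0, x1, y1) in words
        ],
    }

record = make_record("invoice_001.png", [("INVOICE", (726, 66, 926, 100))], 1000, 1000)
print(json.dumps(record))
```

One such record per line (JSONL) is a common way to feed a dataset loader like the one in `dataset/pretrain.py`, but verify the expected schema against that file.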
Hi @sumairrasi , I've prepared the finetuning code at https://github.com/Veason-silverbullet/ViTLP/tree/main/finetuning. Please check it out.
It's really great work. I have a doubt: I have some complex document images and tested this on them. It gives all the text results; what if I want only some specific text positions in my image?