allenai / vila

Incorporating VIsual LAyout Structures for Scientific Text Classification
Apache License 2.0

Increase the robustness for "large" PDFs #16

Closed lolipopshock closed 2 years ago

lolipopshock commented 2 years ago

This fix improves the robustness of the VILA library for "large" PDFs: those whose page width or height is more than 1000 and which contain tokens with bounding box dimensions larger than 1000. Such input breaks the 2D position encoding used in the base Transformer models, which is fundamentally a lookup table (bbox dimension value -> some embedding values) that only accepts inputs in the range 0~1000.
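To make the failure mode concrete, here is a minimal pure-Python sketch of a position lookup table. The table size of 1001 rows, the one-element "embedding" rows, and the helper names `lookup` and `can_encode` are assumptions for illustration only; they are not the actual model code.

```python
# Hypothetical stand-in for a 2D position embedding table.
# Assumption: one row per integer coordinate value 0..1000.
TABLE_SIZE = 1001
position_table = [[float(i)] for i in range(TABLE_SIZE)]


def lookup(coord: int) -> list:
    """Return the (toy) embedding row for one bbox coordinate."""
    return position_table[coord]


def can_encode(coord: int) -> bool:
    """Report whether a coordinate falls inside the lookup table."""
    try:
        lookup(coord)
        return True
    except IndexError:
        return False


print(can_encode(1000))  # coordinate at the upper bound of the table
print(can_encode(2304))  # coordinate from a poster-sized page
```

A coordinate taken from a page wider than 1000 indexes past the end of the table, which is why the coordinates have to be rescaled before they reach the model.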

I added a normalization step to solve this issue. When the input PDF page is "large" (i.e., either page_width > 1000 or page_height > 1000), all the tokens on that page are normalized with the normalize_bbox function, which converts their dimensions to the range 0~1000.
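The normalization described above can be sketched roughly as follows. This is not the library's normalize_bbox implementation: the helper names `scale_bbox` and `normalize_page_bboxes`, the independent scaling of each axis, and the rounding and clamping to integers are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

MAX_COORD = 1000  # upper bound accepted by the position lookup, per the description above

BBox = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def scale_bbox(bbox: BBox, page_width: float, page_height: float) -> Tuple[int, int, int, int]:
    """Rescale one bbox so that its coordinates fall in 0..MAX_COORD (hypothetical helper)."""
    x1, y1, x2, y2 = bbox

    def to_range(value: float, page_extent: float) -> int:
        # Assumption: scale each axis independently, then round and clamp.
        scaled = round(value * MAX_COORD / page_extent)
        return min(MAX_COORD, max(0, scaled))

    return (
        to_range(x1, page_width),
        to_range(y1, page_height),
        to_range(x2, page_width),
        to_range(y2, page_height),
    )


def normalize_page_bboxes(bboxes: Sequence[BBox], page_size: Tuple[float, float]) -> List[tuple]:
    """Normalize all bboxes on a page only when the page is "large"."""
    page_width, page_height = page_size
    if page_width <= MAX_COORD and page_height <= MAX_COORD:
        return list(bboxes)  # small pages are passed through unchanged
    return [scale_bbox(b, page_width, page_height) for b in bboxes]


# A token covering the lower-right quadrant of a poster-sized page.
print(normalize_page_bboxes([(1152.0, 1224.0, 2304.0, 2448.0)], (2304.0, 2448.0)))
```

Pages that already fit within 1000 in both dimensions are left untouched, so only inputs that would otherwise break the lookup are rescaled.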

However, this solution is not perfect: our models haven't been appropriately tuned for such large PDFs. Ideally, we should retrain the models with normalized inputs.

The fix introduces one API change: the page size must now be passed to the predict function.

import layoutparser as lp # For visualization 

from vila.pdftools.pdf_extractor import PDFExtractor
from vila.predictors import HierarchicalPDFPredictor
# Choose from SimplePDFPredictor,
# LayoutIndicatorPDFPredictor, 
# and HierarchicalPDFPredictor

pdf_extractor = PDFExtractor("pdfplumber")
page_tokens, page_images = pdf_extractor.load_tokens_and_image("path-to-your.pdf")

vision_model = lp.EfficientDetLayoutModel("lp://PubLayNet") 
pdf_predictor = HierarchicalPDFPredictor.from_pretrained("allenai/hvila-row-layoutlm-finetuned-docbank")

for idx, page_token in enumerate(page_tokens):
    blocks = vision_model.detect(page_images[idx])
    page_token.annotate(blocks=blocks)
    pdf_data = page_token.to_pagedata().to_dict()
    predicted_tokens = pdf_predictor.predict(pdf_data, page_token.page_size)  # <-- the page size must now be passed to predict
    lp.draw_box(page_images[idx], predicted_tokens, box_width=0, box_alpha=0.25)

lolipopshock commented 2 years ago

VILA now works for large PDFs such as posters: in the example below, page_size is (2304.0, 2448.0).

image

lolipopshock commented 2 years ago

For future reference, when this is merged, we'll also release v0.3.0 of vila because of the API change.