NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

Recreating DocVQA results for LayoutLMv2 #49

Open ArmiNouri opened 2 years ago

ArmiNouri commented 2 years ago

Related issue on the unilm repo.

I'm trying to recreate the results reported in the LayoutLMv2 paper, Table 6, row 7. Following this example, I've fine-tuned the base model on the DocVQA training set for 20 epochs. The resulting model under-performs compared to what's reported in the paper (roughly 40% of answers default to [CLS]). I'm wondering whether:

NielsRogge commented 2 years ago

Hi!

The number of epochs was set arbitrarily, for demo purposes only.

Apparently, the Microsoft authors used a couple of tricks (which they didn't share) in order to come up with the results on DocVQA as reported in the paper.

I personally also wonder how they managed to get such a high score, as LayoutLMv2 requires an external OCR engine, which would work quite badly on handwritten documents. However, with new models such as TrOCR, this might become easier.
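
For reference, a minimal sketch of running TrOCR on a handwritten text line with the public microsoft/trocr-base-handwritten checkpoint; the image path is hypothetical and this is not code from the DocVQA notebook:

from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

# public checkpoint fine-tuned on handwritten text lines (IAM)
processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-handwritten")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-handwritten")

# "line.png" is a hypothetical crop of a single handwritten text line
image = Image.open("line.png").convert("RGB")
pixel_values = processor(images=image, return_tensors="pt").pixel_values
generated_ids = model.generate(pixel_values)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)

Note that TrOCR operates on single text lines, so a full document would still need line detection before recognition.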

ArmiNouri commented 2 years ago

Thank you for the quick reply. I've followed up with the authors and will share if I find anything out. Your notebooks really helped and are a great resource. Thank you.

anupamadeo commented 2 years ago

While implementing LayoutLMv2 for DocVQA, I am not able to use LayoutLMv2FeatureExtractor to create dataset_with_ocr. I am getting the following error: ArrowNotImplementedError: Unsupported cast from list<item: list<item: list>> to utf8 using function cast_string. I don't understand what it means or why it is happening. Please help.

ArmiNouri commented 2 years ago

@anupamadeo If I recall correctly, I had the same issue. It was happening because the mapping function was trying to recast the image column to a new type. What helped me was writing to a temporary new column (images) and casting it back to image at the end of the process:

from PIL import Image
from transformers import LayoutLMv2FeatureExtractor

feature_extractor = LayoutLMv2FeatureExtractor()

def get_ocr_words_and_boxes(examples):
    # get a batch of document images
    images = [Image.open(root_dir + image_file).convert("RGB") for image_file in examples['image']]
    # resize every image to 224x224 + apply Tesseract to get words + normalized boxes
    encoded_inputs = feature_extractor(images)
    # write the pixel values to a temporary column so the original string-typed
    # 'image' column isn't overwritten (which is what triggers the Arrow cast error)
    examples['images'] = encoded_inputs.pixel_values
    examples['words'] = encoded_inputs.words
    examples['boxes'] = encoded_inputs.boxes
    return examples

dataset_with_ocr = dataset.map(get_ocr_words_and_boxes, batched=True, batch_size=10)
# move the temporary column back to 'image' once the mapping is done
dataset_with_ocr = dataset_with_ocr.map(lambda example: {'image': example['images']}, remove_columns=['images'])

anupamadeo commented 2 years ago

Thanks for such a quick reply. It solved my problem.

anupamadeo commented 2 years ago

Hi, is there any way to train the tokenizer in LayoutLMv2 for domain-specific vocabulary?
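
This didn't get an answer in the thread; a minimal sketch of one common approach (an assumption, not something from the notebook) is to add domain-specific tokens to the existing tokenizer and resize the model's embedding matrix rather than retraining the tokenizer from scratch. The token list below is hypothetical:

from transformers import LayoutLMv2Tokenizer, LayoutLMv2ForQuestionAnswering

tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForQuestionAnswering.from_pretrained("microsoft/layoutlmv2-base-uncased")

# hypothetical domain-specific terms that would otherwise be split into subwords
new_tokens = ["troponin", "echocardiogram"]
num_added = tokenizer.add_tokens(new_tokens)

# enlarge the embedding matrix; the new rows are randomly initialized
# and get learned during fine-tuning
model.resize_token_embeddings(len(tokenizer))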

sujit420 commented 2 years ago

Has anyone reached the scores reported by Microsoft for LayoutLMv2 on DocVQA? I was able to train the model on the training data, but my ANLS scores on the DocVQA validation data are quite low; I was only able to get around 40. @NielsRogge @tiennvcs @ArmiNouri
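
For reference, DocVQA is scored with ANLS (Average Normalized Levenshtein Similarity): for each question you take the best similarity over all accepted answers, and zero it out when the normalized edit distance is 0.5 or more. A minimal sketch of the metric, assuming the python-Levenshtein package is installed:

import Levenshtein  # pip install python-Levenshtein

def anls(predictions, references, threshold=0.5):
    # predictions: list of predicted answer strings
    # references: list of lists of accepted ground-truth answers per question
    scores = []
    for pred, answers in zip(predictions, references):
        best = 0.0
        for ans in answers:
            pred_n, ans_n = pred.strip().lower(), ans.strip().lower()
            nl = Levenshtein.distance(pred_n, ans_n) / max(len(pred_n), len(ans_n), 1)
            if nl < threshold:
                best = max(best, 1.0 - nl)
        scores.append(best)
    return sum(scores) / len(scores)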

anupamadeo commented 2 years ago

I am trying LayoutLMv2 on a new dataset for question answering, using the same code given in the notebook. I want to train and test it, but I am not able to create batches for the test set. I am also new to PyTorch; I have only worked with TensorFlow. Kindly help.
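
Since no error or code was posted, only a generic, hedged sketch is possible here: encoded_test_dataset and model are hypothetical names for a test split already encoded the same way as the training split in the notebook, wrapped in a plain PyTorch DataLoader.

import torch
from torch.utils.data import DataLoader

# keep only the tensor columns LayoutLMv2 expects, returned as PyTorch tensors
encoded_test_dataset.set_format(
    type="torch",
    columns=["input_ids", "attention_mask", "token_type_ids", "bbox", "image"])
test_dataloader = DataLoader(encoded_test_dataset, batch_size=4, shuffle=False)

model.eval()
for batch in test_dataloader:
    with torch.no_grad():
        outputs = model(**batch)
        # most likely start/end token positions for each example in the batch
        start_idx = outputs.start_logits.argmax(-1)
        end_idx = outputs.end_logits.argmax(-1)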

sujit420 commented 2 years ago

"not able to create batches for test"

You have to post your error for anyone to be able to help.

dongxuewang-123 commented 2 years ago

I submitted my result on the DocVQA website today, but it doesn't have a score. Does anyone know the reason?

herobd commented 2 years ago

Would anyone mind reporting your best ANLS scores using Tesseract?

@dongxuewang-123 Their server had a bug, which should be fixed now.