PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: Will English Docvqa dataset work for finetuning? (Ernie-Layout) #4599

Closed · JP-Leite closed this issue 3 months ago

JP-Leite commented 1 year ago

Please ask your question

Hello,

Wanted to know if the DocVQA dataset in English can be used for fine-tuning (Ernie-Layout), or if the dataset needs to be modified or transformed in any way.

DocVQA

paulpaul91 commented 1 year ago

Please ask your question

Hello,

Wanted to know if the DocVQA dataset in English can be used for fine-tuning (Ernie-Layout), or if the dataset needs to be modified or transformed in any way.

DocVQA

Convert the DocVQA dataset to our dataset format, then train using the MRC format: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/model_zoo/ernie-layout/run_mrc.py

JP-Leite commented 1 year ago

Thank you for the prompt response. Looking at the FUNSD and DocVQA_zh datasets, it seems we have OCR'd tokens from the image, their bounding box, and then a segment_id, which looks to be a grouping of tokens.

If we wanted to build a new dataset from scratch with new images, do you have any tips on what to use to create the segment_id for token groupings?

paulpaul91 commented 1 year ago
  1. The segment boxes can be obtained through OCR; see the Inference Example usage on PaddleOCR, and the sketch below.
  2. In our experiments, segment boxes increase the F1 score on the NER task but show no obvious improvement on the MRC and Classification tasks.
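
For reference, a minimal sketch of pulling segment boxes out of PaddleOCR and assigning each detected line a segment_id (the word-splitting step and the output field names here are assumptions, not the repo's exact format):

from paddleocr import PaddleOCR  # pip install paddleocr

# Sketch: treat each OCR-detected line as one segment; every token in
# that line shares the line's segment_id and segment box.
ocr = PaddleOCR(lang="en")
result = ocr.ocr("page.png")

tokens = []
for segment_id, (quad, (text, score)) in enumerate(result[0]):
    xs = [p[0] for p in quad]  # quad is four (x, y) corner points
    ys = [p[1] for p in quad]
    segment_box = [min(xs), min(ys), max(xs), max(ys)]
    for word in text.split():  # naive split; real tokenization may differ
        tokens.append({"text": word, "segment_id": segment_id,
                       "segment_box": segment_box})
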
logan-markewich commented 1 year ago

@JP-Leite For fine-tuning, I found this repo helpful. Fine-tuning on English DocVQA worked very well.

JP-Leite commented 1 year ago
  1. The segment boxes can be obtained through OCR; see the Inference Example usage on PaddleOCR.
  2. In our experiments, segment boxes increase the F1 score on the NER task but show no obvious improvement on the MRC and Classification tasks.

Thanks for the reply. This may be a simple question, but looking at some other threads on the subject (#3828), as I understand it, to leverage a new dataset I would need to:

Lastly, if use_segment_box is set to false during training, do I still need to include that section as part of the dataset?

JP-Leite commented 1 year ago

@JP-Leite For fine-tuning, I found this repo helpful. Fine-tuning on English DocVQA worked very well.

I saw this a few days ago but was a bit hesitant to dive deeper, as there is even less documentation on that side about how to train and create datasets, though I might have just missed it. Do you mind sharing your DocVQA-en fine-tuning checkpoint so I can play around with it? If you have the scripts you used to transform the DocVQA dataset and fine-tune, that would be cool to see as well!

logan-markewich commented 1 year ago

@JP-Leite I also trained on a bunch of proprietary data, so I can't share the checkpoint :( Training is very easy, because you can use it just like a normal huggingface model (if you aren't familiar with huggingface yet, it is a must for AI 💯)

Here's a notebook showing how to fine-tune on DocVQA. You just need to replace the model and feature extractor with ErnieLayout and it will work. In the training loop, you just have to replace occurrences of "image" with "pixel_values". It looks a little scary but it's pretty straightforward.
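
For example, the key rename in the training loop amounts to something like this (a sketch with hypothetical names; model and train_dataloader come from that notebook, not from this repo):

# Sketch: ErnieLayout expects "pixel_values" where the notebook's
# dataloader yields "image".
for batch in train_dataloader:
    batch["pixel_values"] = batch.pop("image")
    outputs = model(**batch)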

Also note that in the notebook they apply OCR to the dataset using tesseract, rather than using the supplied boxes/tokens. This is optional, but it does improve performance as it will get the model used to the output of your chosen OCR service.
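
That optional re-OCR step might look roughly like this with pytesseract (a sketch; the notebook's actual preprocessing may differ):

import pytesseract
from PIL import Image

# Sketch: OCR a page yourself instead of using the dataset's supplied
# tokens/boxes, so training matches your OCR service's output.
image = Image.open("page.png")
data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
words, boxes = [], []
for text, x, y, w, h in zip(data["text"], data["left"], data["top"],
                            data["width"], data["height"]):
    if text.strip():
        words.append(text)
        boxes.append([x, y, x + w, y + h])  # remember to normalize to 0-1000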

Good luck!

JP-Leite commented 1 year ago

I've written a small script to convert the DocVQA format into a format that mimics the DocVQA_zh format. Within the conversion zip, the val.json provides some examples of the conversion. Conversion.zip

So far, I have had success in my environment running both standard NER and MRC fine-tuning. When I attempt to fine-tune on my converted DocVQA (English) data, I continue to receive the following error:

 0%|                                     | 81/29118 [02:05<12:40:05,  1.57s/it]Exception in thread Thread-3:
Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 696, in convert_to_tensors
    tensor = as_tensor(value)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/decorator.py", line 232, in fun
    return caller(func, *(extras + args), **kw)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/fluid/wrapped_decorator.py", line 25, in __impl__
    return wrapped_func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/fluid/framework.py", line 434, in __impl__
    return func(*args, **kwargs)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/tensor/creation.py", line 184, in to_tensor
    return paddle.Tensor(
OSError: (External) CUDA error(700), an illegal memory access was encountered. 
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:258)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 217, in _thread_loop
    batch = self._dataset_fetcher.fetch(indices,
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/fluid/dataloader/fetcher.py", line 134, in fetch
    data = self.collate_fn(data)
  File "/home/ubuntu/PaddleNLP/model_zoo/ernie-layout/data_collator.py", line 70, in __call__
    batch = self.tokenizer.pad(
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2602, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 227, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 705, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
Traceback (most recent call last):
  File "run_mrc.py", line 242, in <module>
    main()
  File "run_mrc.py", line 210, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 661, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 1316, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 1278, in compute_loss
    outputs = model(**inputs)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddlenlp/transformers/ernie_layout/modeling.py", line 1086, in forward
    outputs = self.ernie_layout(
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 930, in __call__
    return self._dygraph_call_func(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 915, in _dygraph_call_func
    outputs = self.forward(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddlenlp/transformers/ernie_layout/modeling.py", line 716, in forward
    visual_bbox = self._calc_visual_bbox(self.config["image_feature_pool_shape"], bbox, visual_shape)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddlenlp/transformers/ernie_layout/modeling.py", line 657, in _calc_visual_bbox
    visual_bbox = paddle.stack(
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/tensor/manipulation.py", line 903, in stack
    return layers.stack(x, axis, name)
  File "/home/ubuntu/miniconda3/envs/Ernie/lib/python3.8/site-packages/paddle/fluid/layers/nn.py", line 10397, in stack
    return _C_ops.stack(x, 'axis', axis)
OSError: (External) CUDA error(700), an illegal memory access was encountered. 
  [Hint: 'cudaErrorIllegalAddress'. The device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched. ] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:251)

I'm certain this has something to do with how I am converting the English DocVQA files, but I'm not sure how to troubleshoot. Is there a way to turn on more verbose logging?

logan-markewich commented 1 year ago

@JP-Leite debugging will be much easier if you run on CPU (running on CPU gives better errors in my experience)

JP-Leite commented 1 year ago

@JP-Leite debugging will be much easier if you run on CPU (running on CPU gives better errors in my experience)

Here is output from CPU:

  File "run_mrc.py", line 242, in <module>
    main()
  File "run_mrc.py", line 210, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 661, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 1316, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 1278, in compute_loss
    outputs = model(**inputs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/ernie_layout/modeling.py", line 1086, in forward
    outputs = self.ernie_layout(
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/ernie_layout/modeling.py", line 742, in forward
    text_layout_emb = self._calc_text_embeddings(
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/ernie_layout/modeling.py", line 612, in _calc_text_embeddings
    x1, y1, x2, y2, h, w = self.embeddings._cal_spatial_position_embeddings(bbox)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/ernie_layout/modeling.py", line 140, in _cal_spatial_position_embeddings
    w_position_embeddings = self.w_position_embeddings(bbox[:, :, 2] - bbox[:, :, 0])
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/nn/layer/common.py", line 1517, in forward
    return F.embedding(
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/nn/functional/input.py", line 203, in embedding
    return _C_ops.embedding(x, weight, padding_idx, sparse)
ValueError: (InvalidArgument) Variable value (input) of OP(fluid.layers.embedding) expected >= 0 and < 1024, but got -17. Please check input value.
  [Hint: Expected ids[i] >= 0, but received ids[i]:-17 < 0:0.] (at /paddle/paddle/phi/kernels/cpu/embedding_kernel.cc:76)

I receive a different error, but it happens instantly.

I'm trying to rule out something environmental, but since I was able to run both FUNSD and DocVQA_zh without any issue, I still think my problem is the custom dataset.

My environment:
Driver Version: 515.43.04
CUDA Version: 11.7
GPU: NVIDIA Corporation TU104GL [Tesla T4] (rev a1)
OS: Ubuntu 20.04.5 LTS x86_64
gcc: 9.4.0
Cudnn: 8.4.1
paddlenlp: 2.5.0
paddlepaddle-gpu: 2.4.1.post117

My approach was:

I get some form of this error:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 696, in convert_to_tensors
    tensor = as_tensor(value)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/tensor/creation.py", line 546, in to_tensor
    return _to_tensor_non_static(data, dtype, place, stop_gradient)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/tensor/creation.py", line 405, in _to_tensor_non_static
    return core.eager.Tensor(
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: Please search for the error code(700) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:259)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/fluid/dataloader/dataloader_iter.py", line 217, in _thread_loop
    batch = self._dataset_fetcher.fetch(indices,
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/fluid/dataloader/fetcher.py", line 138, in fetch
    data = self.collate_fn(data)
  File "/home/ubuntu/PaddleNLP/model_zoo/ernie-layout/data_collator.py", line 70, in __call__
    batch = self.tokenizer.pad(
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 2602, in pad
    return BatchEncoding(batch_outputs, tensor_type=return_tensors)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 227, in __init__
    self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/tokenizer_utils_base.py", line 705, in convert_to_tensors
    raise ValueError(
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length.
Traceback (most recent call last):
  File "run_mrc.py", line 242, in <module>
    main()
  File "run_mrc.py", line 210, in main
    train_result = trainer.train(resume_from_checkpoint=checkpoint)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 661, in train
    tr_loss_step = self.training_step(model, inputs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 1316, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/trainer/trainer.py", line 1278, in compute_loss
    outputs = model(**inputs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/ernie_layout/modeling.py", line 1086, in forward
    outputs = self.ernie_layout(
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 948, in __call__
    return self.forward(*inputs, **kwargs)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/ernie_layout/modeling.py", line 716, in forward
    visual_bbox = self._calc_visual_bbox(self.config["image_feature_pool_shape"], bbox, visual_shape)
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddlenlp/transformers/ernie_layout/modeling.py", line 657, in _calc_visual_bbox
    visual_bbox = paddle.stack(
  File "/home/ubuntu/miniconda3/envs/PaddleNLP/lib/python3.8/site-packages/paddle/tensor/manipulation.py", line 1839, in stack
    return _C_ops.stack(x, axis)
OSError: (External) CUDA error(700), an illegal memory access was encountered.
  [Hint: Please search for the error code(700) on website (https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g3f51e3575c2178246db0a94a430e0038) to get Nvidia's official solution and advice about CUDA Error.] (at /paddle/paddle/phi/backends/gpu/cuda/cuda_info.cc:252)

For anyone that wants to try to reproduce these results, here is a small subset of the docvqa_en dataset that I converted. Just place it locally and update the path in the docvqa_en.py script to run.

docvqa_en.tar.gz

logan-markewich commented 1 year ago

From the error, I think it is a problem with your bounding boxes. I see bbox values > 1000 in the file, and I don't see any normalization inside run_mrc.py

Maybe double-check the bboxes in docvqa_zh (I can't seem to access it). They might be pre-normalized.

For reference, the bboxes should be normalized between 0-1000, something like this:

def normalize_box(box, width, height):
    return [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]
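
For example, on a 1240x1754 page, normalize_box([100, 200, 400, 260], 1240, 1754) returns [80, 114, 322, 148].
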
JP-Leite commented 1 year ago

I will give it a shot, but for reference, here is the first record from the docvqa_zh dataset, which has bbox values over 1000 as well. Sample_docvqa_zh.json.zip

logan-markewich commented 1 year ago

Hmmm, then that's probably not the issue. Although it still seems related, because somehow -17 is getting into the model as a bounding box width. If you look at the traceback, it is happening when the width is calculated:

w_position_embeddings = self.w_position_embeddings(bbox[:, :, 2] - bbox[:, :, 0])

It might also be good to sanity check that bbox[0] is always less than bbox[2] in your training data. Beyond that, I'm out of ideas. Good luck friend!

JP-Leite commented 1 year ago

Hmmm, then that's probably not the issue. Although it still seems related, because somehow -17 is getting into the model as a bounding box width. If you look at the traceback, it is happening when the width is calculated:

w_position_embeddings = self.w_position_embeddings(bbox[:, :, 2] - bbox[:, :, 0])

It might also be good to sanity check that bbox[0] is always less than bbox[2] in your training data. Beyond that, I'm out of ideas. Good luck friend!

Thank you so much for the help. After reviewing the DocVQA dataset further, it seems the bboxes returned for certain images were in fact causing the issue (see the attached MicrosoftTeams image). As an example, the "51142 3977" text in the bottom-right portion of the image had an incorrect bbox: the "top left" coordinates for those tokens were further right than the "top right" coordinates. I wasn't getting this issue with datasets created by myself, as I was treating the OCR results back from Textract a bit differently than what the DocVQA dataset has.
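
For anyone hitting the same thing, a quick sanity check along these lines would catch it (a sketch; assumes [x1, y1, x2, y2] coordinate order):

def check_bbox(box):
    # Well-formed means the left edge is left of the right edge and the
    # top edge is above the bottom edge.
    x1, y1, x2, y2 = box
    return x1 <= x2 and y1 <= y2

# One good box and one malformed box like the "51142 3977" case:
boxes = [[10, 20, 110, 45], [130, 20, 95, 45]]
print([b for b in boxes if not check_bbox(b)])  # -> [[130, 20, 95, 45]]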

For anyone interested, I've uploaded the fine-tuned model to Huggingface

mayyank-dpa commented 1 year ago

Hi @JP-Leite,

  1. Can you share the data conversion script?
  2. And how about training this model for webpages? Should we split the webpage into parts or process the complete webpage at once?
JP-Leite commented 1 year ago

Hi @JP-Leite,

  1. Can you share the data conversion script?
  2. And how about training this model for webpages? Should we split the webpage into parts or process the complete webpage at once?

  1. My data conversion script for the DocVQA dataset is very specific to how that dataset is presented, so I'm not sure it would be much help. For custom datasets, my code is distributed across a few Lambda functions in AWS, so it's not something easily exported. The Conversion.zip shows the format that needs to be achieved, and any OCR service should be able to get you the text and bboxes. For the segments and segment bbox, PaddleOCR can get you actual segments, or you can just use lines from any of the standard services.

  2. To be honest, XDoc from Microsoft would most likely be a better option for web-based sources, as it can use the "web" layer present, or any other project that solves for the WebSRC dataset. If you are dead set on using Ernie-Layout, I think you need to split the web page into "pages" and feed that as the page number. I haven't tackled multipage TIFFs yet, so I haven't figured out how to feed multipage documents either, but it's on my to-do list.

mayyank-dpa commented 1 year ago

@JP-Leite,

OK. I wanted to use ERNIE as it has the additional modality of layout, and the webpage structure changes quite rapidly in my case. How can I now split the pages? I thought that I could utilize the layoutparser boxes and cut the pages such that layout boxes don't get divided. Or perhaps use layout boxes as "pages". Any better idea to do this?

mayyank-dpa commented 1 year ago

@JP-Leite,

I had another doubt regarding passing the images. I am not able to figure out where this repo incorporates the images used in training/validation. train.json etc. mention the image filename in the 'name' tag, but I am unable to understand where that gets passed to the model and how it is downloaded.

JP-Leite commented 1 year ago

@JP-Leite,

I had another doubt regarding passing the images. I am not able to figure out where this repo incorporates the images used in training/validation. train.json etc. mention the image filename in the 'name' tag, but I am unable to understand where that gets passed to the model and how it is downloaded.

The image data is placed within the JSON, not a reference to the image location. When you are building your JSON training file, base64-encoded image data goes in the "image" piece of the JSON. You can see an example above in the Conversion.zip val.json file.
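
Roughly like this (a sketch; the "text"/"bbox"/"segment_id" field names here are illustrative, so check them against the val.json example):

import base64
import json

# Sketch: embed the page image itself in the record as base64, rather
# than a path to the file.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

record = {
    "name": "page.png",
    "image": image_b64,
    "text": ["Invoice", "51142"],
    "bbox": [[52, 40, 180, 68], [52, 80, 120, 104]],
    "segment_id": [0, 1],
}
with open("train.json", "a") as f:
    f.write(json.dumps(record) + "\n")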

JP-Leite commented 1 year ago

@JP-Leite,

OK. I wanted to use ERNIE as it has the additional modality of layout, and the webpage structure changes quite rapidly in my case. How can I now split the pages? I thought that I could utilize the layoutparser boxes and cut the pages such that layout boxes don't get divided. Or perhaps use layout boxes as "pages". Any better idea to do this?

Not sure how best to do this. My dumb idea would be to print the webpage to PDF and see where the webpage gets broken up, then recreate that with a script to generate pages for any webpage. Maybe you can feed an entire webpage as one page, but I think if it becomes long enough, it might mess up the bbox parameters, since everything gets scaled to a proportion of 1000 as @logan-markewich mentions above. So I guess it really depends on how long these webpages are.

mayyank-dpa commented 1 year ago

@JP-Leite, OK. I wanted to use ERNIE as it has the additional modality of layout, and the webpage structure changes quite rapidly in my case. How can I now split the pages? I thought that I could utilize the layoutparser boxes and cut the pages such that layout boxes don't get divided. Or perhaps use layout boxes as "pages". Any better idea to do this?

Not sure how best to do this. My dumb idea would be to print the webpage to PDF and see where the webpage gets broken up, then recreate that with a script to generate pages for any webpage. Maybe you can feed an entire webpage as one page, but I think if it becomes long enough, it might mess up the bbox parameters, since everything gets scaled to a proportion of 1000 as @logan-markewich mentions above. So I guess it really depends on how long these webpages are.

Yes, scaling will mess up the bboxes. Splitting is the way to go here for now. I'll get back if I can get something on this. Meanwhile, can you upload your converted DocVQA_en dataset to your repo or somewhere and share it?

JP-Leite commented 1 year ago

@JP-Leite, OK. I wanted to use ERNIE as it has the additional modality of layout, and the webpage structure changes quite rapidly in my case. How can I now split the pages? I thought that I could utilize the layoutparser boxes and cut the pages such that layout boxes don't get divided. Or perhaps use layout boxes as "pages". Any better idea to do this?

Not sure how best to do this. My dumb idea would be to print the webpage to PDF and see where the webpage gets broken up, then recreate that with a script to generate pages for any webpage. Maybe you can feed an entire webpage as one page, but I think if it becomes long enough, it might mess up the bbox parameters, since everything gets scaled to a proportion of 1000 as @logan-markewich mentions above. So I guess it really depends on how long these webpages are.

Yes, scaling will mess up the bboxes. Splitting is the way to go here for now. I'll get back if I can get something on this. Meanwhile, can you upload your converted DocVQA_en dataset to your repo or somewhere and share it?

The file is 30+ GB, so I couldn't find anywhere to host it. For inference, the files are on Huggingface. Follow the deploy guide to figure out where those files go.

I've also realised that we may be able to fine-tune an already fine-tuned model by pointing to the final model path of a model you previously fine-tuned (in my case, the DocVQA_en model I had trained). To do this you just need to change

python3 -u run_mrc.py \
  --model_name_or_path ernie-layoutx-base-uncased \

to

python3 -u run_mrc.py \
  --model_name_or_path {path_of_fine_tuned_model} \

I'll try to post that folder with all its contents so anyone can start fine-tuning on the pretrained DocVQA_en model.

mayyank-dpa commented 1 year ago

Thanks! Will look into it

mayyank-dpa commented 1 year ago

While using infer.py from the deploy/python folder, I am getting this error:

Traceback (most recent call last):
  File "infer.py", line 65, in <module>
    main()
  File "infer.py", line 55, in main
    predictor = Predictor(args)
  File "/home/<>/Desktop/web_scraper/PaddleNLP/model_zoo/ernie-layout/deploy/python/predictor.py", line 78, in __init__
    self.inference_backend = InferBackend(args.model_path_prefix, device=args.device)
  File "/home/<>/Desktop/web_scraper/PaddleNLP/model_zoo/ernie-layout/deploy/python/predictor.py", line 49, in __init__
    self.predictor = paddle.inference.create_predictor(config)
RuntimeError: (NotFound) Cannot open file /home/<>/Desktop/web_scraper/Ernie-Layout-DocVQA_en.pdmodel, please confirm whether the file is normal.
  [Hint: Expected static_cast<bool>(fin.is_open()) == true, but received static_cast<bool>(fin.is_open()):0 != true:1.] (at /paddle/paddle/fluid/inference/api/analysis_predictor.cc:1500)

JP-Leite commented 1 year ago

While using infer.py from the deploy/python folder, I am getting this error:

Traceback (most recent call last):
  File "infer.py", line 65, in <module>
    main()
  File "infer.py", line 55, in main
    predictor = Predictor(args)
  File "/home/<>/Desktop/web_scraper/PaddleNLP/model_zoo/ernie-layout/deploy/python/predictor.py", line 78, in __init__
    self.inference_backend = InferBackend(args.model_path_prefix, device=args.device)
  File "/home/<>/Desktop/web_scraper/PaddleNLP/model_zoo/ernie-layout/deploy/python/predictor.py", line 49, in __init__
    self.predictor = paddle.inference.create_predictor(config)
RuntimeError: (NotFound) Cannot open file /home/<>/Desktop/web_scraper/Ernie-Layout-DocVQA_en.pdmodel, please confirm whether the file is normal.
  [Hint: Expected static_cast<bool>(fin.is_open()) == true, but received static_cast<bool>(fin.is_open()):0 != true:1.] (at /paddle/paddle/fluid/inference/api/analysis_predictor.cc:1500)

What is the command you are using to run infer.py? It looks like the .pdmodel isn't in the expected location. If you are running the standard command, you should place the pd files at --model_path_prefix ../../mrc_export/inference, which would be two folders above where infer.py is located, in its own folder called mrc_export.


maximepoffet commented 1 year ago

I'll try to post that folder with all its contents so anyone can start fine-tuning on the pretrained DocVQA_en model.

Hello @JP-Leite, thank you for your work! Have you by any chance posted the final model as you said?

Riyuk-04 commented 1 year ago

@JP-Leite For fine-tuning, I found this repo helpful. Fine-tuning on English DocVQA worked very well.

Hi @logan-markewich, I tried to load the pretrained model from HuggingFace but it had some missing weights in the visual layers; did the same occur for you as well? I was thinking of training on DocVQA to learn the weights.

Missing layers: ['ernie_layout.visual.backbone.resnet.layer0.0.batch_norm1.num_batches_tracked', 'ernie_layout.visual.backbone.resnet.layer0.0.batch_norm3.num_batches_tracked', 'ernie_layout.visual.backbone.resnet.layer0.1.batch_norm3.num_batches_tracked', 'qa_outputs.weight', 'ernie_layout.visual.backbone.resnet.layer0.0.batch_norm2.num_batches_tracked', 'ernie_layout.visual.backbone.resnet.layer0.0.shortcut.1.num_batches_tracked', 'ernie_layout.visual.backbone.resnet.layer0.2.batch_norm1.num_batches_tracked', 'ernie_layout.visual.backbone.resnet.layer0.2.batch_norm3.num_batches_tracked', 'ernie_layout.visual.backbone.resnet.layer0.1.batch_norm2.num_batches_tracked', 'ernie_layout.visual.backbone.batch_norm1.num_batches_tracked', 'ernie_layout.visual.backbone.resnet.layer0.2.batch_norm2.num_batches_tracked', 'ernie_layout.visual.backbone.resnet.layer0.1.batch_norm1.num_batches_tracked', 'qa_outputs.bias']