NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License

Bottom half of the image is not predicted by the LayoutLMv3 model #203

Open jyotiyadav94 opened 2 years ago

jyotiyadav94 commented 2 years ago

Hi @NielsRogge ,

First of all, thank you for your great work. I have been using the LayoutLMv3 fine-tuning notebook. Recently I came across an issue where some images are not fully predicted by the LayoutLMv3 model. As you can see below, the image is 300 dpi and its bottom half is not predicted by the model. I have seen this issue with most of my images and also tested LayoutLMv2 via https://huggingface.co/spaces/nielsr/LayoutLMv2-FUNSD, but the results always turn out the same. I cannot find the root cause of this issue. It is not a pytesseract problem, because if I extract the text directly from the image, the text in the bottom half is mostly detected. I verified this using the following Hugging Face Space: https://huggingface.co/spaces/keithhon/Tesseract-OCR

Could you please suggest what might be worth trying in order to solve this kind of issue with LayoutLM models?

Things I have tried

  1. Preprocess the images
  2. Crop the image and check whether the bottom half is detected on its own: yes, the model detects it perfectly well, but not when it is processed as part of the whole image.

NielsRogge commented 2 years ago

Hi,

That's because LayoutLMv3, like many other Transformer models, has a maximum sequence length of 512 tokens. Hence the input is truncated and only the first 512 text tokens are fed to the model. This is achieved by setting truncation=True when preparing examples for the model.
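For reference, a minimal sketch of how that truncation is usually set up with the processor from the fine-tuning notebook (the `image`, `words`, and `boxes` variables are assumed to come from your own OCR step):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)

# words and boxes come from your own OCR (e.g. pytesseract);
# apply_ocr=False tells the processor not to run Tesseract itself.
encoding = processor(image, words, boxes=boxes,
                     truncation=True,        # keep only the first 512 tokens
                     padding="max_length",
                     max_length=512,
                     return_tensors="pt")

print(encoding["input_ids"].shape)  # torch.Size([1, 512]): everything after token 512 is dropped
```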

jyotiyadav94 commented 2 years ago

Thank you @NielsRogge for the clarification. Are there any possibilities we can use more than the fixed token length or set some custom fixed length? because if I am not going to set truncation=True I am likely to get out of index range error.

NielsRogge commented 2 years ago

What's typically done is a so-called "sliding window" approach, where you slide windows of 512 tokens across the document (for instance with a stride of 128 tokens). This means that you first feed tokens [0, 512] through the model, then tokens [128, 640], then tokens [256, 768] etc. You can then average the predictions for tokens that are part of several windows.
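As an illustration of the averaging idea, here is a rough sketch (not code from the notebooks); `logits_fn` stands in for a call to the model on one window and `token_ids` for the full token sequence, both of which are placeholders:

```python
import torch

def sliding_window_logits(logits_fn, token_ids, window=512, stride=128):
    """Average per-token logits over overlapping 512-token windows."""
    n = token_ids.size(0)
    summed, counts = None, torch.zeros(n)
    start = 0
    while True:
        end = min(start + window, n)
        window_logits = logits_fn(token_ids[start:end])  # (end - start, num_labels)
        if summed is None:
            summed = torch.zeros(n, window_logits.size(-1))
        summed[start:end] += window_logits
        counts[start:end] += 1
        if end == n:
            break
        start += stride   # windows: [0, 512], [128, 640], [256, 768], ...
    return summed / counts.unsqueeze(-1)  # averaged prediction per token
```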

jyotiyadav94 commented 2 years ago

Thank you so much for the prompt reply. I will work on this solution and let you know.

jyotiyadav94 commented 2 years ago

Hi @NielsRogge ,

I have trained the LayoutLMv3 model with "bbox": Array2D(dtype="int64", shape=(512, 4)), but my documents have up to 1552 boxes.

I have tried changing the value 512 to 1024 and 2048, but during training I get ValueError: cannot reshape array of size 2048 into shape (1,1552,4).

Do you have any idea how to change this, or how to solve this problem?

The reference example you linked in https://github.com/NielsRogge/Transformers-Tutorials/issues/23#issuecomment-918030782 uses AutoTokenizer and has return_overflowing_tokens and stride parameters, but I could not find such parameters for LayoutLMv3.

Could you please suggest anything on this or provide any code if possible?

Navd15 commented 2 years ago

@jyotiyadav94 stride, return_overflowing_tokens, and return_offsets_mapping are the three parameters we need to split an image into several chunks. They are actually defined on the PreTrainedTokenizerBase class, but when using a Processor they can be passed there as well, because the Processor is responsible for calling the tokenizer's __call__ method. So if you pass those three parameters to the Processor, it will work as expected. Hope this helps.

jyotiyadav94 commented 2 years ago

Hi @Navd15 ,

Thank you for the suggestions. I have tried to use the above parameters in the processor.

Case 1:

```python
encoding = processor(images, words, boxes=boxes, word_labels=word_labels,
                     truncation=True, padding="max_length", stride=128,
                     return_overflowing_tokens=True, return_offsets_mapping=False)
```

Case 2:

```python
encoding = processor(images, words, boxes=boxes, word_labels=word_labels,
                     truncation=True, padding="max_length", stride=128,
                     return_overflowing_tokens=True, return_offsets_mapping=True)
```

Neither case seems to be working for me.

If I don't use truncation=True and just go forward with the stride parameter:

```python
encoding = processor(images, words, boxes=boxes, word_labels=word_labels,
                     padding="max_length", stride=128)
```

Could you please suggest whether this is the right approach or whether I am missing something?

Navd15 commented 2 years ago

Hi @jyotiyadav94, Case 1 is the right approach for building the training dataset. Set truncation=True and also specify max_length=512 in the processor; max_length is important because it is the length the image's tokens are chunked into. Also pop/remove overflow_to_sample_mapping and offset_mapping from the processor's output, as these keys are not expected by the model. So, with the above parameters, if the tokenizer generates 550 tokens in total for an image, there will be 2 sub-chunks of that image. You can then handle those chunks in your __getitem__ method however you want. Hope this helps.
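A hedged sketch of that preparation step (the variable names follow the earlier comments; adding max_length=512 explicitly is the only change to the Case 1/2 call):

```python
encoding = processor(images, words, boxes=boxes, word_labels=word_labels,
                     truncation=True, padding="max_length", max_length=512,
                     stride=128, return_overflowing_tokens=True,
                     return_offsets_mapping=True)

# These keys are useful for mapping chunks back to their source document and words,
# but the model itself does not accept them, so remove them before the forward pass.
offset_mapping = encoding.pop("offset_mapping")
overflow_to_sample_mapping = encoding.pop("overflow_to_sample_mapping")

# A document whose words tokenize to e.g. 550 tokens now yields 2 chunks of 512 tokens.
print(len(encoding["input_ids"]))
```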

jyotiyadav94 commented 2 years ago

Hi @Navd15 ,

Thank you so much for your suggestions; it worked out with those parameters. I am thinking about your last sentence on handling the two chunks of an image in __getitem__(). I found there is already a thread for this, https://github.com/NielsRogge/Transformers-Tutorials/issues/41, which is probably what you are referring to. I was under the assumption that this should be handled by the stride parameter. How do you actually handle the splitting of documents at inference time?

Navd15 commented 2 years ago

@jyotiyadav94 stride only gives us some overlap so that the model does not miss learning the features (more specifically, the labels) that lie halfway between two chunks. At inference time there is more than one way to handle it. The most basic is, for each file in the batch, to chunk it with the same processor (and, of course, the same parameters) and send the chunks to the model for prediction. The inference part can be tricky because it heavily depends on one's problem context. The best you can do is to think about training and inference for a single file rather than a batch, otherwise you can have a hard time understanding the dimensions of input_ids, attention_mask, etc.
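For what it's worth, a rough, hedged sketch of that one-file-at-a-time inference flow (the checkpoint path is a placeholder, and `image`, `words`, `boxes` are assumed to come from your own OCR; this is not code from the notebooks):

```python
import torch
from transformers import AutoProcessor, LayoutLMv3ForTokenClassification

processor = AutoProcessor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)
model = LayoutLMv3ForTokenClassification.from_pretrained("path/to/your-finetuned-checkpoint")

# chunk a single page with the same parameters used at training time
encoding = processor(image, words, boxes=boxes,
                     truncation=True, padding="max_length", max_length=512,
                     stride=128, return_overflowing_tokens=True,
                     return_offsets_mapping=True, return_tensors="pt")
offset_mapping = encoding.pop("offset_mapping")
encoding.pop("overflow_to_sample_mapping")

# with overflowing tokens, pixel_values can come back as a list of per-chunk tensors
if isinstance(encoding["pixel_values"], list):
    encoding["pixel_values"] = torch.stack(encoding["pixel_values"])

with torch.no_grad():
    outputs = model(**encoding)              # logits: (num_chunks, 512, num_labels)

predictions = outputs.logits.argmax(-1)      # one predicted label id per token, per chunk
```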

tadinhkien99 commented 2 years ago

> Hi @Navd15, thank you so much for your suggestions; it worked out with those parameters. [...] How do you actually handle the splitting of documents at inference time?

Hi, have you solved it? I can create the mapped data like you, but when training I get the error ValueError: too many values to unpack (expected 2). Can you show the code you used for processing and training? Thank you.

Atul997 commented 2 years ago

@jyotiyadav94 How do you use set_format and retrieve the actual data from the dictionary in a custom PyTorch dataloader? Can you please post an example of how you did it?

jyotiyadav94 commented 2 years ago

Hi @Atul997

You can extract them by assigning different names to the labels and then extracting those labels. But I haven't found a good approach for extracting them as questions and answers.

jyotiyadav94 commented 2 years ago

Hi @tadinhkien99

If you use these parameters as I have used in the code, you can get the offset_mapping and overflow_to_sample_mapping; for the rest you can use the same code: https://colab.research.google.com/github/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv3/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb#scrollTo=F4gNkWLjW42_

I haven't had any errors during the training phase. I am still implementing the solution based on @Navd15's comment above. For the time being, I have also implemented an inference workaround where I split the image in half and then make predictions.

tadinhkien99 commented 2 years ago

> Hi @tadinhkien99, if you use these parameters as I have used in the code, you can get the offset_mapping and overflow_to_sample_mapping; for the rest you can use the same code [...]

Thank you, I fixed it, and I also completed __getitem__ to stride the image, so I don't need to split images in half. But the problem now is that the training code can only run with batch_size=1. Do you know how to fix the batch size? Also, I don't use the Trainer function from transformers.

Atul997 commented 2 years ago

@jyotiyadav94 Thanks for the reply, it worked for me after restarting the colab notebook.

Atul997 commented 2 years ago

@jyotiyadav94 How do you run inference with these settings, where the processor is used with arguments like return_overflowing_tokens? When I run inference I get an error at the line true_predictions = [id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]]:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

jyotiyadav94 commented 2 years ago

> If you use these parameters as I have used in the code, you can get the offset_mapping and overflow_to_sample_mapping [...]
>
> Thank you, I fixed it, and I also completed __getitem__ to stride the image, so I don't need to split images in half. But the problem now is that the training code can only run with batch_size=1. [...]

@tadinhkien99 Could you please share the reference code for this, and how do you actually handle the chunks at the inference part with the above settings?

@Atul997

Atul997 commented 2 years ago

@jyotiyadav94 While running inference I found that with the above settings the output shape is [batch_size, sequence_length, num_labels]; in my case it is torch.Size([5, 512, 10]). What I understand is that the whole output is returned in chunks. I am also getting an error:

```
ValueError                                Traceback (most recent call last)
<ipython-input-56-4c6325471edd> in <module>
      3 is_subword = np.array(offset_mapping.squeeze().tolist())[:,0] != 0
      4
----> 5 true_predictions = [id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]]
      6 true_boxes = [unnormalize_box(box, width, height) for idx, box in enumerate(token_boxes) if not is_subword[idx]]

<ipython-input-56-4c6325471edd> in <listcomp>(.0)
      3 is_subword = np.array(offset_mapping.squeeze().tolist())[:,0] != 0
      4
----> 5 true_predictions = [id2label[pred] for idx, pred in enumerate(predictions) if not is_subword[idx]]
      6 true_boxes = [unnormalize_box(box, width, height) for idx, box in enumerate(token_boxes) if not is_subword[idx]]

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
```

I am stuck at this point. If you have any suggestions please share.
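For what it's worth, a hedged sketch of one way around that error: the single-image FUNSD inference snippet assumes one 512-token sequence, so with several chunks the is_subword mask and the predictions can be handled one chunk at a time. id2label, unnormalize_box, width and height follow the notebook; outputs, encoding and offset_mapping are assumed to come from a chunked processor/model call as discussed above, and the per-chunk loop itself is my assumption:

```python
import numpy as np

predictions = outputs.logits.argmax(-1).tolist()   # (num_chunks, 512) label ids
token_boxes = encoding["bbox"].tolist()            # (num_chunks, 512, 4)
offsets = np.asarray(offset_mapping)               # (num_chunks, 512, 2)

true_predictions, true_boxes = [], []
for chunk_idx in range(len(predictions)):
    # a token is a subword continuation if its character offset does not start at 0
    is_subword = offsets[chunk_idx, :, 0] != 0
    for idx, pred in enumerate(predictions[chunk_idx]):
        if not is_subword[idx]:
            true_predictions.append(id2label[pred])
            true_boxes.append(unnormalize_box(token_boxes[chunk_idx][idx], width, height))

# Note: tokens in the overlap created by stride appear in two chunks and are collected
# twice here; deduplicate or average them if that matters for your use case.
```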

Atul997 commented 1 year ago

Can anyone share the inference steps for a whole image?

Navd15 commented 1 year ago

> Can anyone share the inference steps for a whole image?

Hey @Atul997, can you please tell which part of inference your query is about?

nik13 commented 1 year ago

Has anyone solved it?

mit1280 commented 1 year ago

Hey @nik13, check this out: https://github.com/huggingface/transformers/issues/19190. I modified the inference part a little bit and it worked for me.

arvindrajan92 commented 1 year ago

> What's typically done is a so-called "sliding window" approach, where you slide windows of 512 tokens across the document [...]. You can then average the predictions for tokens that are part of several windows.

I can confirm that this approach works; in other words, it is more reliable than increasing the token sequence length.

I have tried both the sliding window and the increased-sequence-length options. Increasing the sequence length for a model that has been pre-trained on 512 tokens is not a good idea. It works well for in-distribution samples but not for out-of-distribution data, which tells me that the pre-trained model loses its ability to generalise.

mit1280 commented 1 year ago

I have created notebooks for LayoutLM training and inference. They can handle the whole image, since the image is divided into chunks of 512 tokens. Notebook

nikhilKumarMarepally commented 1 year ago

```python
image = example["image"]
words = example["tokens"]
boxes = example["bboxes"]
word_labels = example["ner_tags"]

encoding = processor(image, words, boxes=boxes, word_labels=word_labels, truncation=True)

encoding = processor(image, words, boxes=boxes, word_labels=word_labels, truncation=True,
                     stride=128, padding="max_length", max_length=512,
                     return_overflowing_tokens=True, return_offsets_mapping=True)

offset_mapping = encoding.pop('offset_mapping')
overflow_to_sample_mapping = encoding.pop('overflow_to_sample_mapping')

x = []
for i in range(0, len(encoding['pixel_values'])):
    x.append(torch.from_numpy(encoding['pixel_values'][i]))
x = torch.stack(x)
encoding['pixel_values'] = x

encoding = prepare_examples(example)

for k, v in encoding.items():
    print(k, v.shape)
```

```
AttributeError                            Traceback (most recent call last)
/tmp/ipykernel_100068/2923212323.py in <module>
     24
     25 for k,v in encoding.items():
---> 26     print(k, v.shape)

AttributeError: 'list' object has no attribute 'shape'
```

I am getting this error while running inference. Can you help me, @mit1280?

mit1280 commented 1 year ago

Hi @nikhilKumarMarepally, please check https://github.com/mit1280/Document-AI/blob/main/LayoutLMv3_Inference.ipynb

You need to stack "input_ids", "attention_mask", and "bbox". They are all lists, so first convert them to tensors and then stack them. This will resolve the issue.
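A hedged sketch of that stacking step, mirroring what the snippet above already does for pixel_values (adding "labels" is my assumption, since word_labels were passed to the processor):

```python
import torch

# The tokenizer outputs come back as plain Python lists, one entry per 512-token chunk.
# Convert each chunk to a tensor, then stack the chunks along a new batch dimension.
for key in ["input_ids", "attention_mask", "bbox", "labels"]:
    if key in encoding:
        encoding[key] = torch.stack([torch.as_tensor(chunk) for chunk in encoding[key]])

# pixel_values was already stacked above, so now every value has a .shape
for k, v in encoding.items():
    print(k, v.shape)
```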

nikhilKumarMarepally commented 1 year ago

Thanks @mit1280, that resolved it, thanks for the help. I have a question: I am working on signature detection using LayoutLMv3. I have the bounding boxes, but there wouldn't be any OCR text inside them; the text would just be "signature". Is that the right way to train the model?

mit1280 commented 1 year ago

Hi @nikhilKumarMarepally, for LayoutLMv3 training you need the page text, bounding box coordinates, labels, and the image. If you have data like https://guillaumejaume.github.io/FUNSD/ then you can use LayoutLM; otherwise, please provide more details on your training dataset (what kind of data do you have).

jefferyvvv commented 1 year ago

Hi @mit1280, I checked the LayoutLM notebook script and found that when the text is empty, the bounding box is ignored. Is there any method that helps when the OCR text is empty for some reason, such as blur or misdetection, but the bounding box exists? Or I want to check whether a certain region is filled or not.

mit1280 commented 1 year ago

Hi @jefferyvvv, LayoutLM works on two things: layout (i.e. position) and text value. The model decides the label based on the position of the text and the value of the text. When I was playing with LayoutLM without that filter, the model threw an error saying that the number of bounding boxes and the number of words were not the same. Maybe you can add empty strings at the mentioned positions to resolve that error.

What I am saying is:

Let's say the text list is empty, i.e. [], and there are two bounding boxes for the image, e.g. [[0,0,0.5,0.5], [1,1,0.5,0.5]]. You can update the text list to ['', '']; in this case you will not get the length-mismatch error.
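A tiny hedged sketch of that padding step (the helper name and the example values are just illustrations):

```python
def pad_words_to_boxes(words, boxes):
    """Ensure one word per bounding box by filling missing entries with empty strings."""
    if len(words) < len(boxes):
        words = words + [""] * (len(boxes) - len(words))
    return words, boxes

# one recognized word, but two detected boxes
words, boxes = pad_words_to_boxes(["Total"], [[0, 0, 50, 20], [60, 0, 120, 20]])
assert len(words) == len(boxes)  # words is now ['Total', ''], matching the two boxes
```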

I only wanted to test the model, and there were just 2 examples with empty strings, so I added the filter. Let me know if it makes sense.