NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.
MIT License

Need help in key value pair extraction. #144

Closed Laxmi530 closed 1 year ago

Laxmi530 commented 2 years ago

Can someone please guide me on how to extract key-value pairs from a scanned invoice using LayoutLM?

NielsRogge commented 2 years ago

Refer to https://github.com/huggingface/transformers/issues/15451#issue-1120232737

fraps12 commented 2 years ago

The relation extraction model works quite badly on real data. Maybe I failed at training or data prep; maybe you will be more successful.

My advice is to use the output of layoutlmv2_for_token_classification in some algorithmic logic to form the key-value pairs. You will need a module for grouping text based on the predicted labels, the positions of same-labeled tokens, the positions of differently labeled tokens, and so on.

I can't provide the code, but it works.
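[Editor's note: this is not fraps12's code, but a minimal sketch of the grouping idea described above, assuming you already have per-token labels and pixel-space boxes from the token classifier. The label names and the `max_gap` threshold are illustrative.]

```python
def group_tokens(tokens, max_gap=20):
    """Merge consecutive same-labeled tokens whose boxes are close on the page.

    tokens: list of dicts with keys 'text', 'label', 'box' (x0, y0, x1, y1),
    assumed to be in reading order. Returns (label, merged_text, merged_box) tuples.
    """
    groups = []
    for tok in tokens:
        if groups:
            label, texts, box = groups[-1]
            # Same label and horizontally adjacent -> extend the current group
            if tok["label"] == label and tok["box"][0] - box[2] <= max_gap:
                texts.append(tok["text"])
                groups[-1] = (label, texts,
                              (box[0], min(box[1], tok["box"][1]),
                               tok["box"][2], max(box[3], tok["box"][3])))
                continue
        groups.append((tok["label"], [tok["text"]], tuple(tok["box"])))
    return [(label, " ".join(texts), box) for label, texts, box in groups]

tokens = [
    {"text": "Invoice", "label": "QUESTION", "box": (10, 10, 60, 20)},
    {"text": "No.",     "label": "QUESTION", "box": (65, 10, 90, 20)},
    {"text": "12345",   "label": "ANSWER",   "box": (120, 10, 170, 20)},
]
print(group_tokens(tokens))
# [('QUESTION', 'Invoice No.', (10, 10, 90, 20)), ('ANSWER', '12345', (120, 10, 170, 20))]
```

Once tokens are merged into labeled entities like this, pairing each QUESTION entity with a nearby ANSWER entity is the remaining step.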

Laxmi530 commented 2 years ago

@fraps12 Thanks for the reply.

I just wanted to see how the model predicts; after that I will decide whether to go for fine-tuning or full training. This is what I have tried so far, but I am getting an error. Can you please help me fix it?

import numpy as np
import pytesseract
import torch
from PIL import Image
from transformers import AutoTokenizer, LayoutLMv2FeatureExtractor, LayoutLMv2ForRelationExtraction

feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = AutoTokenizer.from_pretrained(path, pad_token='')
model = LayoutLMv2ForRelationExtraction.from_pretrained(path)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

image_file = 'image4.png'
image = Image.open(image_file).convert('RGB')

width, height = image.size
w_scale = 1000 / width
h_scale = 1000 / height

# run OCR and scale the word boxes to the 0-1000 range the model expects
ocr_data = pytesseract.image_to_data(image, output_type='data.frame')
ocr_data = ocr_data.dropna()
ocr_data = ocr_data.assign(left_scaled=ocr_data.left * w_scale,
                           width_scaled=ocr_data.width * w_scale,
                           top_scaled=ocr_data.top * h_scale,
                           height_scaled=ocr_data.height * h_scale,
                           right_scaled=lambda x: x.left_scaled + x.width_scaled,
                           bottom_scaled=lambda x: x.top_scaled + x.height_scaled)
float_cols = ocr_data.select_dtypes('float').columns
ocr_data[float_cols] = ocr_data[float_cols].round(0).astype(int)
ocr_data = ocr_data.replace(r'^\s*$', np.nan, regex=True)
ocr_data = ocr_data.dropna().reset_index(drop=True)
words = list(ocr_data.text)

coordinates = ocr_data[['left', 'top', 'width', 'height']]
actual_boxes = []
for idx, row in coordinates.iterrows():
    x, y, w, h = tuple(row)  # the row comes in (left, top, width, height) format
    actual_boxes.append([x, y, x + w, y + h])  # turn it into (left, top, left+width, top+height) to get the actual box

def normalize_box(box, width, height):
    return [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]

boxes = [normalize_box(box, width, height) for box in actual_boxes]

encoding = tokenizer.encode_plus(words, boxes=boxes, return_tensors='pt')
input_ids = encoding['input_ids']
attention_mask = encoding['attention_mask']
bbox = encoding['bbox']
outputs = model(**encoding)

This is the error:

AttributeError                            Traceback (most recent call last)
c:\Users\name\Parallel\Trans_LayoutXLM.ipynb Cell 9 in <cell line: 1>()
----> [1](vscode-notebook-cell:/c%3A/Users/name/Parallel%20Project/Trans_LayoutXLM.ipynb#ch0000009?line=0) outputs = model(**encoding)

File c:\Users\name\.conda\envs\layoutlmft\lib\site-packages\torch\nn\modules\module.py:1130, in Module._call_impl(self, *input, **kwargs)
   1126 # If we don't have any hooks, we want to skip the rest of the logic in
   1127 # this function, and just call forward.
   1128 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
   1129         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1130     return forward_call(*input, **kwargs)
   1131 # Do not call functions when jit is used
   1132 full_backward_hooks, non_full_backward_hooks = [], []

File c:\Users\name\.conda\envs\layoutlmft\lib\site-packages\transformers\models\layoutlmv2\modeling_layoutlmv2.py:1585, in LayoutLMv2ForRelationExtraction.forward(self, input_ids, bbox, labels, image, attention_mask, token_type_ids, position_ids, head_mask, entities, relations)
   1522 @add_start_docstrings_to_model_forward(LAYOUTLMV2_INPUTS_DOCSTRING.format("batch_size, sequence_length"))
   1523 @replace_return_docstrings(output_type=RegionExtractionOutput, config_class=_CONFIG_FOR_DOC)
   1524 def forward(
   (...)
   1535     relations=None,
   1536 ):
   1537     r"""
   1538     entities (list of dicts of shape `(batch_size,)` where each dict contains:
   1539         {
   (...)
   1582     >>> relations = *****
   1583     ```"""
-> 1585     outputs = self.layoutlmv2(
   1586         input_ids=input_ids,
   1587         bbox=bbox,
   1588         image=image,
   1589         attention_mask=attention_mask,
   1590         token_type_ids=token_type_ids,
   1591         position_ids=position_ids,
   1592         head_mask=head_mask,
...
--> 590     images_input = ((images if torch.is_tensor(images) else images.tensor) - self.pixel_mean) / self.pixel_std
    591     features = self.backbone(images_input)
    592     features = features[self.out_feature_key]

AttributeError: 'NoneType' object has no attribute 'tensor'

Laxmi530 commented 1 year ago

Able to extract key-value pair, hence closing the issue.

hjerbii commented 1 year ago

Hello @Laxmi530 Could you please explain to me how you got the key-value pairs? Have you used the LayoutLmForRelationExtraction model?

Thanks!

Laxmi530 commented 1 year ago

I used LayoutLMv2 for the key-value pair extraction. From the form-understanding output, I set the question as the key and the answer as the value. You need to apply some post-processing technique on top.

hjerbii commented 1 year ago

Thanks for your answer @Laxmi530 .

I used LayoutLMV2 for the key value pair extraction.

You mean LayoutLMV2 for token classification, or it's another model?

To link the questions and answers, is it possible to share your approach? Actually, LayoutLMv2 for token classification operates only at the token level, i.e., it does not detect full questions/answers. So sometimes it's not possible to associate tokens to get full keys/values.

Thanks a lot!

Laxmi530 commented 1 year ago

I used this:

feature_extractor = LayoutLMv2FeatureExtractor.from_pretrained("microsoft/layoutlmv2-base-uncased")
tokenizer = LayoutLMv2TokenizerFast.from_pretrained("microsoft/layoutlmv2-base-uncased")
model = LayoutLMv2ForTokenClassification.from_pretrained("nielsr/layoutlmv2-finetuned-funsd")

The key-value pair extraction works at the token level only. You need to fine-tune the model on your own dataset, formatted like the FUNSD dataset. I did not go deep into the key-value pair extraction, but yes, I fine-tuned the model; out of 5 documents, it extracts the key-value pairs nicely in 3. One more thing: it uses pytesseract behind the scenes, so it can only process whatever text the OCR extracts.

hjerbii commented 1 year ago

@Laxmi530
Thanks for the explanation. But LayoutLMv2ForTokenClassification does not associate keys and values. It only extracts keys and values at the token level, without grouping together the tokens that belong to the same key or value. That's why I wanted to know how you associate them on your side (i.e., token level -> key/value level -> key-value pair)?
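[Editor's note: one common heuristic for that last linking step, not Laxmi530's actual method, is greedy nearest-neighbor matching between already-merged question and answer entities. The geometry scoring below is an illustrative sketch; on forms, the answer usually sits to the right of or below its question.]

```python
def pair_keys_values(questions, answers):
    """Greedily pair each question entity with the nearest unused answer entity.

    questions/answers: lists of (text, (x0, y0, x1, y1)) tuples in pixel space.
    Candidates are scored by the horizontal/vertical gap, with a large penalty
    for answers that lie above and to the left of the question (unlikely on forms).
    """
    pairs, used = {}, set()
    for q_text, (qx0, qy0, qx1, qy1) in questions:
        best, best_score = None, float("inf")
        for i, (a_text, (ax0, ay0, ax1, ay1)) in enumerate(answers):
            if i in used:
                continue
            dx = max(0, ax0 - qx1)  # horizontal gap (0 if overlapping)
            dy = max(0, ay0 - qy1)  # vertical gap
            penalty = 1000 if (ax1 < qx0 and ay1 < qy0) else 0
            score = dx + dy + penalty
            if score < best_score:
                best, best_score = i, score
        if best is not None:
            pairs[q_text] = answers[best][0]
            used.add(best)
    return pairs

questions = [("Name:", (10, 10, 50, 20)), ("Date:", (10, 40, 50, 50))]
answers = [("Luis Perez", (60, 10, 120, 20)), ("2023-01-01", (60, 40, 130, 50))]
print(pair_keys_values(questions, answers))
# {'Name:': 'Luis Perez', 'Date:': '2023-01-01'}
```

This is exactly the kind of "algorithmic logic" fraps12 alluded to earlier in the thread; the trained relation extraction head in LayoutLMv2ForRelationExtraction learns this linking instead of hard-coding it.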

Laxmi530 commented 1 year ago

Sure, I will help you, but I did not find any details on your GitHub profile. Can you please share some contact details, like LinkedIn, so that if I need any help in the future I can message you?

hjerbii commented 1 year ago

Many thanks @Laxmi530. You should now see my LinkedIn profile link on my GitHub profile!

If you want, we can keep talking about the relation extraction there.

Laxmi530 commented 1 year ago

Thank you so much @hjerbii for sharing your LinkedIn profile. Whatever doubts we have, we can discuss over there. Thank you.

Muhammad-Hamza-Jadoon commented 1 month ago

A lot of thanks @Laxmi530. You should now see my LinkedIn profile link on my github profile!

If you want, we can keep talking about the relation extraction there.

Hi, I'm also working on extracting key-value pairs from image documents. So far I'm stuck at the token classification that LayoutLM does. Can you provide some further information on how you extracted key-value pairs from documents like images?

I'd really appreciate it if you could share Laxmi530's workings with me.

Regards

NielsRogge commented 1 month ago

Hi,

For key-value pair extraction I would recommend leveraging generative models like Donut or PaliGemma. These are trained to simply generate JSON given a document image, for instance, so that's much easier to handle compared to LayoutLM.

See my demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/PaliGemma/Fine_tune_PaliGemma_for_image_%3EJSON.ipynb
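[Editor's note: to make the generative route concrete, Donut-style models emit a flat token sequence such as `<s_country>us</s_country>` that is then parsed back into a dict. Below is a simplified illustration of that parsing step; the real Donut `token2json` also handles nesting and lists, and the tag names come from your training data, so the ones here are hypothetical.]

```python
import re

def tokens_to_json(sequence):
    """Parse a flat Donut-style sequence of <s_field>value</s_field> spans into a dict."""
    fields = {}
    # \1 back-reference ensures each closing tag matches its opening tag
    for name, value in re.findall(r"<s_(\w+)>(.*?)</s_\1>", sequence):
        fields[name] = value.strip()
    return fields

output = "<s_country>us</s_country><s_contact_name>luis perez</s_contact_name>"
print(tokens_to_json(output))  # {'country': 'us', 'contact_name': 'luis perez'}
```

No OCR, box normalization, or token-to-entity grouping is needed with this approach, which is why it is easier to handle than the LayoutLM pipeline discussed above.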

Muhammad-Hamza-Jadoon commented 1 month ago

Hi,

For key-value pair extraction I would recommend leveraging generative models like Donut, PaliGemma. These are trained to simply generate a JSON for instance given a document image, so that's much easier to handle compared to LayoutLM.

See my demo notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/PaliGemma/Fine_tune_PaliGemma_for_image_%3EJSON.ipynb

But my documents are really messed up. Take this for example: [attached image: 1dc2cd86-18f2-4f89-b685-9234366cbecc]

So far I've been trying to understand LayoutLM and to use the relation extraction module on top (the one provided in some Colab notebooks) to get the key-value pairs, based on the model fine-tuned on the XFUND dataset. I haven't been able to get this working yet; I keep running into errors.

I have to extract something like "country": "us", "contact name": "luis perez", etc. I also have to handle all the tables, so that's another thing.

All I want to ask is for some general guidance and direction at this point. Should I leave the LayoutLM approach altogether and try these PaliGemma and Donut models? They also seem to only work for small invoice images; will they be capable enough to work for document images that resemble mine?

Much appreciated. Regards.