NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

Train layoutlmv3 with custom dataset by loading from local directory #123

Closed deepanshudashora closed 1 year ago

deepanshudashora commented 2 years ago

How can I train LayoutLMv3 with a custom dataset loaded from a local directory?

jyotiyadav94 commented 2 years ago

@deepanshudashora

Were you able to create your own dataset? You can try using Label Studio OCR.

NielsRogge commented 2 years ago

Hi,

You can create a regular PyTorch Dataset as follows:

from torch.utils.data import Dataset
from PIL import Image

class CustomDataset(Dataset):
    def __init__(self, root, df, processor):
        self.root = root
        self.df = df
        self.processor = processor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get document image + corresponding words and boxes
        item = self.df.iloc[idx]
        image = Image.open(self.root + ...).convert('RGB')
        words = item.words
        boxes = item.boxes

        # use processor to prepare everything for the model
        encoding = self.processor(image, words, boxes=boxes)

        return encoding

This is just a draft, assuming you have a root folder with all your document images, and a Pandas dataframe that contains the words + boxes for each document image.

You can then instantiate the dataset as follows:

from transformers import LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base", apply_ocr=False)  # apply_ocr=False since we provide our own words and boxes

dataset = CustomDataset(root="path_to_your_root", df=your_dataframe, processor=processor)  # your_dataframe is a pandas DataFrame
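
If you want to batch these examples with a PyTorch DataLoader, it helps to let the processor pad/truncate to a fixed length and return tensors (truncation, padding and return_tensors are standard processor arguments). A minimal sketch:

from torch.utils.data import DataLoader

# inside __getitem__, a fixed length makes the examples stackable:
# encoding = self.processor(image, words, boxes=boxes, truncation=True,
#                           padding="max_length", return_tensors="pt")
# encoding = {k: v.squeeze(0) for k, v in encoding.items()}  # drop the batch dimension

dataloader = DataLoader(dataset, batch_size=2, shuffle=True)
batch = next(iter(dataloader))
print({k: v.shape for k, v in batch.items()})
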
photopea commented 2 years ago

Hi @NielsRogge, is it possible to get in touch with you somehow? I could not find your email address, and your Twitter messages are blocked. If you won a billion dollars, there would be no way to tell you about it :D

I am Ivan from www.Photopea.com, which is probably the best free photo editor that exists today :) Would you be interested in some kind of cooperation? We have millions of users. We would like to add AI features, which could run on the Hugging Face infrastructure. You can write to me on Twitter: https://twitter.com/photopeacom or at support@photopea.com

aditya11ad commented 2 years ago

Hi, I created an annotations folder containing JSON files like this:

{
    "form": [
        {
            "box": [
                84,
                109,
                136,
                119
            ],
            "text": "23456789",
            "label": "invoice_num",
            "words": [
                {
                    "box": [
                        84,
                        109,
                        136,
                        119
                    ],
                    "text": "23456789"
                }
            ]
.
.
.

Please guide me: how can I train LayoutLMv3 on this?

jyotiyadav94 commented 2 years ago

Hi @aditya11ad

you need to follow this script to get the input into the format LayoutLMv3 expects: https://huggingface.co/datasets/nielsr/funsd-layoutlmv3/blob/main/funsd-layoutlmv3.py

aditya11ad commented 2 years ago

Thanks for the quick response, but I didn't get how this script takes the inputs.

techthiyanes commented 2 years ago

Actually, this script takes bounding box inputs as (left, top, right, bottom) coordinates. These should be normalized: x by the image width and y by the image height. For the labels, a token at the start of an entity gets a B- prefix (beginning), otherwise an I- prefix (intermediate). The actual data gets downloaded from the site; refer to the annotations folder for further information. To get annotations, use any OCR option, like Google Tesseract or Azure Form Recognizer.
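
To make that concrete, here is a sketch following the conventions of that script (normalize_bbox mirrors the helper defined there; the label names are illustrative, not the actual FUNSD label set):

def normalize_bbox(bbox, width, height):
    # scale pixel coordinates (left, top, right, bottom) to the 0-1000 range LayoutLM expects
    left, top, right, bottom = bbox
    return [
        int(1000 * left / width),
        int(1000 * top / height),
        int(1000 * right / width),
        int(1000 * bottom / height),
    ]

# BIO scheme: the first word of an entity gets a B- prefix, subsequent words get I-
words = ["Invoice", "number", "23456789"]
labels = ["B-invoice_num", "I-invoice_num", "I-invoice_num"]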

NielsRogge commented 2 years ago

Hi,

You don't necessarily have to write a script like the one for FUNSD. You can just create a custom PyTorch dataset, which I explain here: https://github.com/NielsRogge/Transformers-Tutorials/issues/123#issuecomment-1156256273

NielsRogge commented 2 years ago

Hi @photopea, thanks for reaching out. I've forwarded your request to the team, someone will reach out :)

aditya11ad commented 2 years ago

Hi, I have prepared the dataframe like this:

[image: screenshot of the prepared dataframe]

Now what should the fine-tuning script look like?

pavel-nesterov commented 2 years ago

Hi @aditya11ad
This might be helpful https://github.com/ruifcruz/sroie-on-layoutlm/blob/main/LayoutLM_fine_tunning_for_SROIE_dataset.ipynb

PoonamS25 commented 1 year ago

Hi, can we train LayoutLM on a custom dataset with .txt annotation files (YOLO-format annotation files) available on the local machine?

NielsRogge commented 1 year ago

Hi @PoonamS25, you'll probably need to convert them to the format that LayoutLM expects. Basically for each document you need a list of words, with corresponding bounding box coordinates and labels. Each bounding box needs to be in the format (x0, y0, x1, y1), where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
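
For YOLO-format .txt files specifically, each line holds a class id plus a box as normalized (x_center, y_center, width, height), so the conversion could look like the sketch below (assuming coordinates normalized to [0, 1]; note that YOLO files contain no text, so the words themselves still have to come from an OCR step):

def yolo_to_corners(x_c, y_c, w, h, img_width, img_height):
    # YOLO stores box centers and sizes relative to the image size;
    # LayoutLM expects absolute corner coordinates (x0, y0, x1, y1)
    x0 = (x_c - w / 2) * img_width
    y0 = (y_c - h / 2) * img_height
    x1 = (x_c + w / 2) * img_width
    y1 = (y_c + h / 2) * img_height
    return [int(x0), int(y0), int(x1), int(y1)]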

DANISHFAYAZNAJAR commented 1 year ago

(quoting @NielsRogge's CustomDataset example from above)

@NielsRogge what about the labels? Using OCR we can get words and bounding boxes, but you haven't mentioned anything about labels here. I believe we also need to generate labels somehow and pass them in. Can you clarify whether we need those or not?

vcjayan commented 1 year ago

@NielsRogge if I train LayoutLM on custom invoice images, what should the annotation format be? Should I use a Q-A format like FUNSD (invoice_number_Q & invoice_number_A, date_Q & date_A, etc.), or can I just annotate all labels directly, like invoice_number, invoice_date, vendor_name, etc.?

NielsRogge commented 1 year ago

@DANISHFAYAZNAJAR if you have labels at the word level (like the FUNSD dataset has), then you can do the following:

# get document image + corresponding words, boxes and labels at the word level
item = self.df.iloc[idx]
image = Image.open(self.root + ...).convert('RGB')
words = item.words
boxes = item.boxes
word_labels = item.ner_tags

# use processor to prepare everything for the model
encoding = self.processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")

# remove batch dimension which the processor adds by default
encoding = {k:v.squeeze() for k,v in encoding.items()}

return encoding
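
Once __getitem__ returns fixed-length tensor encodings like this, a bare-bones fine-tuning loop could look as follows. This is just a sketch; label_list stands in for your own list of class names:

import torch
from torch.utils.data import DataLoader
from transformers import LayoutLMv3ForTokenClassification

model = LayoutLMv3ForTokenClassification.from_pretrained(
    "microsoft/layoutlmv3-base", num_labels=len(label_list)
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
model.train()
for batch in dataloader:
    # the processor's output keys match the model's forward signature
    outputs = model(**batch)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
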
NielsRogge commented 1 year ago

@vcjayan you just need a list of words, their boxes and their labels for each document.

So this could look like:

words = ["hello", "world", "this", "is", "invoice", "number", "14721"]
boxes = [[1,2,3,4] for _ in range(len(words))]
word_labels = ["other", "other", "other", "other", "other", "other", "invoice_number"]

assuming you have 2 classes ("other" and "invoice_number")
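
One caveat: the processor expects word_labels as integer ids rather than strings, so map the class names first. A small sketch:

label_list = ["other", "invoice_number"]  # your own class names
label2id = {label: idx for idx, label in enumerate(label_list)}

word_labels = [label2id[label] for label in
               ["other", "other", "other", "other", "other", "other", "invoice_number"]]
# -> [0, 0, 0, 0, 0, 0, 1]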

ankitarajsharma commented 1 year ago

@NielsRogge how would we specify the train and test splits? I am using this notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv3/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb. I tried the code provided above, but the dataset is returned as a <__main__.CustomDataset at 0x7a1c979eead0> object, whereas the expectation was a DatasetDict object. What am I missing? Please help.

NielsRogge commented 1 year ago

You can create 2 instances, like so:

train_dataset = CustomDataset(dataset=dataset["train"], processor=processor)
val_dataset = CustomDataset(dataset=dataset["validation"], processor=processor)
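
For reference, here is a minimal sketch of a CustomDataset that accepts such a split (assuming the split has image, tokens, bboxes and ner_tags columns, as the FUNSD dataset on the Hub does, and a processor created with apply_ocr=False):

from torch.utils.data import Dataset

class CustomDataset(Dataset):
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        example = self.dataset[idx]
        encoding = self.processor(
            example["image"].convert("RGB"),
            example["tokens"],
            boxes=example["bboxes"],
            word_labels=example["ner_tags"],
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        # remove the batch dimension the processor adds by default
        return {k: v.squeeze(0) for k, v in encoding.items()}
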
madhavi1102 commented 1 year ago

(quoting @NielsRogge's CustomDataset example from above)

@NielsRogge for this CustomDataset, should __getitem__ return just the encoding, or can we return the fields explicitly, like this?

return {
    "input_ids": torch.tensor(self.encoding["input_ids"][index], dtype=torch.int64),
    "attention_mask": torch.tensor(self.encoding["attention_mask"][index], dtype=torch.int64),
    "bbox": torch.tensor(self.encoding["bbox"], dtype=torch.int64),
    "pixel_values": torch.tensor(self.encoding["pixel_values"], dtype=torch.float32),
    "labels": torch.tensor(self.encoding["labels"], dtype=torch.int64),
}

Dhananjay-97 commented 1 year ago

(quoting @aditya11ad's JSON annotation example from above)

Please tell me, how did you generate the annotations?

Aesthethic0de commented 1 year ago

Hi, I trained on form-style annotated documents (200 images). The model trained for 100 epochs, but I am not getting any results when running inference.

vidya-chandran commented 9 months ago

@NielsRogge can you help me create a mapping between the predicted labels and the associated tokens (words)? I tried extracting text from the bounding boxes, which was not accurate. Can we create a direct mapping from the labels to the words? Many thanks in advance.

yashakagf commented 7 months ago

(quoting @aditya11ad's JSON annotation example from above)

Hey, can you guide me on how you prepared the dataset in this format?

dsoft-jvo commented 3 months ago

You can create 2 instances, like so:

train_dataset = CustomDataset(dataset=dataset["train"], processor=processor)
val_dataset = CustomDataset(dataset=dataset["validation"], processor=processor)

@NielsRogge How should the CustomDataset class be edited to allow these lines?