@deepanshudashora were you able to create your own dataset? You can try using Label Studio OCR.
Hi,
You can create a regular PyTorch Dataset as follows:
```python
from torch.utils.data import Dataset
from PIL import Image

class CustomDataset(Dataset):
    def __init__(self, root, df, processor):
        self.root = root
        self.df = df
        self.processor = processor

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        # get document image + corresponding words and boxes
        item = self.df.iloc[idx]
        image = Image.open(self.root + ...).convert('RGB')
        words = item.words
        boxes = item.boxes

        # use processor to prepare everything for the model
        encoding = self.processor(image, words, boxes=boxes)

        return encoding
```
This is just a draft, assuming you have a root folder with all your document images, and a Pandas dataframe that contains the words + boxes for each document image.
You can then instantiate the dataset as follows:
```python
from transformers import LayoutLMv3Processor

processor = LayoutLMv3Processor.from_pretrained("microsoft/layoutlmv3-base")
dataset = CustomDataset(root="path_to_your_root", df=your_dataframe, processor=processor)
```
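If you then want to batch examples for training, a PyTorch DataLoader can wrap the dataset. A minimal sketch, assuming the processor call inside `__getitem__` is extended with `truncation=True, padding="max_length", return_tensors="pt"` and the extra batch dimension is squeezed out, so that every example has identical tensor shapes:

```python
from torch.utils.data import DataLoader

# default collation stacks the per-example tensors, which works because
# padding="max_length" gives every example the same shape
dataloader = DataLoader(dataset, batch_size=2, shuffle=True)

batch = next(iter(dataloader))
for k, v in batch.items():
    print(k, v.shape)
```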
Hi @NielsRogge, is it possible to get in touch with you somehow? I could not find your email address, and your Twitter messages are blocked. If you won a billion dollars, there would be no way to tell you about it :D
I am Ivan from www.Photopea.com, which is probably the best free photo editor that exists today :) Would you be interested in some kind of cooperation? We have millions of users. We would like to add AI features, which could run on the Hugging Face infrastructure. You can write to me on Twitter: https://twitter.com/photopeacom or at support@photopea.com
Hi, I created the annotations folder containing JSON files like this:

```json
{
  "form": [
    {
      "box": [84, 109, 136, 119],
      "text": "23456789",
      "label": "invoice_num",
      "words": [
        {
          "box": [84, 109, 136, 119],
          "text": "23456789"
        }
      ]
    },
    ...
```

Please guide me: how can I train LayoutLMv3 on this?
Hi @aditya11ad,
you need to follow this script in order for LayoutLMv3 to accept the input: https://huggingface.co/datasets/nielsr/funsd-layoutlmv3/blob/main/funsd-layoutlmv3.py
Thanks for the quick response, but I didn't get how this script takes the inputs.
Actually, this script takes bounding box inputs as (left, top) and (right, bottom) corners. These should be normalized: x by the width and y by the height. As for the tokens, if a token is the start of a word in the sentence, its label should be prefixed with B- (beginning); otherwise with I- (intermediate). The actual data gets downloaded from the site; refer to the annotations folder for further information. To get annotations, use any OCR option like Google Tesseract or Azure Form Recognizer.
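On the normalization point: if you pass your own boxes to the processor (rather than letting it run OCR), the LayoutLM family expects coordinates scaled to the range 0-1000. A minimal sketch:

```python
def normalize_box(box, width, height):
    # LayoutLM-family models expect coordinates scaled to a 0-1000 range
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    ]

# e.g. a box on an 800x600 page
print(normalize_box([84, 109, 136, 119], width=800, height=600))
```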
Hi,
You don't necessarily have to write a script like the one for FUNSD. You can just create a custom PyTorch dataset, which I explain here: https://github.com/NielsRogge/Transformers-Tutorials/issues/123#issuecomment-1156256273
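For annotations in the JSON format shown above, the flattening into parallel lists of words, boxes and labels could look like the sketch below (the `"form"`, `"words"`, `"box"`, `"text"` and `"label"` keys follow the example posted earlier; the function name is mine):

```python
import json

def parse_annotation(json_path):
    # flatten a FUNSD-style annotation file into words, boxes and labels
    with open(json_path) as f:
        annotation = json.load(f)
    words, boxes, labels = [], [], []
    for entity in annotation["form"]:
        for word in entity["words"]:
            words.append(word["text"])
            boxes.append(word["box"])
            labels.append(entity["label"])
    return words, boxes, labels
```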
Hi @photopea, thanks for reaching out. I've forwarded your request to the team, someone will reach out :)
Hi, I have prepared the dataframe like this.
Now, what should the script for fine-tuning look like?
Hi @aditya11ad
This might be helpful https://github.com/ruifcruz/sroie-on-layoutlm/blob/main/LayoutLM_fine_tunning_for_SROIE_dataset.ipynb
Hi, can we train LayoutLM on a custom dataset with .txt annotation files (YOLO-format annotation files) available on a local machine?
Hi @PoonamS25, you'll probably need to convert them to the format that LayoutLM expects. Basically for each document you need a list of words, with corresponding bounding box coordinates and labels. Each bounding box needs to be in the format (x0, y0, x1, y1), where (x0, y0) corresponds to the position of the upper left corner in the bounding box, and (x1, y1) represents the position of the lower right corner.
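For the YOLO case specifically, the coordinate conversion could look like the sketch below. Note that YOLO .txt files store class ids and normalized (x_center, y_center, width, height) values, so the words themselves would still have to come from OCR and be matched to the boxes separately (the function name is mine):

```python
def yolo_to_corners(x_center, y_center, w, h, img_width, img_height):
    # YOLO stores boxes as normalized (x_center, y_center, width, height);
    # convert to absolute (x0, y0, x1, y1) corner coordinates
    x0 = (x_center - w / 2) * img_width
    y0 = (y_center - h / 2) * img_height
    x1 = (x_center + w / 2) * img_width
    y1 = (y_center + h / 2) * img_height
    return [int(x0), int(y0), int(x1), int(y1)]
```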
@NielsRogge what about the labels? Using OCR we can get words and bounding boxes, but you haven't mentioned anything about labels. I believe we also need to generate labels somehow and use them. Can you clarify whether we need those or not?
@NielsRogge if I train LayoutLM on custom invoice images, what should the annotation format be? Should I use a Q-A format like FUNSD (invoice_number_Q & invoice_number_A, date_Q & date_A, etc.), or can I just annotate all labels directly, like invoice_number, invoice_date, vendor_name, etc.?
@DANISHFAYAZNAJAR if you have labels at the word level (like the FUNSD dataset has), then you can do the following:
```python
def __getitem__(self, idx):
    # get document image + corresponding words, boxes and labels at the word level
    item = self.df.iloc[idx]
    image = Image.open(self.root + ...).convert('RGB')
    words = item.words
    boxes = item.boxes
    word_labels = item.ner_tags

    # use processor to prepare everything for the model
    encoding = self.processor(image, words, boxes=boxes, word_labels=word_labels, return_tensors="pt")

    # remove batch dimension which the processor adds by default
    encoding = {k: v.squeeze() for k, v in encoding.items()}

    return encoding
```
@vcjayan you just need a list of words, their boxes and their labels for each document. So this could look like:

```python
words = ["hello", "world", "this", "is", "invoice", "number", "14721"]
boxes = [[1, 2, 3, 4] for _ in range(len(words))]
word_labels = ["other", "other", "other", "other", "other", "other", "invoice_number"]
```

assuming you have 2 classes ("other" and "invoice_number").
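One caveat, in case it helps: as far as I know the processor expects `word_labels` as integer ids rather than strings, so a mapping would be needed first. A hedged sketch (the `label2id` dict is an assumption for this two-class example):

```python
# map string labels to integer ids before calling the processor
label2id = {"other": 0, "invoice_number": 1}
id2label = {v: k for k, v in label2id.items()}

word_label_ids = [label2id[label] for label in word_labels]

# `image`, `words` and `boxes` as in the example above
encoding = processor(image, words, boxes=boxes, word_labels=word_label_ids, return_tensors="pt")
```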
@NielsRogge how would we specify the train and test splits? I am using this notebook: https://github.com/NielsRogge/Transformers-Tutorials/blob/master/LayoutLMv3/Fine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb. I tried the code provided above, but the dataset is returned as a <__main__.CustomDataset at 0x7a1c979eead0> object, whereas the expectation was a DatasetDict object. What am I missing? Please help.
You can create 2 instances, like so:

```python
train_dataset = CustomDataset(dataset=dataset["train"], processor=processor)
val_dataset = CustomDataset(dataset=dataset["validation"], processor=processor)
```
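Note that these lines pass a `dataset` split instead of `root`/`df`, so the `CustomDataset` class from above would need a slightly different `__init__`. A hedged sketch, assuming the splits have the same columns as the FUNSD dataset (`image`, `tokens`, `bboxes`, `ner_tags`):

```python
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    # variant that wraps a split of a HuggingFace DatasetDict
    def __init__(self, dataset, processor):
        self.dataset = dataset
        self.processor = processor

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        example = self.dataset[idx]
        encoding = self.processor(
            example["image"],
            example["tokens"],
            boxes=example["bboxes"],
            word_labels=example["ner_tags"],
            truncation=True,
            padding="max_length",
            return_tensors="pt",
        )
        # remove the batch dimension the processor adds
        return {k: v.squeeze(0) for k, v in encoding.items()}
```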
@NielsRogge for this CustomDataset, should the return item of the `__getitem__` method only be `encoding`, or can we separately return:

```python
return {
    "input_ids": torch.tensor(self.encoding["input_ids"][index], dtype=torch.int64),
    "attention_mask": torch.tensor(self.encoding["attention_mask"][index], dtype=torch.int64),
    "bbox": torch.tensor(self.encoding["bbox"], dtype=torch.int64),
    "pixel_values": torch.tensor(self.encoding["pixel_values"], dtype=torch.float32),
    "labels": torch.tensor(self.encoding["labels"], dtype=torch.int64),
}
```
Please tell me, how did you generate the annotations in that JSON format?
Hi, I trained on a form dataset (annotated documents, 200 images). The model trained for 100 epochs, but I am not getting any results when running inference.
@NielsRogge can you help me with creating a mapping between the predicted labels and their associated tokens (words)? I tried extracting text from the bounding boxes, but it was not accurate. Can we create a direct mapping from the labels to the words? Many thanks in advance.
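One possible approach (a sketch, not the exact method from the notebooks): use the `word_ids()` method of the fast tokenizer's output to map token-level predictions back to the original words, keeping only the first sub-token of each word. This assumes `encoding` was produced by the processor with `return_tensors="pt"`, `words` is the original word list, and `model` is a fine-tuned `LayoutLMv3ForTokenClassification`:

```python
import torch

with torch.no_grad():
    outputs = model(**encoding)
predictions = outputs.logits.argmax(-1).squeeze().tolist()

# word_ids() maps each token position to the index of the word it came from
# (None for special tokens)
word_ids = encoding.word_ids(batch_index=0)

word_predictions = []
previous_word_idx = None
for token_idx, word_idx in enumerate(word_ids):
    if word_idx is not None and word_idx != previous_word_idx:
        # keep the prediction of the first sub-token of each word
        word_predictions.append((words[word_idx], model.config.id2label[predictions[token_idx]]))
    previous_word_idx = word_idx

print(word_predictions)
```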
Hey, can you guide me on how you prepared the dataset in that JSON format?
> You can create 2 instances, like so:
>
> ```python
> train_dataset = CustomDataset(dataset=dataset["train"], processor=processor)
> val_dataset = CustomDataset(dataset=dataset["validation"], processor=processor)
> ```
@NielsRogge How should the CustomDataset class be edited to allow these lines?
How can I train LayoutLMv3 on a custom dataset loaded from a local directory?