NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

How to create a PyTorch Dataset for LayoutLMv2 from my custom images and JSON file? #151

Open KnitVikas opened 2 years ago

KnitVikas commented 2 years ago

@NielsRogge thanks for this LayoutLMv2 implementation in HF. I want to create a torch Dataset from my custom images and JSON files (for now, assume the data is the downloaded FUNSD dataset). Please guide me on how to create this torch Dataset so that I can pass the data to LayoutLMv2Processor and apply a map function.

This is what I tried:

```python
import json
import os

import torch
from torch.utils.data import Dataset
from detectron2.data.detection_utils import read_image
from detectron2.data.transforms import ResizeTransform, TransformList


def normalize_bbox(box, size):
    # Scale box coordinates to the 0-1000 range expected by LayoutLMv2.
    width, height = size
    return [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]


def load_image(image_path):
    # Read the image in BGR (detectron2 convention), resize to 224x224 and
    # convert to a (C, H, W) tensor; also return the original (width, height)
    # so the boxes can be normalized against it.
    image = read_image(image_path, format="BGR")
    h, w = image.shape[:2]
    img_trans = TransformList([ResizeTransform(h=h, w=w, new_h=224, new_w=224)])
    image = torch.tensor(img_trans.apply_image(image).copy()).permute(2, 0, 1)  # copy to make it writeable
    return image, (w, h)


label2id = {
    "O": 0,
    "B-HEADER": 1,
    "I-HEADER": 2,
    "B-QUESTION": 3,
    "I-QUESTION": 4,
    "B-ANSWER": 5,
    "I-ANSWER": 6,
    "B-O": 7,
}


class CustomTextDataset(Dataset):
    def __init__(self, filepath):
        self.filepath = filepath
        self.ann_dir = os.path.join(filepath, "annotations")
        self.img_dir = os.path.join(filepath, "images")
        self.ann_files = sorted(os.listdir(self.ann_dir))

    def __len__(self):
        # Number of annotation files (not the length of the path string).
        return len(self.ann_files)

    def __getitem__(self, idx):
        # Parse only the idx-th annotation file instead of rebuilding the
        # whole dataset on every call.
        file = self.ann_files[idx]
        with open(os.path.join(self.ann_dir, file), "r", encoding="utf8") as f:
            data = json.load(f)

        image_path = os.path.join(self.img_dir, file).replace("json", "png")
        image, size = load_image(image_path)

        tokens, bboxes, ner_tags = [], [], []
        for item in data["form"]:
            words, label = item["words"], item["label"]
            words = [w for w in words if w["text"].strip() != ""]
            if len(words) == 0:
                continue
            if label == "other":
                for w in words:
                    tokens.append(w["text"])
                    ner_tags.append(label2id["O"])
                    bboxes.append(normalize_bbox(w["box"], size))
            else:
                # The first word of an entity gets the B- tag, the rest I- tags.
                tokens.append(words[0]["text"])
                ner_tags.append(label2id["B-" + label.upper()])
                bboxes.append(normalize_bbox(words[0]["box"], size))
                for w in words[1:]:
                    tokens.append(w["text"])
                    ner_tags.append(label2id["I-" + label.upper()])
                    bboxes.append(normalize_bbox(w["box"], size))

        return {
            "id": str(idx),
            "tokens": tokens,
            "bboxes": bboxes,
            "ner_tags": ner_tags,
            "image_path": image_path,
        }
```

By creating the data this way, I got the following error while training: `TypeError: LayoutLMv2ForTokenClassification object argument after ** must be a mapping, not list`

Please help me out. Thanks in advance for a solution.
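For reference, that kind of `**` error usually means the model is receiving a list rather than a mapping of tensors. One way around it is to run each sample through LayoutLMv2Processor before returning it, so the dataset (or a `.map()` step) yields the encoding the model expects. Below is a minimal sketch, assuming the sample dict produced by the class above and the processor setup from the LayoutLMv2 tutorials; `encode_sample` is a hypothetical helper name:

```python
from PIL import Image
from transformers import (
    LayoutLMv2FeatureExtractor,
    LayoutLMv2Processor,
    LayoutLMv2Tokenizer,
)

# apply_ocr=False because we supply our own words and 0-1000 normalized boxes.
feature_extractor = LayoutLMv2FeatureExtractor(apply_ocr=False)
tokenizer = LayoutLMv2Tokenizer.from_pretrained("microsoft/layoutlmv2-base-uncased")
processor = LayoutLMv2Processor(feature_extractor, tokenizer)


def encode_sample(sample):
    # `sample` is the dict returned by CustomTextDataset.__getitem__ above.
    # The feature extractor handles image resizing itself, so a plain RGB
    # PIL image is enough here.
    image = Image.open(sample["image_path"]).convert("RGB")
    encoding = processor(
        image,
        sample["tokens"],
        boxes=sample["bboxes"],
        word_labels=sample["ner_tags"],
        truncation=True,
        padding="max_length",
        max_length=512,
        return_tensors="pt",
    )
    # Drop the batch dimension added by return_tensors="pt" so a DataLoader
    # can do the batching; the result is a mapping of tensors that can be
    # unpacked into the model as model(**batch).
    return {k: v.squeeze(0) for k, v in encoding.items()}
```

Returning something like this from `__getitem__` (or applying it via a map step) gives batches that unpack cleanly into `LayoutLMv2ForTokenClassification`.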

sheikhasim commented 2 years ago

You can refer to the Hugging Face documentation on creating a pre-processing file to prepare your dataset for LayoutLMv2. I did the same for my own custom dataset. Below is the FUNSD pre-processing file that I referenced: https://huggingface.co/datasets/nielsr/funsd/blob/main/funsd.py

You can check that out; it works for LayoutLMv2. Hope it helps.
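For orientation, the overall shape of such a loading script looks roughly like the sketch below; the class name, features, URL and the `parse_annotations` helper are illustrative placeholders rather than the actual contents of funsd.py:

```python
import datasets


def parse_annotations(filepath):
    """Hypothetical helper: parse the FUNSD-style JSON files under `filepath`
    and return a list of dicts with id, tokens, bboxes, ner_tags, image_path,
    using the same logic as the __getitem__ shown earlier in this thread."""
    raise NotImplementedError


class FunsdLikeDataset(datasets.GeneratorBasedBuilder):
    """Rough skeleton of a FUNSD-style loading script for the datasets library."""

    def _info(self):
        # Declare the schema of one example.
        return datasets.DatasetInfo(
            features=datasets.Features(
                {
                    "id": datasets.Value("string"),
                    "tokens": datasets.Sequence(datasets.Value("string")),
                    "bboxes": datasets.Sequence(datasets.Sequence(datasets.Value("int64"))),
                    "ner_tags": datasets.Sequence(
                        datasets.features.ClassLabel(
                            names=["O", "B-HEADER", "I-HEADER", "B-QUESTION",
                                   "I-QUESTION", "B-ANSWER", "I-ANSWER"]
                        )
                    ),
                    "image_path": datasets.Value("string"),
                }
            )
        )

    def _split_generators(self, dl_manager):
        # The hosted script downloads and extracts an archive here; this is
        # where a Google Drive (or any other) URL would go.
        data_dir = dl_manager.download_and_extract("https://example.com/dataset.zip")  # placeholder URL
        return [
            datasets.SplitGenerator(
                name=datasets.Split.TRAIN,
                gen_kwargs={"filepath": f"{data_dir}/dataset/training_data/"},
            ),
        ]

    def _generate_examples(self, filepath):
        # Walk the annotation files and yield (guid, example) pairs matching
        # the features declared in _info().
        for guid, example in enumerate(parse_annotations(filepath)):
            yield guid, example
```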

KnitVikas commented 2 years ago

@sheikhasim Thanks for your reply. But https://huggingface.co/datasets/nielsr/funsd/blob/main/funsd.py downloads the FUNSD dataset and uses that, whereas my data is stored locally. How do I load it using Hugging Face's load_dataset, and what changes are required?

sheikhasim commented 2 years ago

@KnitVikas If your dataset is already in the required input format, i.e. images + JSON, there are two ways to do it:

  1. Store a zip file of your dataset in Google Drive and create a pre-processing script on the Hugging Face Hub, pointing the download-and-extract step at the Drive URL of dataset.zip. Then load_dataset(name_of_hugging_face_script) will download the dataset from Drive and pre-process it.

  2. Use Hugging Face's way of loading a dataset locally (see the sketch after this list): `from datasets import load_dataset` followed by `dataset = load_dataset('PATH/TO/MY/LOADING/SCRIPT', data_files='PATH/TO/MY/FILE')`. Documentation: https://huggingface.co/docs/datasets/v1.11.0/loading_datasets.html
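For the second route, a minimal sketch of the call; both paths are placeholders, and the loading script would be a local copy of funsd.py adapted to read local files instead of downloading:

```python
from datasets import load_dataset

# The first argument points at the local loading script (a .py file);
# data_files is forwarded to the script and is available there as
# self.config.data_files, so _split_generators can use it instead of
# dl_manager.download_and_extract.
dataset = load_dataset(
    "PATH/TO/MY/LOADING/SCRIPT",   # placeholder, e.g. a local my_funsd.py
    data_files="PATH/TO/MY/FILE",  # placeholder path to the local data
)
print(dataset)
```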