NielsRogge / Transformers-Tutorials

This repository contains demos I made with the Transformers library by HuggingFace.

Layoutxlm different input size when batch #75

Open · guoxiaolu opened this issue 2 years ago

guoxiaolu commented 2 years ago

I have tested LayoutXLM on SROIE, but each sample's `encoded_inputs` has a different size (e.g. 176 or 348), and the inputs don't have a `token_type_ids` key. Because the tensors have different lengths, the default DataLoader collate function cannot stack them into a batch, so model training fails.

Besides, loading the model with

```python
model = LayoutLMv2ForTokenClassification.from_pretrained('microsoft/layoutxlm-base', num_labels=len(labels))
```

leads to this error:

```
File "/home/guoxiaolu/.local/lib/python3.6/site-packages/transformers/modeling_utils.py", line 1489, in from_pretrained
    model = cls(config, *model_args, **model_kwargs)
TypeError: __init__() got an unexpected keyword argument '_configuration_file'
```

If 'microsoft/layoutlmv2-base-uncased' is loaded instead, it works correctly.

```python
from collections import Counter

import torch
from PIL import Image
from torch.utils.data import DataLoader, Dataset
from transformers import LayoutLMv2ForTokenClassification, LayoutXLMProcessor


class SROIEDataset(Dataset):
    """SROIE dataset."""

    def __init__(self, annotations, image_file_names, processor=None, max_length=512):
        """
        Args:
            annotations (List[List]): List of lists containing the word-level annotations (words, labels, boxes).
            image_file_names (List[str]): Paths of the document images.
            processor (LayoutLMv2Processor): Processor to prepare the text + image.
        """
        self.words, self.labels, self.boxes = annotations
        self.image_file_names = image_file_names
        self.processor = processor

    def __len__(self):
        return len(self.image_file_names)

    def __getitem__(self, idx):
        # first, take an image
        item = self.image_file_names[idx]
        image = Image.open(item).convert("RGB")

        # get word-level annotations
        words = self.words[idx]
        boxes = self.boxes[idx]
        word_labels = self.labels[idx]

        assert len(words) == len(boxes) == len(word_labels)

        # label2id must be visible here (it is created in main() below)
        word_labels = [label2id[label] for label in word_labels]
        # use processor to prepare everything
        encoded_inputs = self.processor(image, words, boxes=boxes, word_labels=word_labels,
                                        padding="max_length", truncation=True,
                                        return_tensors="pt")

        # remove batch dimension
        for k, v in encoded_inputs.items():
            encoded_inputs[k] = v.squeeze()

        print(encoded_inputs.input_ids.shape)
        assert encoded_inputs.input_ids.shape == torch.Size([512])
        assert encoded_inputs.attention_mask.shape == torch.Size([512])
        # this assert fails for LayoutXLM: the processor returns no token_type_ids
        assert encoded_inputs.token_type_ids.shape == torch.Size([512])
        assert encoded_inputs.bbox.shape == torch.Size([512, 4])
        assert encoded_inputs.image.shape == torch.Size([3, 224, 224])
        assert encoded_inputs.labels.shape == torch.Size([512])
        return encoded_inputs
```

```python
def main():
    train_file = xxx
    test_file = xxx
    train, train_flist = file_deserialize(train_file)
    test, test_flist = file_deserialize(test_file)

    all_labels = [item for sublist in train[1] for item in sublist] + [item for sublist in test[1] for item in sublist]
    Counter(all_labels)
    labels = list(set(all_labels))
    print(labels)

    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for idx, label in enumerate(labels)}
    print(label2id)
    print(id2label)

    # processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", revision="no_ocr")
    processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base", apply_ocr=False)
    train_dataset = SROIEDataset(annotations=train,
                                 image_file_names=train_flist,
                                 processor=processor)
    for t in train_dataset:
        pass
    test_dataset = SROIEDataset(annotations=test,
                                image_file_names=test_flist,
                                processor=processor)

    train_dataloader = DataLoader(train_dataset, batch_size=2, shuffle=True)
    test_dataloader = DataLoader(test_dataset, batch_size=2)

    # model = LayoutLMv2ForTokenClassification.from_pretrained('microsoft/layoutlmv2-base-uncased', num_labels=len(labels))
    model = LayoutLMv2ForTokenClassification.from_pretrained('microsoft/layoutxlm-base', num_labels=len(labels))
```
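For the variable-length outputs, here is a minimal sketch of a possible workaround, assuming the cause is that `padding="max_length"` has no effective target length for this checkpoint (i.e. the tokenizer's `model_max_length` is not set) and that the missing `token_type_ids` is expected, since LayoutXLM uses an XLM-RoBERTa-style tokenizer. This is a sketch, not a fix confirmed in this thread; it would replace the encoding block inside `__getitem__` above:

```python
# pass an explicit max_length so padding/truncation have a concrete target
# (assumption: the checkpoint does not define model_max_length)
encoded_inputs = self.processor(image, words, boxes=boxes, word_labels=word_labels,
                                padding="max_length", truncation=True,
                                max_length=512, return_tensors="pt")

# remove the batch dimension added by return_tensors="pt"
encoded_inputs = {k: v.squeeze(0) for k, v in encoded_inputs.items()}

# check only the keys the LayoutXLM processor actually produces;
# token_type_ids is deliberately absent from this list
expected_shapes = {
    "input_ids": torch.Size([512]),
    "attention_mask": torch.Size([512]),
    "bbox": torch.Size([512, 4]),
    "image": torch.Size([3, 224, 224]),
    "labels": torch.Size([512]),
}
for key, shape in expected_shapes.items():
    assert encoded_inputs[key].shape == shape

return encoded_inputs
```

With every sample padded to the same length, the default DataLoader collate can stack the tensors into batches.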
yellowjs0304 commented 2 years ago

Hi, @guoxiaolu Did you fix the problem?

guoxiaolu commented 2 years ago

> Hi, @guoxiaolu Did you fix the problem?

no...

NielsRogge commented 2 years ago

Hi,

The error you are getting has been fixed; the fix will be included in the next release (which comes out today).
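Once that release is out, a quick sanity check (a sketch, assuming the fix ships as mentioned; the `num_labels` value below is a placeholder):

```python
import transformers
print(transformers.__version__)  # should be the new release or later

from transformers import LayoutLMv2ForTokenClassification

# placeholder label count; use len(labels) from your own label set
model = LayoutLMv2ForTokenClassification.from_pretrained("microsoft/layoutxlm-base", num_labels=5)
```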

yellowjs0304 commented 2 years ago

@NielsRogge I think there is still a problem: the sizes of the processor outputs don't match each other.
Also, `encoded_inputs` still doesn't have the `token_type_ids` key. If this has been fixed, do I need to modify something in the custom dataset?
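For reference, a minimal sketch of a forward pass without `token_type_ids`, assuming the key is absent by design (LayoutXLM's tokenizer is XLM-RoBERTa-based, and LayoutLMv2-family models create zero-valued `token_type_ids` internally when none are passed); `train_dataloader` and `model` here are the objects from the script above:

```python
batch = next(iter(train_dataloader))

# token_type_ids is omitted on purpose: the LayoutXLM processor does not
# return it, and the model defaults it to zeros internally
outputs = model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    bbox=batch["bbox"],
    image=batch["image"],
)
```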

I'm using the versions below:

```
transformers 4.18.0.dev0
```

```python
processor = LayoutXLMProcessor.from_pretrained("microsoft/layoutxlm-base", apply_ocr=False)
model = LayoutLMv2Model.from_pretrained("microsoft/layoutxlm-base", num_labels=len(labels))
```

I didn't define any LayoutXLMTokenizer or feature extractor.

P.S. Did you (@guoxiaolu) fix it? If so, could you please tell me how?