Tokenizer does not support boxes as input

logan-markewich commented 1 year ago

It looks like the tokenizer extends the XLNet tokenizer.

However, the XLNet tokenizer doesn't accept boxes as input. With LayoutLM and LiLT, I've been relying on the tokenizer's boxes argument to automatically align my boxes with the tokenized inputs. It's a pain to do manually haha

Is there any way this could be supported?

Sample input:

tokenized_inputs = self.tokenizer(
    doc_tokens,
    boxes=boxes,
    word_labels=labels,
    truncation=True,
    padding="max_length",
    max_length=self.max_length,
    stride=self.doc_stride,
    return_overflowing_tokens=True,
    return_tensors="pt",
)

Current error:

  File "/home/ysi.yardi.com/lm30640/projects/Invoice_OCR_Engine/venv/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2523, in __call__
    encodings = self._call_one(text=text, text_pair=text_pair, **all_kwargs)
  File "/home/ysi.yardi.com/lm30640/projects/Invoice_OCR_Engine/venv/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2626, in _call_one
    **kwargs,
  File "/home/ysi.yardi.com/lm30640/projects/Invoice_OCR_Engine/venv/lib/python3.7/site-packages/transformers/tokenization_utils_base.py", line 2817, in batch_encode_plus
    **kwargs,
TypeError: _batch_encode_plus() got an unexpected keyword argument 'boxes'
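
For reference, the manual alignment mentioned above can be sketched in plain Python. This is only an illustration, not the repository's implementation: the helper name align_to_tokens is hypothetical, and it assumes you already have a per-token list of word indices (None for special and padding tokens), such as the one a Hugging Face fast tokenizer returns from BatchEncoding.word_ids() when called with is_split_into_words=True. Whether a fast tokenizer is available for this model is also an assumption.

```python
from typing import List, Optional, Sequence, Tuple

Box = Sequence[int]


def align_to_tokens(
    word_ids: Sequence[Optional[int]],
    word_boxes: Sequence[Box],
    word_labels: Sequence[int],
    pad_box: Box = (0, 0, 0, 0),
    ignore_label: int = -100,
) -> Tuple[List[List[int]], List[int]]:
    """Expand word-level boxes and labels to token level (hypothetical helper).

    word_ids holds, for each token, the index of the word it came from,
    or None for special/padding tokens.
    """
    token_boxes: List[List[int]] = []
    token_labels: List[int] = []
    previous: Optional[int] = None
    for word_id in word_ids:
        if word_id is None:
            # Special or padding token: dummy box, label ignored by the loss
            token_boxes.append(list(pad_box))
            token_labels.append(ignore_label)
        else:
            # Every sub-token inherits the box of its source word
            token_boxes.append(list(word_boxes[word_id]))
            # Only the first sub-token of a word keeps the word's label
            is_first = word_id != previous
            token_labels.append(word_labels[word_id] if is_first else ignore_label)
        previous = word_id
    return token_boxes, token_labels


# Made-up example: two words, the second split into two sub-tokens
word_ids = [None, 0, 1, 1, None]
boxes = [[10, 10, 50, 20], [60, 10, 120, 20]]
labels = [3, 7]
token_boxes, token_labels = align_to_tokens(word_ids, boxes, labels)
```

Labelling only the first sub-token of each word is a design choice made for this sketch; continuation tokens could be given the word's label instead.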

NormXU commented 1 year ago

Sure, I will see what I can do.

logan-markewich commented 1 year ago

Awesome! Thanks again for your work on this 🙏🙏💪💪

logan-markewich commented 1 year ago

@NormXU any luck with this? 👀

NormXU commented 1 year ago

Sorry for the late update. I hope this commit can help.

NormXU / ERNIE-Layout-Pytorch