NormXU / ERNIE-Layout-Pytorch

An unofficial PyTorch implementation of ERNIE-Layout, which was originally released through PaddleNLP.
http://arxiv.org/abs/2210.06155
MIT License

A question about tokenizer #21

Closed Jermaine1996 closed 12 months ago

Jermaine1996 commented 12 months ago

Thanks for making this repo,

When I used ErnieLayoutTokenizerFast to tokenize the inputs, I found that the padding tokens were added at the beginning of the sequences, but I expected them to be appended at the end to make all inputs the same length.

I ran the example code as follows:

# assumed import path, following this repo's examples; adjust if the package layout differs
from networks import ErnieLayoutTokenizerFast

tokenizer = ErnieLayoutTokenizerFast.from_pretrained('tokenizer_path')
context = ['This is an example sequence', 'All ocr boxes are inserted into this list']
layout = [[381, 91, 505, 115], [738, 96, 804, 122]]
encoding = tokenizer(text=context, boxes=layout, padding='max_length', max_length=50)
print(encoding['input_ids'])
print(encoding['bbox'])
print(encoding['attention_mask'])

Then I get the following output:

[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 3293, 83, 142, 27781, 40, 944, 3956, 3164, 36, 23150, 16530, 90, 621, 183540, 297, 3934, 903, 5303, 2]

[[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [381, 91, 505, 115], [381, 91, 505, 115], [381, 91, 505, 115], [381, 91, 505, 115], [381, 91, 505, 115], [381, 91, 505, 115], [381, 91, 505, 115], [738, 96, 804, 122], [738, 96, 804, 122], [738, 96, 804, 122], [738, 96, 804, 122], [738, 96, 804, 122], [738, 96, 804, 122], [738, 96, 804, 122], [738, 96, 804, 122], [738, 96, 804, 122], [738, 96, 804, 122], [738, 96, 804, 122], [1000, 1000, 1000, 1000]]

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]

Is there something wrong with the code?

NormXU commented 12 months ago

@Jermaine1996 The tokenizer is initialized with padding_side='left' by default. You can set tokenizer.padding_side = 'right' to make the tokenizer append the padding tokens on the right side of the sequence.
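For example, reusing the context and layout variables from your snippet (a minimal sketch of the fix):

tokenizer.padding_side = 'right'  # pad on the right instead of the left
encoding = tokenizer(text=context, boxes=layout, padding='max_length', max_length=50)
# pad token ids, [0, 0, 0, 0] boxes, and 0 attention-mask entries now appear at the end
print(encoding['input_ids'])
print(encoding['bbox'])
print(encoding['attention_mask'])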

Alternatively, you can use a processor instead:

processor = ErnieLayoutProcessor(image_processor=feature_extractor, tokenizer=tokenizer)
encoding = processor(pil_image, context, boxes=layout, word_labels=labels, return_tensors="pt")
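Here feature_extractor, pil_image, and labels are assumed to be defined already; a hypothetical setup could look like the following (the ErnieLayoutImageProcessor name and the apply_ocr flag follow this repo's examples, so double-check them against the current README):

from PIL import Image
from networks import ErnieLayoutImageProcessor  # assumed import path

feature_extractor = ErnieLayoutImageProcessor(apply_ocr=False)  # words/boxes are supplied manually, so skip built-in OCR
pil_image = Image.open('document.png').convert('RGB')  # placeholder image path
labels = [0, 1]  # placeholder: one label id per word in `context`

The processor wraps the image processor and the tokenizer, so the image, tokens, boxes, and labels come back aligned from a single call.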

Thank you for pointing this out. I will update the examples to set tokenizer.padding_side = 'right' by default.

Jermaine1996 commented 12 months ago

Thanks for responding, it works. :)