tokenizer can not return offset_map

NormXU / ERNIE-Layout-Pytorch

An unofficial PyTorch implementation of ERNIE-Layout, which was originally released through PaddleNLP.

http://arxiv.org/abs/2210.06155

MIT License


tokenizer can not return offset_map #4

Closed: WallE-Chang closed this issue 1 year ago

WallE-Chang commented 1 year ago

Hi, I read your tokenizer code, which is a subclass of PretrainedTokenizer. But the PretrainedTokenizer in paddlenlp is more similar to PreTrainedTokenizerFast in transformers, which means the paddlenlp tokenizer can return offsets. The call looks like this:

content_encoded_inputs = tokenizer(
  text=[prompt],
  text_pair=[this_text_line],
  max_seq_len=max_seq_len,
  return_dict=False,
  return_offsets_mapping=True,
)

WallE-Chang commented 1 year ago

I found the reason. The prepare_for_model function of PreTrainedTokenizerBase differs between paddlenlp and transformers: prepare_for_model in paddlenlp returns offsets, but the one in transformers doesn't. This is how paddlenlp handles offsets: https://github.com/PaddlePaddle/PaddleNLP/blob/develop/paddlenlp/transformers/tokenizer_utils_base.py#L2764
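
To make clear what "offset" means here, the following is a minimal sketch of an offset mapping: for each token, the (start, end) character span it covers in the original text. This is not the paddlenlp or transformers implementation. `compute_offsets` is a hypothetical helper, and it assumes every token appears verbatim and in order in the text, which real subword tokenizers do not guarantee.

```python
from typing import List, Tuple


def compute_offsets(text: str, tokens: List[str]) -> List[Tuple[int, int]]:
    """Map each token to a (start, end) character span in `text`.

    Hypothetical helper for illustration only. Assumes each token occurs
    verbatim in `text`, in order; real tokenizers must also account for
    normalization and subword markers.
    """
    offsets = []
    cursor = 0  # search resumes after the previous token's end
    for token in tokens:
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


text = "total amount due"
tokens = ["total", "amount", "due"]
offsets = compute_offsets(text, tokens)

# Slicing the text with each span recovers the token it came from.
recovered = [text[start:end] for start, end in offsets]
```

A tokenizer that supports return_offsets_mapping=True hands back spans of this kind alongside the input IDs, which is what lets each token be traced back to its position in the source text.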

NormXU commented 1 year ago

Fixed in commit d827f5cac3d6973738e233bf0381e4b374dd6c3f.