```bash
git lfs install
git clone https://huggingface.co/Norm/ERNIE-Layout-Pytorch
```

I tried to implement an `ErnieLayoutTokenizer` by overriding `XLNetTokenizer`, but the outcome seems totally wrong to me. I will update the repo once I successfully implement a correct `ErnieLayoutTokenizer`.

The following code gives the same tokenization results on a subset of the GLUE dataset.
`do_tokenize_postprocess` needs to be set to `False` in the paddlenlp tokenizer. Otherwise, postprocessing will remove single `_` tokens (a small sketch of this effect follows the script below).
```python
import numpy as np
import pandas as pd
from datasets import load_dataset
from paddlenlp.transformers import AutoTokenizer as PaddleNlpAutoTokenizer, ErnieLayoutTokenizer
from tqdm import tqdm
from transformers import AutoTokenizer, XLNetTokenizerFast

tokenizer_uie_x_base: ErnieLayoutTokenizer = PaddleNlpAutoTokenizer.from_pretrained("uie-x-base")
tokenizer_uie_x_base.do_tokenize_postprocess = False

# path_to_downloaded_transformer_model_weights contains the files from
# https://huggingface.co/Norm/ERNIE-Layout-Pytorch/tree/main
# Set "model_type": "xlnet" in the corresponding config.json.
# Also, remove max_position_embeddings from config.json; otherwise you will get an error.
path_to_downloaded_transformer_model_weights = "ernie_layout_tokenizer"
tokenizer_hf: XLNetTokenizerFast = AutoTokenizer.from_pretrained(path_to_downloaded_transformer_model_weights)

dataset = load_dataset("glue", "ax")
df = pd.DataFrame(dataset)
texts = [row.values[0]['premise'] for idx, row in df.iterrows()]

# Print diagnostics for every text where the two tokenizers disagree.
for idx, text in tqdm(enumerate(texts)):
    if tokenizer_uie_x_base(text)['input_ids'] != tokenizer_hf(text)['input_ids']:
        print("-" * 100)
        print(np.array(tokenizer_hf.tokenize(text)))
        print("*" * 100)
        print(np.array(tokenizer_uie_x_base.tokenize(text)))
        print(np.array(text.split(" ")))
        print(np.array(tokenizer_uie_x_base(text)['input_ids']))
        print(np.array(tokenizer_hf(text)['input_ids']))
        print("=" * 100)
```
The same check on the multilingual amazon_reviews_multi dataset:

```python
import numpy as np
from datasets import load_dataset
from paddlenlp.transformers import AutoTokenizer as PaddleNlpAutoTokenizer, ErnieLayoutTokenizer
from tqdm import tqdm
from transformers import AutoTokenizer, XLNetTokenizerFast

tokenizer_uie_x_base: ErnieLayoutTokenizer = PaddleNlpAutoTokenizer.from_pretrained("uie-x-base")
tokenizer_uie_x_base.do_tokenize_postprocess = False

# path_to_downloaded_transformer_model_weights contains the files from
# https://huggingface.co/Norm/ERNIE-Layout-Pytorch/tree/main
# Set "model_type": "xlnet" in the corresponding config.json.
# Also, remove max_position_embeddings from config.json; otherwise you will get an error.
path_to_downloaded_transformer_model_weights = "ernie_layout_tokenizer"
tokenizer_hf: XLNetTokenizerFast = AutoTokenizer.from_pretrained(path_to_downloaded_transformer_model_weights)

dataset = load_dataset("amazon_reviews_multi", "all_languages")['train']
print(dataset[0])
print(len(dataset))

# available configs: "all_languages", "de", "en", "es", "fr", "ja", "zh"
wrong_dict = dict()
total_dict = dict()
debug = False

# Count tokenization mismatches per language.
for idx, input in tqdm(enumerate(dataset), total=len(dataset)):
    total_dict[input["language"]] = total_dict.get(input["language"], 0) + 1
    text = input['review_body']
    if tokenizer_uie_x_base(text)['input_ids'] != tokenizer_hf(text)['input_ids']:
        wrong_dict[input["language"]] = wrong_dict.get(input["language"], 0) + 1
        if debug:
            print("-" * 100)
            print(np.array(tokenizer_hf.tokenize(text)))
            print("*" * 100)
            print(np.array(tokenizer_uie_x_base.tokenize(text)))
            print(np.array(text.split(" ")))
            print(np.array(tokenizer_uie_x_base(text)['input_ids']))
            print(np.array(tokenizer_hf(text)['input_ids']))
            print("=" * 100)

for key in wrong_dict:
    print(f"Language: {key}")
    print(f"Total: {total_dict[key]}")
    print(f"Wrong total: {wrong_dict[key]}")
    print(f"Wrong relative: {wrong_dict[key] / total_dict[key]}")
    print("-" * 100)
```
This gets:

```
Language: de
Total: 200000
Wrong total: 932
Wrong relative: 0.00466
----------------------------------------------------------------------------------------------------
Language: en
Total: 200000
Wrong total: 594
Wrong relative: 0.00297
----------------------------------------------------------------------------------------------------
Language: es
Total: 200000
Wrong total: 555
Wrong relative: 0.002775
----------------------------------------------------------------------------------------------------
Language: fr
Total: 200000
Wrong total: 998
Wrong relative: 0.00499
----------------------------------------------------------------------------------------------------
Language: ja
Total: 200000
Wrong total: 521
Wrong relative: 0.002605
----------------------------------------------------------------------------------------------------
Language: zh
Total: 200000
Wrong total: 1065
Wrong relative: 0.005325
----------------------------------------------------------------------------------------------------
```
The reason for this is that text with repeated characters, such as `"This is the best movie !!!!"`, is split differently: `"!!!!"` gets tokenized either into `["!!!", "!"]` or into `["!", "!!!"]`. For all practical purposes, this shouldn't matter.
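To see one such case, reusing the two tokenizers from the scripts above (the exact splits are model-dependent; the comments reflect the observation above):

```python
text = "This is the best movie !!!!"
print(tokenizer_hf.tokenize(text))          # "!!!!" may come out as e.g. ['!', '!!!']
print(tokenizer_uie_x_base.tokenize(text))  # ...or as e.g. ['!!!', '!']
```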
Changing to

```python
import re

text = input['review_body']
# collapse runs of repeated punctuation into a single character
for char in ".,?!*":
    text = re.sub(re.escape(char) + "+", char, text)
```
gives exactly 1 tokenization difference in Japanese.
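Packaged as a small helper (same logic as the snippet above; the function name is mine):

```python
import re

def collapse_repeated_punct(text: str, chars: str = ".,?!*") -> str:
    """Collapse runs of repeated punctuation into a single character."""
    for char in chars:
        text = re.sub(re.escape(char) + "+", char, text)
    return text

assert collapse_repeated_punct("This is the best movie !!!!") == "This is the best movie !"
```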
@NormXU I have an additional question: how did you obtain the corresponding tokenizer files you shared on Hugging Face?
I simply downloaded the file from the paddle repo.
Your repo also has an additional `tokenizer.json` file that is needed to load fast tokenizers. I wonder how you were able to create this file. I tried to recreate the file by slow -> fast tokenizer conversion, but I encountered some issues along the way: PaddleNLP uses a sentencepiece tokenizer, but adds an offset of 1 to the vocab (see here). In addition, it modifies the vocab by

```python
self.tokens_to_ids = {"[CLS]": 0, "[PAD]": 1, "[SEP]": 2, "[UNK]": 3}
```
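In other words, the mapping amounts to something like the sketch below; `token_to_id` and `sp_model` are illustrative names, not paddlenlp API:

```python
# Illustrative sketch of the id mapping described above: special tokens are
# pinned to fixed ids, everything else is the sentencepiece id shifted by 1.
special_tokens_to_ids = {"[CLS]": 0, "[PAD]": 1, "[SEP]": 2, "[UNK]": 3}

def token_to_id(token, sp_model):
    if token in special_tokens_to_ids:
        return special_tokens_to_ids[token]
    return sp_model.piece_to_id(token) + 1  # offset of 1 relative to raw sentencepiece ids
```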
I make use of the `BertTokenizerFast` class from the transformers library. Suppose all the configuration files you downloaded from PaddleNLP are saved under `pretrained_path`: there should be a `tokenizer_config.json`, a `vocab.txt`, a `sentencepiece.bpe.model`, and a `special_tokens_map.json`. These are the files related to tokenizer initialization. Then, let's use `BertTokenizerFast` from transformers:

```python
from transformers import BertTokenizerFast

torch_tokenizer = BertTokenizerFast.from_pretrained(pretrained_path)
torch_tokenizer.save_pretrained(pretrained_path)
```

Now you will find `tokenizer.json` under `pretrained_path`.
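As a sanity check (assuming the steps above succeeded), the saved directory should now load as a fast tokenizer:

```python
from transformers import AutoTokenizer

fast_tokenizer = AutoTokenizer.from_pretrained(pretrained_path)
print(fast_tokenizer.is_fast)  # True once tokenizer.json is present
```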
Thanks! I will close this issue.
Thanks for the port of ERNIE-Layout to torch! I have two questions regarding the `ernie-layoutx-base-uncased` conversion:

1. How did you obtain the files shared in your repo (Norm/ERNIE-Layout-Pytorch)?
2. Can `ErnieLayoutTokenizer` be replaced by a fast tokenizer? E.g. use the XLNet tokenizer (which uses a sentencepiece tokenizer as well) instead. For my current pipeline, I need some attributes/methods that are currently not supported by `ErnieLayoutTokenizer`.