NormXU / ERNIE-Layout-Pytorch

An unofficial PyTorch implementation of ERNIE-Layout, which was originally released through PaddleNLP.
http://arxiv.org/abs/2210.06155
MIT License

Tokenizer for ernie-layoutx-base-uncased #5

Closed maxjeblick closed 1 year ago

maxjeblick commented 1 year ago

Thanks for the port of ERNIE-Layout to torch!

I have two questions regarding the ernie-layoutx-base-uncased conversion:

NormXU commented 1 year ago
maxjeblick commented 1 year ago

The following code gives the same tokenization results on a subset of the GLUE dataset. do_tokenize_postprocess needs to be set to False in the paddlenlp tokenizer; otherwise, postprocessing will remove single _ tokens.

import numpy as np
from datasets import load_dataset
from paddlenlp.transformers import AutoTokenizer as PaddleNlpAutoTokenizer, ErnieLayoutTokenizer
from tqdm import tqdm
from transformers import AutoTokenizer, XLNetTokenizerFast

tokenizer_uie_x_base: ErnieLayoutTokenizer = PaddleNlpAutoTokenizer.from_pretrained("uie-x-base")
tokenizer_uie_x_base.do_tokenize_postprocess = False

# path_to_downloaded_transformer_model_weights contains the files from
# https://huggingface.co/Norm/ERNIE-Layout-Pytorch/tree/main
# Set "model_type" to "xlnet" in the corresponding config.json.
# Also remove max_position_embeddings from config.json, otherwise loading
# raises an error (see the config-editing sketch after this script).

path_to_downloaded_transformer_model_weights = "ernie_layout_tokenizer"
tokenizer_hf: XLNetTokenizerFast = AutoTokenizer.from_pretrained(path_to_downloaded_transformer_model_weights)

# "ax" only ships a test split; compare both tokenizers on its premises.
dataset = load_dataset("glue", "ax")["test"]
texts = [example["premise"] for example in dataset]

# Print a side-by-side diff whenever the two tokenizers disagree.
for idx, text in tqdm(enumerate(texts)):
    if tokenizer_uie_x_base(text)['input_ids'] != tokenizer_hf(text)['input_ids']:
        print("-" * 100)

        print(np.array(tokenizer_hf.tokenize(text)))
        print("*" * 100)
        print(np.array(tokenizer_uie_x_base.tokenize(text)))
        print(np.array(text.split(" ")))
        print(np.array(tokenizer_uie_x_base(text)['input_ids']))
        print(np.array(tokenizer_hf(text)['input_ids']))

        print("=" * 100)
maxjeblick commented 1 year ago
import numpy as np
from datasets import load_dataset
from paddlenlp.transformers import AutoTokenizer as PaddleNlpAutoTokenizer, ErnieLayoutTokenizer
from tqdm import tqdm
from transformers import AutoTokenizer, XLNetTokenizerFast

tokenizer_uie_x_base: ErnieLayoutTokenizer = PaddleNlpAutoTokenizer.from_pretrained("uie-x-base")
tokenizer_uie_x_base.do_tokenize_postprocess = False

# path_to_downloaded_transformer_model_weights contains the files from
# https://huggingface.co/Norm/ERNIE-Layout-Pytorch/tree/main
# config.json is edited as in the previous script: "model_type" set to
# "xlnet" and max_position_embeddings removed.

path_to_downloaded_transformer_model_weights = "ernie_layout_tokenizer"
tokenizer_hf: XLNetTokenizerFast = AutoTokenizer.from_pretrained(path_to_downloaded_transformer_model_weights)

dataset = load_dataset("amazon_reviews_multi", "all_languages")['train']
print(dataset[0])
print(len(dataset))
# "all_languages", "de", "en", "es", "fr", "ja", "zh"

wrong_dict = dict()
total_dict = dict()
debug = False

# Count, per language, how many reviews tokenize differently.
for idx, input in tqdm(enumerate(dataset), total=len(dataset)):
    total_dict[input["language"]] = total_dict.get(input["language"], 0) + 1

    text = input['review_body']
    if tokenizer_uie_x_base(text)['input_ids'] != tokenizer_hf(text)['input_ids']:
        wrong_dict[input["language"]] = wrong_dict.get(input["language"], 0) + 1
        if debug:
            print("-" * 100)

            print(np.array(tokenizer_hf.tokenize(text)))
            print("*" * 100)
            print(np.array(tokenizer_uie_x_base.tokenize(text)))
            print(np.array(text.split(" ")))
            print(np.array(tokenizer_uie_x_base(text)['input_ids']))
            print(np.array(tokenizer_hf(text)['input_ids']))

            print("=" * 100)

for key in wrong_dict:
    print(f"Language: {key}")
    print(f"Total: {total_dict[key]}")
    print(f"Wrong total: {wrong_dict[key]}")
    print(f"Wrong relative: {wrong_dict[key] / total_dict[key]}")
    print("-" * 100)

which prints:

Language: de
Total: 200000
Wrong total: 932
Wrong relative: 0.00466
----------------------------------------------------------------------------------------------------
Language: en
Total: 200000
Wrong total: 594
Wrong relative: 0.00297
----------------------------------------------------------------------------------------------------
Language: es
Total: 200000
Wrong total: 555
Wrong relative: 0.002775
----------------------------------------------------------------------------------------------------
Language: fr
Total: 200000
Wrong total: 998
Wrong relative: 0.00499
----------------------------------------------------------------------------------------------------
Language: ja
Total: 200000
Wrong total: 521
Wrong relative: 0.002605
----------------------------------------------------------------------------------------------------
Language: zh
Total: 200000
Wrong total: 1065
Wrong relative: 0.005325
----------------------------------------------------------------------------------------------------

The reason for this is that text with repeated characters, such as "This is the best movie !!!!", is split differently: "!!!!" is tokenized either as ["!!!", "!"] or as ["!", "!!!"]. For all practical purposes, this shouldn't matter.
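
A quick sketch for reproducing the disagreement, using the two tokenizers instantiated in the scripts above (the exact splits are illustrative, not guaranteed):

# Tokenize the example sentence with both tokenizers and compare the runs.
text = "This is the best movie !!!!"
print(tokenizer_hf.tokenize(text))          # the "!!!!" run may end up as ['!!!', '!']
print(tokenizer_uie_x_base.tokenize(text))  # ... or as ['!', '!!!']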

Changing the loop body to

    text = input['review_body']
    for char in ".,?!*":
        # collapse runs such as "!!!!" into a single "!" before tokenizing
        text = re.sub(f'\\{char}+', char, text)

(with import re added at the top of the script) gives exactly one remaining tokenization difference, in Japanese.

maxjeblick commented 1 year ago

@NormXU I have an additional question: how did you obtain the corresponding tokenizer files you shared on Hugging Face?

NormXU commented 1 year ago

@NormXU I have an additional question: how did you obtain the corresponding tokenizer files you shared on Hugging Face?

I simply downloaded the files from the Paddle repo.
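
A sketch of one way to fetch them, assuming paddlenlp's usual from_pretrained download-and-cache behavior (not necessarily the exact steps used):

from paddlenlp.transformers import ErnieLayoutTokenizer

# from_pretrained downloads and caches the tokenizer files; save_pretrained
# writes vocab.txt, sentencepiece.bpe.model, tokenizer_config.json and
# special_tokens_map.json into the target directory.
paddle_tokenizer = ErnieLayoutTokenizer.from_pretrained("ernie-layoutx-base-uncased")
paddle_tokenizer.save_pretrained("pretrained_path")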

maxjeblick commented 1 year ago

Your repo also has an additional tokenizer.json file that is needed to load fast tokenizers. I wonder how you were able to create this file. I tried to recreate it by slow -> fast tokenizer conversion, but I encountered some issues along the way.

PaddleNLP uses a SentencePiece tokenizer, but adds an offset of 1 to the vocab (see here). In addition, it modifies the vocab via self.tokens_to_ids = {"[CLS]": 0, "[PAD]": 1, "[SEP]": 2, "[UNK]": 3}
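
To make the id scheme concrete, here is a minimal sketch of that mapping as described above (the helper is hypothetical, not actual PaddleNLP code):

import sentencepiece as spm

# Hard-coded special-token ids, per the snippet above.
TOKENS_TO_IDS = {"[CLS]": 0, "[PAD]": 1, "[SEP]": 2, "[UNK]": 3}
OFFSET = 1  # every regular sentencepiece id is shifted by one

sp = spm.SentencePieceProcessor(model_file="sentencepiece.bpe.model")

def paddle_token_to_id(token: str) -> int:
    # Special tokens use the fixed table; all other pieces get sp_id + OFFSET.
    if token in TOKENS_TO_IDS:
        return TOKENS_TO_IDS[token]
    return sp.piece_to_id(token) + OFFSET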

NormXU commented 1 year ago

I made use of the BertTokenizerFast class from the transformers library.

Suppose all the configuration files you downloaded from PaddleNLP are saved under pretrained_path. There should be a tokenizer_config.json, a vocab.txt, a sentencepiece.bpe.model, and a special_tokens_map.json; these are the files related to tokenizer initialization.

Then let's use BertTokenizerFast from transformers:

from transformers import BertTokenizerFast

pretrained_path = "pretrained_path"  # directory holding the PaddleNLP tokenizer files
torch_tokenizer = BertTokenizerFast.from_pretrained(pretrained_path)
torch_tokenizer.save_pretrained(pretrained_path)

Now you will find a tokenizer.json under pretrained_path.
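
As a sanity check (an addition, not from the original comment), the freshly written tokenizer.json can be reloaded through the generic fast-tokenizer path and compared against the saving tokenizer:

from transformers import AutoTokenizer

reloaded = AutoTokenizer.from_pretrained(pretrained_path)  # picks up tokenizer.json
sample = "This is the best movie !!!!"
assert reloaded(sample)["input_ids"] == torch_tokenizer(sample)["input_ids"]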

maxjeblick commented 1 year ago

Thanks! I will close this issue.