PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: Is the char_to_token method of the BatchEncoding class still unusable? #3990

Closed Li-fAngyU closed 1 year ago

Li-fAngyU commented 1 year ago

Please describe your question

Source code location

Example:

from paddlenlp.transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = tokenizer('asd. asd. aaa')
tokenized.char_to_token(0)

Error message:

    585         if not self._encodings:
    586             raise ValueError(
--> 587                 "char_to_token() is not available when using Python based tokenizers"
    588             )
    589         if char_index is not None:

ValueError: char_to_token() is not available when using Python based tokenizers

Could you provide a way to implement this? (Ideally aligned with char_to_token in huggingface/transformers.)

Li-fAngyU commented 1 year ago

A follow-up question: how do I convert tokenized['input_ids'] back into a string?

from paddlenlp.transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = tokenizer('asd. asd. aaa')
print(tokenized['input_ids'])
#[101, 2004, 2094, 1012, 2004, 2094, 1012, 13360, 102]

The expectation is to map [101, 2004, 2094, 1012, 2004, 2094, 1012, 13360, 102] back to 'asd. asd. aaa'. Do I need to convert it through tokenizer.get_vocab(), or does tokenized have an API that does this directly?

wj-Mcat commented 1 year ago

The tokenizer offers many basic methods for developers, such as encode, tokenize, and convert_tokens_to_string; for details see: tokenizer of paddlenlp.

Example code for your scenario is shown below:

from paddlenlp.transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = 'asd. asd. aaa'

feature = tokenizer(sentence, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(feature['input_ids'])
string = tokenizer.convert_tokens_to_string(tokens)
print(string)

However, the current result for string is 'asd . asd . aaa'. This result is aligned with huggingface's BertTokenizer; the test code is shown below:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence = 'asd. asd. aaa'

feature = tokenizer(sentence, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(feature['input_ids'])
string = tokenizer.convert_tokens_to_string(tokens)
assert string == 'asd . asd . aaa'
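
(As an aside, a minimal sketch of a shorter round trip, assuming the tokenizer's decode method behaves as in recent releases of both libraries; whether the periods come back with or without a leading space depends on the clean_up_tokenization_spaces behaviour, so the exact output below is an assumption rather than something confirmed in this thread:)

from paddlenlp.transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
feature = tokenizer('asd. asd. aaa', add_special_tokens=False)
# decode() maps ids straight back to a string in one step
print(tokenizer.decode(feature['input_ids'], skip_special_tokens=True))
# with the default clean-up of tokenization spaces this is expected to print 'asd. asd. aaa';
# with clean_up_tokenization_spaces=False it would keep 'asd . asd . aaa'
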
wj-Mcat commented 1 year ago

After more thorough testing I found that paddlenlp's BertTokenizer is aligned with hf's BertTokenizer, but it cannot be aligned with hf's BertTokenizerFast. The hf test results are as follows:

def test_tokenizer():
    from transformers import BertTokenizer  ## non-Fast (Python) tokenizer
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    sentence = 'asd. asd. aaa'

    feature = tokenizer(sentence, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(feature['input_ids'])
    string = tokenizer.convert_tokens_to_string(tokens)
    assert string == 'asd . asd . aaa'

    from transformers import AutoTokenizer   ## Fast tokenizer
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    sentence = 'asd. asd. aaa'

    feature = tokenizer(sentence, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(feature['input_ids'])
    string = tokenizer.convert_tokens_to_string(tokens)
    assert string == sentence

test_tokenizer()
Li-fAngyU commented 1 year ago

Thank you very much for the explanation! My main requirement is the char_to_token API. Is this API usable in paddlenlp?

Li-fAngyU commented 1 year ago

> After more thorough testing I found that paddlenlp's BertTokenizer is aligned with hf's BertTokenizer, but it cannot be aligned with hf's BertTokenizerFast. The hf test results are as follows: [...]

Shouldn't assert string == 'asd . asd . aaa' here be assert string == 'asd. asd. aaa' instead?

wj-Mcat commented 1 year ago

The tests above do pass; you can use those assert statements to see where hf's tokenizers differ in the details here.

Li-fAngyU commented 1 year ago

> The tests above do pass; you can use those assert statements to see where hf's tokenizers differ in the details here.

Oh, I see what you mean now; yes, the tests above do pass. So paddlenlp's BertTokenizer is aligned with hf's non-Fast tokenizer, right?

wj-Mcat commented 1 year ago

@Li-fAngyU Yes. As for char_to_token, the method is basically no longer maintained, so we suggest implementing what you need with other APIs; if anything is unclear, feel free to discuss it here.

The publicly exposed convert_* family of methods should basically cover your special processing needs; if they do not, let us know and we will provide a solution.

Li-fAngyU commented 1 year ago

Great, thank you very much! I am currently reproducing a paper with paddle, and the reference code uses char_to_token quite a lot, so I wanted to see whether an aligned API could be obtained directly from paddlenlp. (I also cannot guarantee that a char_to_token I write myself would align with hf's char_to_token in every scenario.)

Li-fAngyU commented 1 year ago

Hello, while trying to implement char_to_token I ran into the following case, and I am not sure how to handle it.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
test_string = 'elephant hydrant'
tokenized = tokenizer(test_string)
res = []
for i in range(len(test_string)):
    res.append(tokenized.char_to_token(i))
print(res)
print(test_string)

Output:

[1, 1, 1, 1, 1, 1, 1, 1, None, 2, 2, 2, 2, 2, 3, 3]
elephant hydrant

For the word hydrant, hf's char_to_token maps it onto two tokens, apparently because hydra is itself a word (the mythological hydra) while hydrant (fire hydrant) is split into a sub-word pair; checking tokenizer.get_vocab(), 'hydrant' is not in the vocab.

How should this case be handled?
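
(One possible way to handle such sub-word splits, sketched here rather than taken from this thread, is to build the char-to-token map from offset mappings instead of from the token strings. hf fast tokenizers return them via return_offsets_mapping=True; PaddleNLP tokenizers also accept a return_offsets_mapping argument, though treating the two as equivalent here is an assumption.)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
test_string = 'elephant hydrant'
tokenized = tokenizer(test_string, return_offsets_mapping=True)

# offset_mapping[i] is the (start, end) character span covered by token i;
# special tokens such as [CLS] and [SEP] get (0, 0), so they claim no characters.
char_to_token = [None] * len(test_string)
for token_index, (start, end) in enumerate(tokenized['offset_mapping']):
    for char_index in range(start, end):
        char_to_token[char_index] = token_index

print(char_to_token)
# expected to match the output above:
# [1, 1, 1, 1, 1, 1, 1, 1, None, 2, 2, 2, 2, 2, 3, 3]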

Li-fAngyU commented 1 year ago

Thanks for the explanation; the char_to_token problem has been solved.

wj-Mcat commented 1 year ago

> Thanks for the explanation; the char_to_token problem has been solved.

I am curious how you solved it. Could you share your solution, so that other developers who hit a similar problem can benefit from it?

Looking forward to your sharing, and many thanks.

Li-fAngyU commented 1 year ago

Sure, and sorry for only seeing this now. My solution may only apply to the scenario I am working with. The idea: follow the mapping pattern of hf's char_to_token to build the corresponding char-to-token mapping list. Compared with hf, it additionally needs the tokenizer to be passed in:

def get_char_to_token_list(tokenized, tokenizer):
    tokens = tokenizer.convert_ids_to_tokens(tokenized['input_ids'])
    count = 1
    seq = []
    last_char = '.'
    for i in tokens:
        if '[CLS]' in i or '[SEP]' in i:
            continue
        elif '#' in i:   # a '##' prefix means this token is a sub-word continuation
            for j in range(len(i)-2):
                seq.append(count)
        elif '.' in i:
            seq.append(count)
            seq.append(None)  # None stands for the space
        else:
            if last_char != '.':  # check whether this token starts a new phrase; in my case the character '.' separates the phrases, e.g. "boat. traffic light. motorcycle"
                seq.append(None)
            for j in range(len(i)):
                seq.append(count)
        last_char = i
        count += 1
    return seq

char_to_token_list = get_char_to_token_list(tokenized, tokenizer)
# char_to_token_list[i] <--> hf tokenized.char_to_token(i)

In the end, char_to_token_list[i] is equivalent to hf's tokenized.char_to_token(i). I hope this helps.
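
(A quick way to sanity-check a helper like this, sketched here with the two test strings from this thread: build the mapping from the paddlenlp tokenizer and compare it position by position against hf's char_to_token. This assumes get_char_to_token_list from the snippet above is already defined in the same session.)

from paddlenlp.transformers import AutoTokenizer as PDAutoTokenizer
from transformers import AutoTokenizer as HFAutoTokenizer

pd_tokenizer = PDAutoTokenizer.from_pretrained('bert-base-uncased')
hf_tokenizer = HFAutoTokenizer.from_pretrained('bert-base-uncased')

for test_string in ['asd. asd. aaa', 'elephant hydrant']:
    pd_tokenized = pd_tokenizer(test_string)
    hf_tokenized = hf_tokenizer(test_string)
    mapping = get_char_to_token_list(pd_tokenized, pd_tokenizer)
    for i in range(len(test_string)):
        # any mismatch points at a case the helper does not cover yet
        assert mapping[i] == hf_tokenized.char_to_token(i), (test_string, i)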

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.