PaddlePaddle / PaddleNLP

👑 Easy-to-use and powerful NLP and LLM library with 🤗 Awesome model zoo, supporting wide-range of NLP tasks from research to industrial applications, including 🗂Text Classification, 🔍 Neural Search, ❓ Question Answering, ℹ️ Information Extraction, 📄 Document Intelligence, 💌 Sentiment Analysis etc.
https://paddlenlp.readthedocs.io
Apache License 2.0

[Question]: Is the char_to_token method of the BatchEncoding class still unusable? #3990

Closed Li-fAngyU closed 1 year ago

Li-fAngyU commented 1 year ago

Please describe your question

Source code location

Example:

from paddlenlp.transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = tokenizer('asd. asd. aaa')
tokenized.char_to_token(0)

Error message:

    585         if not self._encodings:
    586             raise ValueError(
--> 587                 "char_to_token() is not available when using Python based tokenizers"
    588             )
    589         if char_index is not None:

ValueError: char_to_token() is not available when using Python based tokenizers

Could you provide a way to implement this? (Ideally aligned with char_to_token in huggingface/transformers.)

Li-fAngyU commented 1 year ago

A follow-up question: how do I convert tokenized['input_ids'] back into a string?

from paddlenlp.transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
tokenized = tokenizer('asd. asd. aaa')
print(tokenized['input_ids'])
#[101, 2004, 2094, 1012, 2004, 2094, 1012, 13360, 102]

The expectation is to map [101, 2004, 2094, 1012, 2004, 2094, 1012, 13360, 102] back to 'asd. asd. aaa'. Do I need to convert it through tokenizer.get_vocab(), or does tokenized have an API that does this directly?

wj-Mcat commented 1 year ago

The tokenizer offers many basic methods for developers, such as encode, tokenize, and convert_tokens_to_string; for details see: tokenizer of paddlenlp.

Example code for your scenario is shown below:

from paddlenlp.transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = 'asd. asd. aaa'

feature = tokenizer(sentence, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(feature['input_ids'])
string = tokenizer.convert_tokens_to_string(tokens)
print(string)

However, the current result for string is 'asd . asd . aaa'. This result is aligned with huggingface's BertTokenizer; the test code is shown below:

from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
sentence = 'asd. asd. aaa'

feature = tokenizer(sentence, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(feature['input_ids'])
string = tokenizer.convert_tokens_to_string(tokens)
assert string == 'asd . asd . aaa'
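
(As an aside, a minimal sketch of a shorter round trip, assuming the tokenizer's decode method behaves as in recent releases of both libraries; whether the periods come back with or without a leading space depends on the clean_up_tokenization_spaces behaviour, so the exact output below is an assumption rather than something confirmed in this thread:)

from paddlenlp.transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
feature = tokenizer('asd. asd. aaa', add_special_tokens=False)
# decode() maps ids straight back to a string in one step
print(tokenizer.decode(feature['input_ids'], skip_special_tokens=True))
# with the default clean-up of tokenization spaces this is expected to print 'asd. asd. aaa';
# with clean_up_tokenization_spaces=False it would keep 'asd . asd . aaa'
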
wj-Mcat commented 1 year ago

After more thorough testing I found that paddlenlp's BertTokenizer is aligned with hf's BertTokenizer, but it cannot be aligned with hf's BertTokenizerFast. The hf test results are as follows:

def test_tokenizer():
    from transformers import BertTokenizer  ## non-Fast (Python) tokenizer
    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    sentence = 'asd. asd. aaa'

    feature = tokenizer(sentence, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(feature['input_ids'])
    string = tokenizer.convert_tokens_to_string(tokens)
    assert string == 'asd . asd . aaa'

    from transformers import AutoTokenizer   ## Fast tokenizer
    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    sentence = 'asd. asd. aaa'

    feature = tokenizer(sentence, add_special_tokens=False)
    tokens = tokenizer.convert_ids_to_tokens(feature['input_ids'])
    string = tokenizer.convert_tokens_to_string(tokens)
    assert string == sentence

test_tokenizer()
Li-fAngyU commented 1 year ago

Thank you very much for the explanation! My main requirement is the char_to_token API. Is this API usable in paddlenlp?

Li-fAngyU commented 1 year ago

> After more thorough testing I found that paddlenlp's BertTokenizer is aligned with hf's BertTokenizer, but it cannot be aligned with hf's BertTokenizerFast. The hf test results are as follows: [...]

Shouldn't assert string == 'asd . asd . aaa' here be assert string == 'asd. asd. aaa' instead?

wj-Mcat commented 1 year ago

The tests above do pass; you can use those assert statements to see where hf's tokenizers differ in the details here.

Li-fAngyU commented 1 year ago

> The tests above do pass; you can use those assert statements to see where hf's tokenizers differ in the details here.

Oh, I see what you mean now; yes, the tests above do pass. So paddlenlp's BertTokenizer is aligned with hf's non-Fast tokenizer, right?

wj-Mcat commented 1 year ago

@Li-fAngyU Yes. As for char_to_token, the method is basically no longer maintained, so we suggest implementing what you need with other APIs; if anything is unclear, feel free to discuss it here.

The publicly exposed convert_* family of methods should basically cover your special processing needs; if they do not, let us know and we will provide a solution.

Li-fAngyU commented 1 year ago

Great, thank you very much! I am currently reproducing a paper with paddle, and the reference code uses char_to_token quite a lot, so I wanted to see whether an aligned API could be obtained directly from paddlenlp. (I also cannot guarantee that a char_to_token I write myself would align with hf's char_to_token in every scenario.)

Li-fAngyU commented 1 year ago

Hello, while trying to implement char_to_token I ran into the following case, and I am not sure how to handle it.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
test_string = 'elephant hydrant'
tokenized = tokenizer(test_string)
res = []
for i in range(len(test_string)):
    res.append(tokenized.char_to_token(i))
print(res)
print(test_string)

Output:

[1, 1, 1, 1, 1, 1, 1, 1, None, 2, 2, 2, 2, 2, 3, 3]
elephant hydrant

For the word hydrant, hf's char_to_token maps it onto two tokens, apparently because hydra is itself a word (the mythological hydra) while hydrant (fire hydrant) is split into a sub-word pair; checking tokenizer.get_vocab(), 'hydrant' is not in the vocab.

How should this case be handled?
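
(One possible way to handle such sub-word splits, sketched here rather than taken from this thread, is to build the char-to-token map from offset mappings instead of from the token strings. hf fast tokenizers return them via return_offsets_mapping=True; PaddleNLP tokenizers also accept a return_offsets_mapping argument, though treating the two as equivalent here is an assumption.)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
test_string = 'elephant hydrant'
tokenized = tokenizer(test_string, return_offsets_mapping=True)

# offset_mapping[i] is the (start, end) character span covered by token i;
# special tokens such as [CLS] and [SEP] get (0, 0), so they claim no characters.
char_to_token = [None] * len(test_string)
for token_index, (start, end) in enumerate(tokenized['offset_mapping']):
    for char_index in range(start, end):
        char_to_token[char_index] = token_index

print(char_to_token)
# expected to match the output above:
# [1, 1, 1, 1, 1, 1, 1, 1, None, 2, 2, 2, 2, 2, 3, 3]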

Li-fAngyU commented 1 year ago

Thanks for the explanation; the char_to_token problem has been solved.

wj-Mcat commented 1 year ago

> Thanks for the explanation; the char_to_token problem has been solved.

I am curious how you solved it. Could you share your solution, so that other developers who hit a similar problem can benefit from it?

Looking forward to your sharing, and many thanks.

Li-fAngyU commented 1 year ago

Sure, and sorry for only seeing this now. My solution may only apply to the scenario I am working with. The idea: follow the mapping pattern of hf's char_to_token to build the corresponding char-to-token mapping list. Compared with hf, it additionally needs the tokenizer to be passed in:

def get_char_to_token_list(tokenized, tokenizer):
    tokens = tokenizer.convert_ids_to_tokens(tokenized['input_ids'])
    count = 1
    seq = []
    last_char = '.'
    for i in tokens:
        if '[CLS]' in i or '[SEP]' in i:
            continue
        elif '#' in i:   # a '##' prefix means this token is a sub-word continuation
            for j in range(len(i)-2):
                seq.append(count)
        elif '.' in i:
            seq.append(count)
            seq.append(None)  # None stands for the space
        else:
            if last_char != '.':  # check whether this token starts a new phrase; in my case the character '.' separates the phrases, e.g. "boat. traffic light. motorcycle"
                seq.append(None)
            for j in range(len(i)):
                seq.append(count)
        last_char = i
        count += 1
    return seq

char_to_token_list = get_char_to_token_list(tokenized, tokenizer)
# char_to_token_list[i] <--> hf tokenized.char_to_token(i)

In the end, char_to_token_list[i] is equivalent to hf's tokenized.char_to_token(i). I hope this helps.
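
(A quick way to sanity-check a helper like this, sketched here with the two test strings from this thread: build the mapping from the paddlenlp tokenizer and compare it position by position against hf's char_to_token. This assumes get_char_to_token_list from the snippet above is already defined in the same session.)

from paddlenlp.transformers import AutoTokenizer as PDAutoTokenizer
from transformers import AutoTokenizer as HFAutoTokenizer

pd_tokenizer = PDAutoTokenizer.from_pretrained('bert-base-uncased')
hf_tokenizer = HFAutoTokenizer.from_pretrained('bert-base-uncased')

for test_string in ['asd. asd. aaa', 'elephant hydrant']:
    pd_tokenized = pd_tokenizer(test_string)
    hf_tokenized = hf_tokenizer(test_string)
    mapping = get_char_to_token_list(pd_tokenized, pd_tokenizer)
    for i in range(len(test_string)):
        # any mismatch points at a case the helper does not cover yet
        assert mapping[i] == hf_tokenized.char_to_token(i), (test_string, i)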

github-actions[bot] commented 1 year ago

This issue is stale because it has been open for 60 days with no activity.

github-actions[bot] commented 1 year ago

This issue was closed because it has been inactive for 14 days since being marked as stale.