Wordpiece tokenizer tokenize_with_span(text) should return the span begin and end relative to the text

asyml / texar-pytorch

Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/

https://asyml.io

Apache License 2.0

744 stars, 118 forks

Issue #343

Status: Closed (jennyzhang-petuum closed this issue 2 years ago)

jennyzhang-petuum commented 2 years ago

The WordPiece tokenizer's tokenize_with_span(text) method should return each span's begin and end relative to the input text rather than relative to the token. Since whitespace_tokenize is already called on the input text within the function to get the tokens, the function has what it needs to return text-relative spans directly, which would spare callers another round of tokenization to recover the offsets.
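
To illustrate the requested behavior, here is a minimal standalone sketch, not the texar-pytorch implementation. The function name tokenize_with_span_sketch, the toy vocabulary, the "##" continuation prefix, the "[UNK]" token, and the end-exclusive span convention are all assumptions made for illustration. It finds whitespace-delimited words together with their character offsets, runs greedy longest-match WordPiece on each word, and adds the word's offset to each piece so the spans index into the original text.

```python
import re
from typing import List, Optional, Set, Tuple

# (piece, begin, end) with begin/end as character offsets into the input text.
# The end offset is exclusive here; that convention is an assumption.
Span = Tuple[str, int, int]


def tokenize_with_span_sketch(
    text: str, vocab: Set[str], unk_token: str = "[UNK]"
) -> List[Span]:
    """Hypothetical helper: WordPiece pieces with text-relative spans."""
    spans: List[Span] = []
    # re.finditer keeps each word's position in the text, so no second
    # tokenization pass is needed to recover the offsets.
    for match in re.finditer(r"\S+", text):
        word, word_begin = match.group(), match.start()
        pieces: Optional[List[Span]] = []
        start = 0
        while start < len(word):
            end = len(word)
            found = None
            # Greedy longest-match-first search for the next piece.
            while start < end:
                candidate = word[start:end]
                if start > 0:
                    candidate = "##" + candidate  # continuation piece
                if candidate in vocab:
                    found = candidate
                    break
                end -= 1
            if found is None:
                pieces = None  # word cannot be segmented with this vocab
                break
            # Shift the token-relative offsets by the word's text offset.
            pieces.append((found, word_begin + start, word_begin + end))
            start = end
        if pieces is None:
            spans.append((unk_token, word_begin, match.end()))
        else:
            spans.extend(pieces)
    return spans


toy_vocab = {"hello", "un", "##aff", "##able"}  # toy vocabulary, illustration only
example = tokenize_with_span_sketch("hello  unaffable xyz", toy_vocab)
print(example)
# [('hello', 0, 5), ('un', 7, 9), ('##aff', 9, 12), ('##able', 12, 16), ('[UNK]', 17, 20)]
```

With spans expressed this way, text[begin:end] recovers the surface form of each piece directly, even when the input contains repeated whitespace.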

gpengzhi commented 2 years ago

Resolved