asyml / texar-pytorch

Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
https://asyml.io
Apache License 2.0

A bug in GPT2Tokenizer. #313

Closed: tanyuqian closed this issue 3 years ago

tanyuqian commented 4 years ago

GPT2Tokenizer fails to recover the sentence "BART is a seq2seq model." from its encoded ids. The decoded sentence is "BART is a seqseq model.", with the digit "2" dropped. The problem appears to be related to how numbers are processed.

A script that reproduces the bug is here: https://github.com/tanyuqian/texar-pytorch/blob/master/examples/bart/gpt2_tokenizer_bug.py
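For readers who cannot open the linked script, here is a minimal sketch of the kind of encode/decode round-trip check that would expose the symptom reported above. It is not the linked script. `roundtrip_mismatch`, `toy_encode`, `toy_decode`, and `toy_lossy_decode` are hypothetical names introduced for illustration, and the toy codec is a stand-in that only imitates the digit-dropping behavior; it is not texar's implementation. To test the real tokenizer, one would pass its own text-to-ids and ids-to-text methods as `encode` and `decode`.

```python
from typing import Callable, List, Optional


def roundtrip_mismatch(
    text: str,
    encode: Callable[[str], List[int]],
    decode: Callable[[List[int]], str],
) -> Optional[str]:
    """Return the decoded text if it differs from the input, else None."""
    decoded = decode(encode(text))
    return None if decoded == text else decoded


# Hypothetical stand-in codec (an assumption, NOT texar's GPT2Tokenizer):
# each character is encoded as its Unicode code point.
def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


def toy_decode(ids: List[int]) -> str:
    """Lossless decode: inverts toy_encode exactly."""
    return "".join(chr(i) for i in ids)


def toy_lossy_decode(ids: List[int]) -> str:
    """Lossy decode that silently drops digits, imitating the reported symptom."""
    return "".join(chr(i) for i in ids if not chr(i).isdigit())


if __name__ == "__main__":
    sentence = "BART is a seq2seq model."
    # The lossy codec fails the round trip and returns the damaged text.
    print(roundtrip_mismatch(sentence, toy_encode, toy_lossy_decode))
    # The lossless codec passes, so the check returns None.
    print(roundtrip_mismatch(sentence, toy_encode, toy_decode))
```

Running the same check over a handful of sentences containing digits would help confirm whether the loss is specific to numbers.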