Thanks for submitting this issue.
Is there a particular reason that the end of sentence token and the beginning of sentence token cannot be added after the fact?
By the way, your solution may be the correct approach; Google's Subword Tokenizer handles special tokens in a similar way: https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/encoders/text/subword_text_tokenizer.py#L101
It uses Unicode escapes to encode various special characters.
In my case I'm building sequence-to-sequence models, and the decoder needs the BOS and EOS tokens.
Let's suppose you have the following sample:
[{'source': 'hello world!', 'target': 'hello'}, {'source': 'Bye', 'target': 'goodbye'}]
If you apply a collate function you get:
{'source': ['hello world!', 'Bye' ], 'target': ['hello', 'goodbye'] }
Now, if you apply batch_encode to the 'source', you get something like:
tensors = torch.tensor([[7, 19],[14, 0]])
lengths = torch.tensor([2,1])
If I now want to add the BOS and EOS token indexes, it's harder because I have to expand the tensors and increment the lengths... I just think it's easier to add these tokens directly into the strings before vectorization.
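To make that concrete, here is a minimal sketch of the tensor surgery this would require; BOS_IDX, EOS_IDX and PAD_IDX are hypothetical values chosen for illustration, not indexes defined by the library:

import torch

BOS_IDX, EOS_IDX, PAD_IDX = 2, 3, 0  # hypothetical indexes, for illustration only

# Padded batch and lengths as returned by batch_encode above.
tensors = torch.tensor([[7, 19], [14, 0]])
lengths = torch.tensor([2, 1])

# Widen the batch by two columns and re-place every EOS according to its row's length.
batch_size, max_len = tensors.shape
out = torch.full((batch_size, max_len + 2), PAD_IDX, dtype=tensors.dtype)
out[:, 0] = BOS_IDX                                    # BOS at the front of every row
out[:, 1:max_len + 1] = tensors                        # copy the original tokens
out[torch.arange(batch_size), lengths + 1] = EOS_IDX   # EOS right after the last real token
lengths = lengths + 2

print(out)      # tensor([[ 2,  7, 19,  3], [ 2, 14,  3,  0]])
print(lengths)  # tensor([3, 2])

Prepending and appending the markers to the strings before vectorization avoids all of this bookkeeping.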
Great. Sure thing.
There is an append_eos parameter that helps with appending an EOS token. I'd be happy to accept a PR for an append_sos or append_bos parameter to be included.
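For reference, a minimal sketch of the existing flag, assuming it is exposed on StaticTokenizerEncoder's constructor (the exact signature may differ between versions):

from torchnlp.encoders.text import StaticTokenizerEncoder

# With append_eos=True, every encoded sequence ends with the EOS index.
encoder = StaticTokenizerEncoder(["hello world", "goodbye"], append_eos=True)
print(encoder.encode("hello world"))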
I have experienced a similar problem when using CharacterEncoder:
tokenizer.encode('<s>hello</s>')
gives
tensor([ 1, 8, 1, 39, 14, 7, 7, 11, 1, 1, 8, 1])
where the token with id=1 is <unk>. I prefer adding the EOS and SOS (BOS) tokens via the tokenizer; it simplifies all the processing in the seq2seq scenario.
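A small reproduction sketch of that behavior; the training sample below is made up, so the exact ids will differ:

from torchnlp.encoders.text import CharacterEncoder

# The encoder splits the input into single characters, so '<s>' becomes
# '<', 's', '>'; characters never seen in the sample map to <unk>.
encoder = CharacterEncoder(["hello world", "goodbye"])
print(encoder.encode("<s>hello</s>"))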
Thanks for your contribution @wxl1999! append_sos is now supported :)
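With both flags, the decoder-side targets get their markers from the encoder itself; a minimal sketch, assuming the new keyword sits next to append_eos on StaticTokenizerEncoder:

from torchnlp.encoders.text import StaticTokenizerEncoder

target_encoder = StaticTokenizerEncoder(["hello", "goodbye"],
                                        append_sos=True, append_eos=True)
# Each encoded target now starts with the SOS index and ends with the EOS index,
# so no tensor surgery is needed in the collate function.
print(target_encoder.encode("goodbye"))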
Expected Behavior
encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
print(encoder.encode("<s> hello </s>"))
--CONSOLE---
tensor([3, 5, 2])
Actual Behavior
encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
print(encoder.encode("<s> hello </s>"))
--CONSOLE---
tensor([ 5, 6, 7, 8, 5, 14, 6, 7])
Explanation
Most of these tokenizers are not aware of these special tokens and end up splitting them into several pieces. For instance, the '<s>' token becomes '<', 's', '>'.
My solution to this problem was to create a method for masking special tokens and another one to restore them in place.
Then the encode function becomes:
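The original snippet is not included above; the following is only an illustrative sketch of that masking idea, with made-up names (mask_special_tokens, restore_special_tokens, the placeholder characters) rather than the library's API:

# Hypothetical placeholders: single characters the tokenizer will not split apart.
_MASKS = {'<s>': '\u2402', '</s>': '\u2403'}

def mask_special_tokens(text):
    # Replace special tokens with placeholder characters before tokenization.
    for token, mask in _MASKS.items():
        text = text.replace(token, mask)
    return text

def restore_special_tokens(tokens):
    # Put the original special tokens back after tokenization.
    reverse = {mask: token for token, mask in _MASKS.items()}
    return [reverse.get(t, t) for t in tokens]

def encode(text, tokenize, vocab, unk_index=0):
    # Tokenize the masked text, restore the markers, then map tokens to indexes.
    tokens = restore_special_tokens(tokenize(mask_special_tokens(text)))
    return [vocab.get(t, unk_index) for t in tokens]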
I don't know if this is just a problem that I have, but if not, I believe this should be handled natively.