PetrochukM / PyTorch-NLP

Basic Utilities for PyTorch Natural Language Processing (NLP)
https://pytorchnlp.readthedocs.io
BSD 3-Clause "New" or "Revised" License

Special tokens should be properly encoded by text_encoders #82

Closed: ricardorei closed this issue 4 years ago

ricardorei commented 4 years ago

Expected Behavior

encoder = MosesEncoder( ["\<s> hello This ain't funny. \</s>", "\<s> Don't? \</s>"]) print (encoder.encode("\<s> hello \</s>"))

    --CONSOLE---
    tensor([3, 5, 2])

Actual Behavior

    encoder = MosesEncoder(["<s> hello This ain't funny. </s>", "<s> Don't? </s>"])
    print(encoder.encode("<s> hello </s>"))

    --CONSOLE---
    tensor([ 5, 6, 7, 8, 5, 14, 6, 7])

Explanation

Most of these tokenizers are not aware of the special tokens and end up splitting them into several tokens. For instance, the '<s>' token becomes '<', 's', '>'.
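To make the failure mode concrete, here is a toy illustration with a naive punctuation-splitting tokenizer standing in for Moses (illustrative only, not the library's code):

    import re

    def naive_tokenize(text):
        # Split words from punctuation, as most word-level tokenizers do.
        return re.findall(r"\w+|[^\w\s]", text)

    print(naive_tokenize("<s> hello </s>"))
    # ['<', 's', '>', 'hello', '<', '/', 's', '>'] -- the markers are shredded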

My solution to this problem was to create a method for masking special tokens and another one to restore them in place.

    def _mask_reserved_tokens(self, sequence):
        # Record the reserved tokens in order of appearance and replace each one
        # with a placeholder that the tokenizer will not split.
        reserved_tokens = re.findall(r'\<pad\>|\<unk\>|\</s\>|\<s\>|\<copy\>', sequence)
        sequence = re.sub(r'\<pad\>|\<unk\>|\</s\>|\<s\>|\<copy\>', "RESERVEDTOKENMASK", sequence)
        return reserved_tokens, sequence

    def _restore_reserved_tokens(self, reserved_tokens, sequence):
        # Put the reserved tokens back, one placeholder at a time, in the same
        # order in which they were found.
        sequence = _detokenize(sequence)
        for token in reserved_tokens:
            sequence = sequence.replace('RESERVEDTOKENMASK', token, 1)
        return _tokenize(sequence)

Then the encode function becomes:

    def encode(self, sequence):
        """ Encodes a ``sequence``.

        Args:
            sequence (str): String ``sequence`` to encode.

        Returns:
            torch.Tensor: Encoding of the ``sequence``.
        """
        sequence = super().encode(sequence)
        # Hide reserved tokens from the tokenizer, tokenize, then put them back.
        reserved_tokens, sequence = self._mask_reserved_tokens(sequence)
        sequence = self.tokenize(sequence)
        sequence = self._restore_reserved_tokens(reserved_tokens, sequence)
        vector = [self.stoi.get(token, self.unknown_index) for token in sequence]
        if self.append_eos:
            vector.append(self.eos_index)
        return torch.tensor(vector)

I don't know if this is just a problem on my side, but if not, I believe this should be handled natively.

PetrochukM commented 4 years ago

Thanks for submitting this issue.

Is there a particular reason that the end of sentence token and the beginning of sentence token cannot be added after the fact?


By the way, your solution may be the correct approach; Google's Subword Tokenizer handles special tokens in a similar way: https://github.com/PetrochukM/PyTorch-NLP/blob/master/torchnlp/encoders/text/subword_text_tokenizer.py#L101

They use unicode to encode various special characters.
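A rough sketch of that escaping idea, in the spirit of the linked subword_text_tokenizer but not a copy of it: any character the tokenizer might split is rewritten as a reversible numeric escape, so a special token survives tokenization as one opaque unit.

    import re

    def escape(token):
        # Replace characters outside a safe alphabet with "\<codepoint>;".
        return "".join(c if c.isalnum() else "\\%d;" % ord(c) for c in token)

    def unescape(token):
        return re.sub(r"\\(\d+);", lambda m: chr(int(m.group(1))), token)

    print(escape("<s>"))            # \60;s\62;
    print(unescape(escape("<s>")))  # <s>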

ricardorei commented 4 years ago

In my case I'm building sequence-to-sequence models, and the decoder needs the BOS and EOS tokens.

Let's suppose you have the following sample:

[{'source': 'hello world!', 'target': 'hello'}, {'source': 'Bye', 'target': 'goodbye'}]

If you apply the collate function you get:

{'source': ['hello world!', 'Bye' ], 'target': ['hello', 'goodbye'] }

Now, if you apply batch_encode to the 'source' you get something like:

    tensors = torch.tensor([[7, 19], [14, 0]])
    lengths = torch.tensor([2, 1])

If I now want to add the BOS and EOS token indexes it is harder, because I have to expand the tensors and increment the lengths... I just think it is easier to add these tokens directly into the strings before vectorization.
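For reference, here is a minimal sketch of the tensor-side splice being described, assuming a padded batch plus a lengths tensor; the bos_index, eos_index and padding_index names are illustrative, not library API:

    import torch

    def add_bos_eos(tensors, lengths, bos_index, eos_index, padding_index=0):
        # Make room for two extra positions, write BOS at the front and EOS
        # right after each sequence's last real token, then bump the lengths.
        batch_size, max_len = tensors.shape
        out = torch.full((batch_size, max_len + 2), padding_index, dtype=tensors.dtype)
        out[:, 0] = bos_index
        out[:, 1:-1] = tensors
        for i, length in enumerate(lengths.tolist()):
            out[i, length + 1] = eos_index
        return out, lengths + 2

Compared to simply wrapping the raw strings with '<s>' and '</s>' before vectorization, this is noticeably more bookkeeping.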

PetrochukM commented 4 years ago

Great. Sure thing.

There is an append_eos parameter that helps with appending an EOS token. I'd be happy to accept a PR that adds an append_sos or append_bos parameter.
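A minimal usage sketch of that existing flag, assuming the torchnlp.encoders.text import path and the eos_index attribute from the version discussed here; the sample sentences are made up:

    from torchnlp.encoders.text import StaticTokenizerEncoder

    # With append_eos=True the encoder appends its EOS index to every encoded sequence.
    encoder = StaticTokenizerEncoder(["hello world", "goodbye world"], append_eos=True)
    encoded = encoder.encode("hello world")
    assert encoded[-1].item() == encoder.eos_index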

ksopyla commented 4 years ago

I have experienced a similar problem when using CharacterEncoder:

    tokenizer.encode('<s>hello</s>')

gives

    tensor([ 1,  8,  1, 39, 14,  7,  7, 11,  1,  1,  8,  1])

where the token with id=1 is <unk>. I prefer having the tokenizer add the EOS and SOS (BOS) tokens; it simplifies all the processing in the seq2seq scenario.
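For what it's worth, a minimal sketch of that preference, assuming CharacterEncoder forwards the append_eos flag mentioned above (the import path and eos_index attribute are taken from the version discussed in this thread; treat them as assumptions):

    from torchnlp.encoders.text import CharacterEncoder

    # The encoder itself appends EOS, so the seq2seq code never splices indexes in afterwards.
    encoder = CharacterEncoder(["hello", "goodbye"], append_eos=True)
    encoded = encoder.encode("hello")
    assert encoded[-1].item() == encoder.eos_index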

PetrochukM commented 4 years ago

Thanks for your contribution @wxl1999! append_sos is now supported :)