eole-nlp / eole

Open language modeling toolkit based on PyTorch
https://eole-nlp.github.io/eole
MIT License

prefix transform bug? #56

Open sersh88 opened 4 days ago

sersh88 commented 4 days ago

In the prefix transform I see the code that prepends the prefix:

    def _prepend(self, example, prefix):
        """Prepend `prefix` to `tokens`."""
        for side, side_prefix in prefix.items():
            if example.get(side) is not None:
                example[side] = side_prefix.split(" ") + example[side]
            elif len(side_prefix) > 0:
                example[side] = side_prefix.split(" ")
        return example

It looks like the prefix is prepended even when it's empty. I think it should be:

    def _prepend(self, example, prefix):
        """Prepend `prefix` to `tokens`."""
        for side, side_prefix in prefix.items():
            if example.get(side) is not None and len(side_prefix) > 0:
                example[side] = side_prefix.split(" ") + example[side]
            elif len(side_prefix) > 0:
                example[side] = side_prefix.split(" ")
        return example
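
To illustrate the difference, here is a quick standalone sketch (outside the transform class; the example dict and the <to_fr> prefix token are made up):

# Standalone sketch: with an empty tgt prefix, the proposed version leaves tgt untouched.
def _prepend_fixed(example, prefix):
    for side, side_prefix in prefix.items():
        if example.get(side) is not None and len(side_prefix) > 0:
            example[side] = side_prefix.split(" ") + example[side]
        elif len(side_prefix) > 0:
            example[side] = side_prefix.split(" ")
    return example

example = {"src": ["hello", "world"], "tgt": ["bonjour", "monde"]}
print(_prepend_fixed(example, {"src": "<to_fr>", "tgt": ""}))
# {'src': ['<to_fr>', 'hello', 'world'], 'tgt': ['bonjour', 'monde']}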
vince62s commented 4 days ago

What is the issue exactly?

sersh88 commented 4 days ago

If tgt_prefix, for example, is empty, the transform still adds an empty token to each tgt sequence, which is an unknown token for the model. So the whole dataset is trained with an unknown token prepended to tgt, like ['', 'some', 'other', 'tokens']. The same applies to the suffix transform.
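
To illustrate (plain Python, just the split behavior, not eole code):

# "".split(" ") does not give an empty list; it gives [''].
tgt_prefix = ""
tgt = ["some", "other", "tokens"]
print(tgt_prefix.split(" ") + tgt)  # ['', 'some', 'other', 'tokens']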

vince62s commented 2 days ago

I see, can you PR?