misbehavior of data_utils.tokenize

chaitjo / personalized-dialog

Code for the paper 'Personalization in Goal-oriented Dialog' (NeurIPS 2017 Conversational AI Workshop)

https://chaitjo.github.io/personalization-in-dialog/

MIT License

132 stars 45 forks source link

misbehavior of data_utils.tokenize #6

Closed Luolc closed 5 years ago

Luolc commented 6 years ago

Hi! I am working on personalized task-oriented dialog as well and investigating on your model.

I find a minor bug: The data_utils.tokenize cannot treat "middle-aged" as one word.

Actually, tokenize('male middle-aged') returns ['male', 'middle', '-', 'aged']. I think it should be ['male', 'middle-aged'] instead. This might affect the performance of baseline memN2N

chaitjo commented 6 years ago

Thanks for the interest and for pointing this out! I shall work on a fix and run tests to confirm if this is actually having an impact on the performance itself.

On initial thought, I don't think it is. Our experiments showed us that accuracy of all models for the 6 profiles were almost the same. (For example, see the section on Multi-task Learning in the paper: https://arxiv.org/abs/1706.07503.) In this case, the model must have trained embeddings for 'middle' and 'aged' separately, but when storing the profile memory, it takes a sum of the various embeddings it is composed of. What do you think?

Regardless, splitting 'middle-aged' is not the behavior we intended, so I shall fix soon.

chaitjo commented 6 years ago

P.S. Please make a pull request if you have already made a fix for it! :)

Luolc commented 6 years ago

I did fix it on my local code, but I also changed the code a lot to build my own model. It might be hard to create PR, sorry for that. Actually it is a one-line fix at: https://github.com/chaitjo/personalized-dialog/blob/master/MemN2N/data_utils.py#L56 re.split('(\W+)?', sent) -> re.split('([^A-Za-z0-9-]+)', sent) add - and the ? is unnecessary.

P.S. should also on data_utils in other folders.

Luolc commented 6 years ago

I also found that the accuracy won't be affected by this minor thing based on my experiment. It is reasonable as 'middle-aged' could be seen as the sum of 'middle' and 'aged' in the embedding space. I agree with this.