Closed: Luolc closed this issue 5 years ago.
Thanks for the interest and for pointing this out! I shall work on a fix and run tests to confirm whether this actually has an impact on performance.
On initial thought, I don't think it does. Our experiments showed that the accuracy of all models for the 6 profiles was almost the same. (For example, see the section on Multi-task Learning in the paper: https://arxiv.org/abs/1706.07503.) In this case, the model must have trained embeddings for 'middle' and 'aged' separately, but when storing the profile memory, it takes the sum of the embeddings of the tokens the profile is composed of. What do you think?
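For intuition, here is a minimal sketch (plain NumPy; the vocabulary and dimensions are made up for illustration, not taken from the repo) of why the two tokenizations should carry comparable information: a MemN2N memory line is embedded as the sum of its token embeddings, so the split profile still maps to a single fixed vector that the model can learn to use.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim = 20

# Hypothetical vocabulary covering both tokenizations of the profile.
vocab = {'male': 0, 'middle': 1, 'aged': 2, '-': 3, 'middle-aged': 4}
A = rng.standard_normal((len(vocab), embed_dim))  # embedding matrix

def embed_memory(tokens):
    # Bag-of-words memory representation: sum of the token embeddings.
    return sum(A[vocab[t]] for t in tokens)

# Buggy tokenization: the profile vector is A['middle'] + A['-'] + A['aged'].
buggy = embed_memory(['male', 'middle', '-', 'aged'])
# Fixed tokenization: the profile vector is A['middle-aged'].
fixed = embed_memory(['male', 'middle-aged'])
# Either way the profile is a single fixed vector that uniquely identifies it,
# which is consistent with accuracy being essentially unchanged.
```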
Regardless, splitting 'middle-aged' is not the behavior we intended, so I shall fix it soon.
P.S. Please make a pull request if you have already made a fix for it! :)
I did fix it in my local code, but I have also changed the code a lot to build my own model. It might be hard to create a PR, sorry about that.
Actually, it is a one-line fix at https://github.com/chaitjo/personalized-dialog/blob/master/MemN2N/data_utils.py#L56: change `re.split('(\W+)?', sent)` to `re.split('([^A-Za-z0-9-]+)', sent)`. This adds `-` to the set of word characters so hyphens stay inside tokens, and drops the `?`, which is unnecessary.
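For reference, here is how the fixed line behaves, assuming the surrounding filter is the usual bAbI-style list comprehension that drops empty and whitespace tokens (the link above only pins down the `re.split` call itself):

```python
import re

def tokenize(sent):
    # Split on runs of characters that are neither alphanumeric nor '-',
    # so hyphenated words like 'middle-aged' stay intact.
    return [x.strip() for x in re.split(r'([^A-Za-z0-9-]+)', sent) if x.strip()]

print(tokenize('male middle-aged'))  # -> ['male', 'middle-aged']
```

As a side benefit, the new pattern can no longer match an empty string; `re.split` handles zero-width matches differently across Python versions, so dropping the `?` avoids that edge case entirely.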
P.S. The same fix should also be applied to `data_utils.py` in the other folders.
Based on my experiments, I also found that the accuracy is not affected by this minor issue. That is reasonable, since 'middle-aged' can be seen as the sum of 'middle' and 'aged' in the embedding space. I agree with your explanation.
Hi! I am working on personalized task-oriented dialog as well and investigating your model.
I found a minor bug: `data_utils.tokenize` cannot treat "middle-aged" as one word. Actually, `tokenize('male middle-aged')` returns `['male', 'middle', '-', 'aged']`. I think it should be `['male', 'middle-aged']` instead. This might affect the performance of the baseline MemN2N.
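A minimal reproduction, assuming `tokenize` wraps the `re.split` call in the usual bAbI-style filter (the wrapper shown here is presumed, not copied from the repo):

```python
import re

def tokenize(sent):
    # Presumed current implementation around data_utils.py#L56.
    # '\W' excludes '-' from word characters, so the hyphen becomes
    # its own token. Note: newer Python 3 versions treat patterns that
    # can match the empty string differently in re.split.
    return [x.strip() for x in re.split(r'(\W+)?', sent) if x and x.strip()]

print(tokenize('male middle-aged'))
# On Python 2 (which skips zero-width matches): ['male', 'middle', '-', 'aged']
```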