Closed by fgregg 9 years ago
My first guess would be that this is a general property of the model: one can interpret it as "changing 'bar' to 'foo1' is easier than changing 'foo1' to 'bar'", which is possible if you added asymmetrical features like the 'character' features.
But in that case, why are the probabilities also the same when the lengths are the same? It might therefore be a bug. Can you give a bit more information, such as the features and training examples?
[Edit: had the wrong order]
I trained pyhacrf using some real business names that are coreferent: https://github.com/datamade/highered/blob/master/example.py#L11-57
The parameters are here: https://github.com/datamade/highered/blob/master/highered/__init__.py#L10-L28
You can reproduce the asymmetry by running the example.py in the highered repo.
After going over your code and results, it seems that the original result is correct. The model learns different weights for inserting and deleting characters. Therefore, if the two sequences have different lengths, we should expect them to almost always have different probabilities when the order is swapped.
So the probability of deleting the '1' in 'foo1' in `ed('foo1', 'bar')` is lower than the probability of inserting the '1' into 'bar' in `ed('bar', 'foo1')`. This is because there are more training examples where characters are deleted than examples where characters are inserted. The fact that `ed('foo', 'bar') = ed('bar', 'foo')` is a coincidence in this case, because the strings have the same length.
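A toy weighted edit distance (not the pyhacrf model itself, just a hypothetical sketch) shows the same effect: when deletions and insertions carry different costs, strings of different lengths score differently depending on order, while equal-length strings that only need substitutions score the same either way.

```python
def weighted_ed(s, t, ins=1.0, dele=2.0, sub=1.0):
    """Edit distance transforming s into t, with separate costs for
    inserting (ins), deleting (dele), and substituting (sub) a character.
    The cost asymmetry between ins and dele mimics a model that learned
    different weights for insertions and deletions."""
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + dele      # delete everything from s
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + ins       # insert everything in t
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0.0 if s[i - 1] == t[j - 1] else sub
            d[i][j] = min(d[i - 1][j] + dele,        # delete s[i-1]
                          d[i][j - 1] + ins,         # insert t[j-1]
                          d[i - 1][j - 1] + cost)    # match/substitute
    return d[m][n]

# Different lengths: order matters (deleting '1' costs more than inserting it).
print(weighted_ed('foo1', 'bar'))   # 3 substitutions + 1 deletion  = 5.0
print(weighted_ed('bar', 'foo1'))   # 3 substitutions + 1 insertion = 4.0
# Same lengths: only substitutions are needed, so the result is symmetric.
print(weighted_ed('foo', 'bar'), weighted_ed('bar', 'foo'))  # 3.0 3.0
```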
If you want a reversible model, then always add each example pair together with its inverse to the training examples.
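A minimal sketch of that symmetrization step (`symmetrize` is a hypothetical helper, not part of pyhacrf; it assumes training examples are string pairs):

```python
def symmetrize(pairs):
    """Return the training pairs with each pair's inverse appended,
    so the model sees every (a, b) alongside (b, a)."""
    out = []
    for a, b in pairs:
        out.append((a, b))
        out.append((b, a))
    return out

print(symmetrize([('foo1', 'bar')]))
# [('foo1', 'bar'), ('bar', 'foo1')]
```

If the pairs carry labels (e.g. match/non-match), the inverse pair should be given the same label as the original.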
Okay, thanks!
If the strings are of different lengths, then the predicted probability depends on the order: