dirko / pyhacrf

Hidden alignment conditional random field for classifying string pairs.
BSD 3-Clause "New" or "Revised" License

Order matters in predicted probability of match #4

Closed fgregg closed 9 years ago

fgregg commented 9 years ago

If the strings are of different length, then predicted probability depends upon order:

> print(ed('foo1', 'bar'))
0.459472080321
> print(ed('bar', 'foo1'))
0.506212489757
> print(ed('foo', 'bar'))
0.496366272811
> print(ed('bar', 'foo'))
0.496366272811
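
(For context, a minimal sketch of how an `ed` helper like the one above might be set up with pyhacrf. The training pairs, labels, and the column picked out of `predict_proba` are assumptions for illustration, not the reporter's actual code.)

```python
# Sketch of an ed(a, b) helper built on pyhacrf (assumed setup, not the
# reporter's actual code from the highered repo).
from pyhacrf import Hacrf, StringPairFeatureExtractor

# Placeholder training data: coreferent pairs labelled 'match'.
training_pairs = [('foo1', 'foo'), ('bar inc', 'bar'), ('foo', 'bar')]
training_labels = ['match', 'match', 'non-match']

extractor = StringPairFeatureExtractor(match=True, numeric=True)
X = extractor.fit_transform(training_pairs)

model = Hacrf()
model.fit(X, training_labels)

def ed(a, b):
    """Predicted probability that a and b refer to the same entity."""
    features = extractor.transform([(a, b)])
    # predict_proba returns one row per pair with a column per class;
    # here column 0 is assumed to be the 'match' class.
    return model.predict_proba(features)[0][0]

print(ed('foo1', 'bar'))
print(ed('bar', 'foo1'))  # may differ from the line above, as reported
```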
dirko commented 9 years ago

My first guess would be that, in general, this is a property of the model: one can interpret it as saying that changing 'bar' to 'foo1' is easier than changing 'foo1' to 'bar' (which should be possible if you added asymmetrical features like the 'character' features).

But in that case, why are the probabilities also the same when the lengths are the same? I think it might therefore be a bug. Can you give a bit more information, such as the features and training examples?

[Edit: had the wrong order]

fgregg commented 9 years ago

I trained pyhacrf using some real business names that are coreferent: https://github.com/datamade/highered/blob/master/example.py#L11-57

The parameters are here: https://github.com/datamade/highered/blob/master/highered/__init__.py#L10-L28

You can reproduce the asymmetry by running the example.py in the highered repo.

dirko commented 9 years ago

After going over your code and results, it seems that the original result is correct. The model learns different weights for inserting and deleting characters. Therefore we should expect that if the two sequences have different lengths, they will almost always get different probabilities when the order is swapped.

So the probability of deleting the '1' from 'foo1' in ed('foo1', 'bar') is lower than that of inserting a '1' into 'bar' in ed('bar', 'foo1'). This is because there are more training examples where characters are deleted than examples where they are inserted. The fact that ed('foo', 'bar') = ed('bar', 'foo') is a coincidence in this case, because the strings have the same length.

If you want a reversible model, always add each example pair together with its inverse to the training examples, as in the sketch below.
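
(A sketch of that suggestion, assuming the training data is a list of (string_a, string_b) pairs and a parallel list of labels; the variable names are illustrative.)

```python
# Make the training data order-insensitive by adding the reverse of every
# pair with the same label, so insertions and deletions are seen equally often.
training_pairs = [('foo1', 'bar'), ('acme co', 'acme company')]
training_labels = ['non-match', 'match']

symmetric_pairs = []
symmetric_labels = []
for (a, b), label in zip(training_pairs, training_labels):
    symmetric_pairs.append((a, b))
    symmetric_labels.append(label)
    symmetric_pairs.append((b, a))   # the inverse pair, same label
    symmetric_labels.append(label)

# Train on symmetric_pairs / symmetric_labels instead of the original lists.
```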

fgregg commented 9 years ago

Okay, thanks!