I've edited 'evaluate.py' to add type scoring, as I understand it.
It works, as far as I can tell. But please tell me if I've misunderstood type scoring, or if this edit is in fact unwanted.
https://www.dropbox.com/s/59kooi8bvlod8pg/evaluate.py?dl=0
I've written comments prefixed "AC" where I've added new lines of code; I haven't deleted any pre-existing lines, just commented them out where necessary.
Andrew
Thanks Andrew, can you make a pull request please, so that we can discuss it, test it, etc.?
Yes, sure. I wanted to do this, but afaik I have to be made a collaborator to make a pull request? Or I can fork a copy of the repo and push my changes to that ..? (Sorry, I haven't done pull requests before.)
Yes, the idea is that you work in your own fork, or better, in a dedicated branch of your fork (no need to be a collaborator). This is a bit scary the first time, but it is quite a convenient way to collaborate on a project!
https://help.github.com/articles/creating-a-pull-request/ https://help.github.com/articles/about-pull-requests/
If you can't manage it, I'll do it tomorrow.
OK, thank you Mathieu. I've done that, and I think I've created a pull request. Hopefully you were notified.
See discussion in PR #14
Ok thank you Mathieu
Hi all, before merging the changes proposed by @cainesap, I want to be sure the evaluation code is correct. Here are some simple examples; please give your feedback on the results...
I also welcome other toy tests if you have better ideas, thanks!
gold = 'the dog bites the dog'
text = 'the dog bites thedog'
type_fscore 0.8571 type_precision 0.75 type_recall 1
token_fscore 0.6667 token_precision 0.75 token_recall 0.6
boundary_fscore 0.8571 boundary_precision 1 boundary_recall 0.75

text = 'thedog bites thedog'
type_fscore 0.4 type_precision 0.5 type_recall 0.3333
token_fscore 0.25 token_precision 0.3333 token_recall 0.2
boundary_fscore 0.6667 boundary_precision 1 boundary_recall 0.5

text = 'thedogbitesthe dog'
type_fscore 0.4 type_precision 0.5 type_recall 0.3333
token_fscore 0.2857 token_precision 0.5 token_recall 0.2
boundary_fscore 0.4 boundary_precision 1 boundary_recall 0.25
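For reference, here is a minimal standalone sketch of the three metrics as I understand them; it reproduces all the numbers above. Tokens are matched by their character spans in the unsegmented string, boundaries by character position, and types by set intersection of the two lexicons. The function names are mine and purely illustrative, not the ones in evaluate.py.

```python
# Minimal sketch of the three segmentation metrics, assuming:
# tokens match when their character spans in the unsegmented string
# coincide, boundaries match by character position, and types match
# by set intersection of the gold and hypothesis lexicons.
# (Illustrative only -- not the actual code from evaluate.py.)

def spans(words):
    """(start, end) character span of each token in the unsegmented string."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out

def boundaries(words):
    """Internal word-boundary positions in the unsegmented string."""
    out, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        out.add(pos)
    return out

def prf(matched, n_hyp, n_gold):
    """Precision, recall and F-score from match counts."""
    p = matched / n_hyp if n_hyp else 0.0
    r = matched / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(gold, text):
    g, h = gold.split(), text.split()
    return {
        'type': prf(len(set(g) & set(h)), len(set(h)), len(set(g))),
        'token': prf(len(set(spans(g)) & set(spans(h))), len(h), len(g)),
        'boundary': prf(len(boundaries(g) & boundaries(h)),
                        len(boundaries(h)), len(boundaries(g))),
    }
```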
A few more tests (all seem OK, I'm merging the changes):
text = 'thedogbitest hedog'
type_fscore 0 type_precision 0 type_recall 0
token_fscore 0 token_precision 0 token_recall 0
boundary_fscore 0 boundary_precision 0 boundary_recall 0

text = 'th e dog bit es the d og'
type_fscore 0.3636 type_precision 0.25 type_recall 0.6667
token_fscore 0.3077 token_precision 0.25 token_recall 0.4
boundary_fscore 0.7273 boundary_precision 0.5714 boundary_recall 1

gold = 'the bandage of the band age'
text = 'the band age of the band age'
type_fscore 0.8889 type_precision 1 type_recall 0.8
token_fscore 0.7692 token_precision 0.7143 token_recall 0.8333
boundary_fscore 0.9091 boundary_precision 0.8333 boundary_recall 1
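Continuing the sketch above (again, just illustrative), the last example comes out identically, assuming rounding to four decimal places:

```python
gold = 'the bandage of the band age'
text = 'the band age of the band age'
for metric, (p, r, f) in evaluate(gold, text).items():
    print(metric, round(f, 4), round(p, 4), round(r, 4))
# type 0.8889 1.0 0.8
# token 0.7692 0.7143 0.8333
# boundary 0.9091 0.8333 1.0
```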
Hello,
We've noticed the type and token scores always come out identical. It seems to me that in evaluate.py there is little real difference in how the two are calculated: the type score is based on these lines, which is effectively the same as token matching, just computed over indices instead (correct me if I'm wrong).
I'd be very happy to implement a type scoring function as a pull request if I can, but first I wanted to check the intuition behind it (sorry, I did try googling for more information, but in vain). Firstly, we're talking about word types (as opposed to tokens), right? So is it supposed to be a set comparison of the gold and hypothesised lexicons? E.g. out of the hypothesis lexicon {a, b, c, d}, how many types are also found in the gold lexicon {a, b, e}? That would give p = 2/4 and r = 2/3.
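To make that concrete, here is a tiny sketch of the set comparison I have in mind (purely illustrative, not a claim about what evaluate.py currently does):

```python
# Type scoring as a set comparison of the two lexicons
# (my reading of it, not necessarily what evaluate.py should do).
gold_lexicon = {'a', 'b', 'e'}
hyp_lexicon = {'a', 'b', 'c', 'd'}
matched = gold_lexicon & hyp_lexicon    # {'a', 'b'}
p = len(matched) / len(hyp_lexicon)     # 2/4 = 0.5
r = len(matched) / len(gold_lexicon)    # 2/3 ~= 0.667
f = 2 * p * r / (p + r)                 # ~= 0.571
```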
cheers, Andrew