I've edited 'evaluate.py' to add type scoring, as I understand it.
It works, as far as I can tell. But please tell me if I've misunderstood type scoring, or if this edit is in fact unwanted.
https://www.dropbox.com/s/59kooi8bvlod8pg/evaluate.py?dl=0
I've written comments prefixed "AC" where I've added new lines of code; I haven't deleted any pre-existing lines, just commented them out where necessary.
Andrew
Thanks Andrew, can you make a pull request please, so that we can discuss it, test it, etc.?
Yes, sure. I wanted to do this, but afaik I have to be made a collaborator to make a pull request? Or I can fork a copy of the repo and push my changes to that ..? (Sorry, I haven't done pull requests before.)
Yes, the idea is that you work in your own fork, or better, in a dedicated branch of your fork (no need to be a collaborator). This is a bit scary the first time, but it is quite a convenient way to collaborate on a project!
https://help.github.com/articles/creating-a-pull-request/ https://help.github.com/articles/about-pull-requests/
If you can't manage it, I'll do it tomorrow.
OK, thank you Mathieu. I've done that, and I think I've created a pull request. Hopefully you were notified.
See discussion in PR #14
Ok thank you Mathieu
Hi all, before merging the changes proposed by @cainesap, I want to be sure the evaluation code is correct. Here are some simple examples; please give your feedback on the results...
I also welcome other toy tests if you have better ideas, thanks!
gold = 'the dog bites the dog'
text = 'the dog bites thedog'
type_fscore 0.8571 type_precision 0.75 type_recall 1
token_fscore 0.6667 token_precision 0.75 token_recall 0.6
boundary_fscore 0.8571 boundary_precision 1 boundary_recall 0.75

text = 'thedog bites thedog'
type_fscore 0.4 type_precision 0.5 type_recall 0.3333
token_fscore 0.25 token_precision 0.3333 token_recall 0.2
boundary_fscore 0.6667 boundary_precision 1 boundary_recall 0.5

text = 'thedogbitesthe dog'
type_fscore 0.4 type_precision 0.5 type_recall 0.3333
token_fscore 0.2857 token_precision 0.5 token_recall 0.2
boundary_fscore 0.4 boundary_precision 1 boundary_recall 0.25
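For reference, here is a minimal standalone sketch of the three metrics as I understand them; it reproduces all the numbers above. Tokens are matched by their character spans in the unsegmented string, boundaries by character position, and types by set intersection of the two lexicons. The function names are mine and purely illustrative, not the ones in evaluate.py.

```python
# Minimal sketch of the three segmentation metrics, assuming:
# tokens match when their character spans in the unsegmented string
# coincide, boundaries match by character position, and types match
# by set intersection of the gold and hypothesis lexicons.
# (Illustrative only -- not the actual code from evaluate.py.)

def spans(words):
    """(start, end) character span of each token in the unsegmented string."""
    out, pos = [], 0
    for w in words:
        out.append((pos, pos + len(w)))
        pos += len(w)
    return out

def boundaries(words):
    """Internal word-boundary positions in the unsegmented string."""
    out, pos = set(), 0
    for w in words[:-1]:
        pos += len(w)
        out.add(pos)
    return out

def prf(matched, n_hyp, n_gold):
    """Precision, recall and F-score from match counts."""
    p = matched / n_hyp if n_hyp else 0.0
    r = matched / n_gold if n_gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def evaluate(gold, text):
    g, h = gold.split(), text.split()
    return {
        'type': prf(len(set(g) & set(h)), len(set(h)), len(set(g))),
        'token': prf(len(set(spans(g)) & set(spans(h))), len(h), len(g)),
        'boundary': prf(len(boundaries(g) & boundaries(h)),
                        len(boundaries(h)), len(boundaries(g))),
    }
```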
A few more tests (all seem OK, I'm merging the changes):
text = 'thedogbitest hedog'
type_fscore 0 type_precision 0 type_recall 0
token_fscore 0 token_precision 0 token_recall 0
boundary_fscore 0 boundary_precision 0 boundary_recall 0

text = 'th e dog bit es the d og'
type_fscore 0.3636 type_precision 0.25 type_recall 0.6667
token_fscore 0.3077 token_precision 0.25 token_recall 0.4
boundary_fscore 0.7273 boundary_precision 0.5714 boundary_recall 1

gold = 'the bandage of the band age'
text = 'the band age of the band age'
type_fscore 0.8889 type_precision 1 type_recall 0.8
token_fscore 0.7692 token_precision 0.7143 token_recall 0.8333
boundary_fscore 0.9091 boundary_precision 0.8333 boundary_recall 1
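Continuing the sketch above (again, just illustrative), the last example comes out identically, assuming rounding to four decimal places:

```python
gold = 'the bandage of the band age'
text = 'the band age of the band age'
for metric, (p, r, f) in evaluate(gold, text).items():
    print(metric, round(f, 4), round(p, 4), round(r, 4))
# type 0.8889 1.0 0.8
# token 0.7692 0.7143 0.8333
# boundary 0.9091 0.8333 1.0
```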
Hello,
We've noticed the type and token scores always come out identical. It seems to me that in evaluate.py there is little real difference in how the two are calculated: the type score is based on these lines, which is effectively the same as token matching, just computed over indices instead (correct me if I'm wrong).
I'd be very happy to implement a type scoring function as a pull request if I can, but first I wanted to check the intuition behind it (sorry, I did try googling for more information, but in vain). Firstly, we're talking about word types (as opposed to tokens), right? So is it supposed to be a set comparison of the gold and hypothesised lexicons? E.g. out of the hypothesis lexicon {a, b, c, d}, how many types are also found in the gold lexicon {a, b, e}? That would give p = 2/4 and r = 2/3.
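To make that concrete, here is a tiny sketch of the set comparison I have in mind (purely illustrative, not a claim about what evaluate.py currently does):

```python
# Type scoring as a set comparison of the two lexicons
# (my reading of it, not necessarily what evaluate.py should do).
gold_lexicon = {'a', 'b', 'e'}
hyp_lexicon = {'a', 'b', 'c', 'd'}
matched = gold_lexicon & hyp_lexicon    # {'a', 'b'}
p = len(matched) / len(hyp_lexicon)     # 2/4 = 0.5
r = len(matched) / len(gold_lexicon)    # 2/3 ~= 0.667
f = 2 * p * r / (p + r)                 # ~= 0.571
```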
cheers, Andrew