Wikidata / StrepHit

An intelligent reading agent that understands text and translates it into Wikidata statements.
https://meta.wikimedia.org/wiki/Grants:IEG/StrepHit:_Wikidata_Statements_Validation_via_References
GNU General Public License v3.0

POS Tagger #3

Closed by marfox 8 years ago

marfox commented 8 years ago

Provide a table of candidate part-of-speech (POS) taggers, with pros and cons details. Focus on English, but multilingual support has a high priority.

marfox commented 8 years ago

| Tagger | Pros | Cons |
| --- | --- | --- |
| TreeTagger | Lots of language resources available; Python wrapper | Slow if single-threaded |
| NLTK's default | One line of code to run it | Not stated which tagger is used |

e-dorigatti commented 8 years ago

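Speed comparison: three runs, each over a freshly drawn random sample of 5000 items. TreeTagger (via treetaggerwrapper) is roughly 3.5 times faster in wall-clock time (about 25 s versus about 90 s per run), largely because it supports parallel tagging, unlike NLTK: it runs at around 270% CPU, while NLTK stays at 100%.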

$ cat ~/StrepHit/corpus/*.jsonlines | shuf -n 5000 > ~/StrepHit/corpus/benchmark/b.jsonlines
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger nltk
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o     86.96s user 0.28s system 100% cpu 1:27.21 total
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger tt
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o   tt  53.90s user 12.73s system 266% cpu 25.042 total
$ cat ~/StrepHit/corpus/*.jsonlines | shuf -n 5000 > ~/StrepHit/corpus/benchmark/b.jsonlines
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger nltk
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o     92.94s user 0.28s system 99% cpu 1:33.25 total
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger tt
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o   tt  56.78s user 12.47s system 271% cpu 25.479 total
$ cat ~/StrepHit/corpus/*.jsonlines | shuf -n 5000 > ~/StrepHit/corpus/benchmark/b.jsonlines
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger nltk
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o     90.61s user 0.21s system 100% cpu 1:30.80 total
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger tt
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o   tt  55.76s user 12.69s system 270% cpu 25.327 total
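
To make the wall-clock comparison concrete, here is a minimal Python sketch that averages the "total" fields from the three runs above and derives the speedup. The values are copied from the timings in this comment; the helper names (`to_seconds`, `mean`) are made up for illustration and are not part of StrepHit.

```python
# Wall-clock "total" fields copied from the three runs above.
NLTK_TOTALS = ["1:27.21", "1:33.25", "1:30.80"]
TT_TOTALS = ["25.042", "25.479", "25.327"]


def to_seconds(total):
    """Convert a 'total' field such as '1:27.21' or '25.042' to seconds."""
    seconds = 0.0
    for part in total.split(":"):
        # Each colon-separated part is worth 60 times the one after it.
        seconds = seconds * 60 + float(part)
    return seconds


def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


nltk_mean = mean([to_seconds(t) for t in NLTK_TOTALS])
tt_mean = mean([to_seconds(t) for t in TT_TOTALS])
speedup = nltk_mean / tt_mean

print(f"nltk: {nltk_mean:.2f}s  tt: {tt_mean:.2f}s  speedup: {speedup:.2f}x")
```

This only compares elapsed time. The user and system columns show that TreeTagger also spends somewhat less total CPU time per run, so parallelism is likely the main factor but not the only one.
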
marfox commented 8 years ago

Opted for TreeTagger.