Closed marfox closed 8 years ago
Tagger | Pros | Cons |
---|---|---|
TreeTagger | - Lots of language resources available; - Python wrapper | - Slow when run in a single thread |
NLTK's default | - One line of code to run it | - Does not state which tagger is used |
Speed comparison: three runs, each over a randomly generated sample of 5000 items. TreeTagger (via treetaggerwrapper) is roughly 3.5 times faster in wall-clock time (about 25 s versus about 90 s per run), because it supports parallel tagging, unlike NLTK.
$ cat ~/StrepHit/corpus/*.jsonlines | shuf -n 5000 > ~/StrepHit/corpus/benchmark/b.jsonlines
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger nltk
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o 86.96s user 0.28s system 100% cpu 1:27.21 total
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger tt
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o tt 53.90s user 12.73s system 266% cpu 25.042 total
$ cat ~/StrepHit/corpus/*.jsonlines | shuf -n 5000 > ~/StrepHit/corpus/benchmark/b.jsonlines
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger nltk
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o 92.94s user 0.28s system 99% cpu 1:33.25 total
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger tt
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o tt 56.78s user 12.47s system 271% cpu 25.479 total
$ cat ~/StrepHit/corpus/*.jsonlines | shuf -n 5000 > ~/StrepHit/corpus/benchmark/b.jsonlines
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger nltk
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o 90.61s user 0.21s system 100% cpu 1:30.80 total
$ time python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o /dev/null --tagger tt
python -m strephit commons pos_tag ~/StrepHit/corpus/benchmark bio en -o tt 55.76s user 12.69s system 270% cpu 25.327 total
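The timings above can be summarized with a short calculation. This is only a sketch: the numbers are copied by hand from the three `time` outputs, and the variable names are made up for illustration. It compares wall-clock time (the `total` column) with CPU time (`user` plus `system`), which helps separate the gain from parallelism from the cost of the tagger itself.

```python
from statistics import mean

# Wall-clock totals in seconds, copied from the three runs above
# (NLTK's 1:27.21, 1:33.25 and 1:30.80 converted to seconds).
nltk_wall = [87.21, 93.25, 90.80]
tt_wall = [25.042, 25.479, 25.327]

# CPU time per run: user + system, as reported by `time`.
nltk_cpu = [86.96 + 0.28, 92.94 + 0.28, 90.61 + 0.21]
tt_cpu = [53.90 + 12.73, 56.78 + 12.47, 55.76 + 12.69]

# How many times faster TreeTagger finishes, by the clock on the wall.
wall_speedup = mean(nltk_wall) / mean(tt_wall)

# How much more CPU time NLTK burns than TreeTagger for the same sample.
cpu_ratio = mean(nltk_cpu) / mean(tt_cpu)

print(f"wall-clock speedup: {wall_speedup:.1f}x")
print(f"CPU-time ratio:     {cpu_ratio:.1f}x")
```

On these numbers the wall-clock speedup comes out at about 3.6x, while the CPU-time ratio is only about 1.3x, so most of the difference comes from TreeTagger using several cores (around 270% CPU) rather than from a much cheaper tagging step.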
Opted for TreeTagger.
Provide a table of candidate part-of-speech (POS) taggers, with details on their pros and cons. Focus on English, but multilingual support has a high priority.