evanmiltenburg / Dutch-tagger

Simple perceptron tagger trained using the NLTK on the NLCOW14 corpus.

Dutch tagger

Don't use this tagger for actual research or production! Use spaCy instead (faster, more reliable). I'm leaving this up only as educational material.

This repository contains a trained part-of-speech tagger for Dutch, as well as the code used to train it. (The file cowparser.py comes from this repository.) Don't use the tagger in a production environment unless you train it yourself on other data. The code only shows how the NLTK tagger works. For real use I recommend TreeTagger, Frog, or spaCy.

Requirements:

Key facts:

How to use the tagger.

First run bash create_models.sh to create the models. Then load and use the tagger with the following code.

from nltk.tag.perceptron import PerceptronTagger

# This may take a few minutes. (But once loaded, the tagger is really fast!)
tagger = PerceptronTagger(load=False)
tagger.load('model.perc.dutch_tagger_small.pickle')

# Tag a sentence.
tagger.tag('Alle vogels zijn nesten begonnen , behalve ik en jij .'.split())

Result:

[('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'), ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'), ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'), ('jij', 'pronpers'), ('.', '$.')]
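
The tagger returns a plain list of (word, tag) tuples, so you can filter it with ordinary Python. Here is a minimal sketch that picks out the nouns. The helper name words_with_tag_prefix is made up for this example, and it assumes that all noun tags start with "noun", which is only based on the sample output above.

```python
# Tagged output copied from the example above.
tagged = [('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'),
          ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'),
          ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'),
          ('jij', 'pronpers'), ('.', '$.')]

def words_with_tag_prefix(tagged_sentence, prefix):
    """Return the words whose tag starts with `prefix` (hypothetical helper)."""
    return [word for word, tag in tagged_sentence if tag.startswith(prefix)]

# Assumption: noun tags share the prefix 'noun', as in 'nounpl' above.
nouns = words_with_tag_prefix(tagged, 'noun')
print(nouns)  # ['vogels', 'nesten']
```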

If the text is not tokenized yet, you can use the built-in tokenizer from the NLTK (be sure to download the NLTK data):

import nltk.data
from nltk.tokenize import word_tokenize

sent_tokenizer = nltk.data.load('tokenizers/punkt/dutch.pickle')

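def tokenize(text):
    # Split the text into sentences, then yield each sentence as a list of tokens.
    for sentence in sent_tokenizer.tokenize(text):
        # word_tokenize defaults to English, so ask for Dutch explicitly.
        yield word_tokenize(sentence, language='dutch')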

sentences = tokenize('Alle vogels zijn nesten begonnen, behalve ik en jij. Waar wachten wij nu op?')

for sentence in sentences:
    print(tagger.tag(sentence))

Result:

[('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'), ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'), ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'), ('jij', 'pronpers'), ('.', '$.')]
[('Waar', 'pronadv'), ('wachten', 'verbprespl'), ('wij', 'pronpers'), ('nu', 'adv'), ('op', 'adv'), ('?', '$.')]
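
If you tag several sentences, you will probably want to aggregate the results. As a rough sketch, this counts how often each tag occurs, using the two tagged sentences printed above as input. The function name count_tags is just an illustration and not part of this repository or the NLTK.

```python
from collections import Counter

# The two tagged sentences from the result above.
tagged_sentences = [
    [('Alle', 'det__indef'), ('vogels', 'nounpl'), ('zijn', 'verbprespl'),
     ('nesten', 'nounpl'), ('begonnen', 'verbpapa'), (',', 'punc'),
     ('behalve', 'conjsubo'), ('ik', 'pronpers'), ('en', 'conjcoord'),
     ('jij', 'pronpers'), ('.', '$.')],
    [('Waar', 'pronadv'), ('wachten', 'verbprespl'), ('wij', 'pronpers'),
     ('nu', 'adv'), ('op', 'adv'), ('?', '$.')],
]

def count_tags(sentences):
    """Count how often each tag occurs across all tagged sentences."""
    return Counter(tag for sentence in sentences for _, tag in sentence)

tag_counts = count_tags(tagged_sentences)
print(tag_counts.most_common(1))  # [('pronpers', 3)]
```

In real use you would pass the output of tagger.tag for each sentence instead of the hard-coded lists.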