clips / pattern

Web mining module for Python, with tools for scraping, natural language processing, machine learning, network analysis and visualization.
https://github.com/clips/pattern/wiki
BSD 3-Clause "New" or "Revised" License

lemmatization errors German verb 'sein' #95

Open markuskiller opened 10 years ago

markuskiller commented 10 years ago

@CJAnti reported a pattern-related issue to textblob-de, which uses a Python 3 compatible version of pattern.de, or the original pattern distribution (if installed) on Python 2.

https://github.com/markuskiller/textblob-de/issues/9

EXPECTED: Du bist --> Lemma: sein

Ihr seid --> Lemma: sein


# Tested on Python2.7.8, 32bit, on Windows 8.1 (64bit)

# pattern.__version__ 
# '2.6'

In [1]: from pattern.de import parse, pprint

In [2]: pprint(parse("Ich bin. Du bist. Er ist. Wir sind. Ihr seid. Sie sind.", lemmata=True))
          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Ich   PRP    NP      -      -      -      ich
           bin   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

            Du   PRP    NP      -      -      -      du
          bist   NN     NP ^    -      -      -      bist
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

            Er   PRP    NP      -      -      -      er
           ist   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Wir   PRP    NP      -      -      -      wir
          sind   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Ihr   PRP$   NP      -      -      -      ihr
          seid   NN     NP ^    -      -      -      seid
             .   .      -       -      -      -      .

          WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

           Sie   PRP    NP      -      -      -      sie
          sind   VB     VP      -      -      -      sein
             .   .      -       -      -      -      .

In [3]: pprint(parse("Ihr seid alle herzlich eingeladen zu meinem Geburtstagsfest.", lemmata=True))
           WORD   TAG    CHUNK    ROLE   ID     PNP    LEMMA

            Ihr   PRP$   NP       -      -      -      ihr
           seid   NN     NP ^     -      -      -      seid
           alle   RB     ADJP     -      -      -      alle
       herzlich   JJ     ADJP ^   -      -      -      herzlich
     eingeladen   VBN    VP       -      -      -      einladen
             zu   IN     PP       -      -      PNP    zu
         meinem   PRP$   NP       -      -      PNP    meinem
Geburtstagsfest   NN     NP ^     -      -      PNP    geburtstagsfest
              .   .      -        -      -      -      .

In [4]: pprint(parse("Du bist herzlich eingeladen zu meinem Geburtstagsfest.", lemmata=True))
           WORD   TAG    CHUNK   ROLE   ID     PNP    LEMMA

             Du   PRP    NP      -      -      -      du
           bist   NN     NP ^    -      -      -      bist
       herzlich   JJ     ADJP    -      -      -      herzlich
     eingeladen   VBN    VP      -      -      -      einladen
             zu   IN     PP      -      -      PNP    zu
         meinem   PRP$   NP      -      -      PNP    meinem
Geburtstagsfest   NN     NP ^    -      -      PNP    geburtstagsfest
              .   .      -       -      -      -      .
CJAnti commented 9 years ago

I don't know if this is going to help, but I had some time on my hands, and this is what I have found out so far.

de-lexicon.txt

Added: Du PPER (like: du PPER). This gets "Du" tagged as a personal pronoun at the start of a sentence, but it is sometimes still tagged as JJ elsewhere in a sentence (by a context rule?).

Changed: Ihr PPOSAT -> Ihr PPER PPOSAT (like: ihr PPER PPOSAT). This gets "Ihr" tagged as a personal pronoun (second person plural), but breaks its use as a possessive pronoun. The reason is that the Lexicon class of the pattern module only reads the first two space-separated fields of each line, so listing additional tags after a word has no effect.

def load(self):
    dict.update(self, (x.split(" ")[:2] for x in _read(self._path)))

Ihr PPOSAT -> ["Ihr", "PPOSAT"]
Ihr PPER PPOSAT -> ["Ihr", "PPER"]
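To make the truncation concrete, here is a minimal standalone sketch (Python 3) that mimics the slicing in load() without importing pattern. The helper names parse_lexicon_line and parse_lexicon_line_all_tags are hypothetical and not part of pattern; the second one only illustrates what keeping every tag could look like.

```python
def parse_lexicon_line(line):
    """Mimic x.split(" ")[:2] from Lexicon.load(): keep the word and the first tag only."""
    return line.split(" ")[:2]


def parse_lexicon_line_all_tags(line):
    """Hypothetical alternative (not in pattern): keep every tag listed after the word."""
    word, *tags = line.split(" ")
    return word, tags


# dict.update() accepts an iterable of two-item sequences, just like in load().
lexicon = {}
lexicon.update(parse_lexicon_line(x) for x in ["Ihr PPER PPOSAT"])

print(lexicon)                                         # the second tag is dropped
print(parse_lexicon_line_all_tags("Ihr PPER PPOSAT"))  # both tags are kept
```

With the current slicing, the lexicon can only hold one tag per word, so choosing between PPER and PPOSAT for "Ihr" would have to happen somewhere else, for example in a context rule.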

Also, POS tagging of German verbs doesn't work well. I tried adding some verb forms to de-lexicon.txt together with their correct tags, and that works most of the time (for example, bist VAFIN). Adding all of them to the lexicon would work but seems redundant, because all known verb forms are already available in de-verbs.txt, and inflect.py knows all their conjugated forms and could tag them correctly with the help of a small new function. It might help to add a function that searches the verb forms and chooses the right tag, and to call it in find_tags() right after the lexicon lookup.
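A rough sketch of what such a fallback could look like (Python 3). Everything here is an assumption for illustration: VERB_FORMS stands in for the data in de-verbs.txt, and build_verb_index and tag_token are hypothetical helpers, not functions from pattern. It also simplifies by treating every known verb form as finite (VAFIN for auxiliaries, VVFIN otherwise, following the STTS tagset), which a real implementation would need to refine for infinitives and participles.

```python
# Hypothetical stand-in for the verb data that pattern keeps in de-verbs.txt.
VERB_FORMS = {
    "sein": ["bin", "bist", "ist", "sind", "seid"],
}

# STTS tags finite auxiliary forms as VAFIN and finite full verbs as VVFIN.
AUXILIARIES = {"sein", "haben", "werden"}


def build_verb_index(verb_forms):
    """Map each conjugated form to its lemma (the first lemma seen wins)."""
    index = {}
    for lemma, forms in verb_forms.items():
        for form in forms:
            index.setdefault(form, lemma)
    return index


def tag_token(token, lexicon, verb_index, default="NN"):
    """Return (tag, lemma): lexicon first, then known verb forms, then the default tag."""
    if token in lexicon:
        return lexicon[token], None
    lemma = verb_index.get(token.lower())
    if lemma is not None:
        tag = "VAFIN" if lemma in AUXILIARIES else "VVFIN"
        return tag, lemma
    return default, None


verb_index = build_verb_index(VERB_FORMS)
for word in ["bist", "seid", "Geburtstagsfest"]:
    print(word, tag_token(word, {}, verb_index))
```

Checking the lexicon first keeps existing entries authoritative, so the verb lookup would only affect words that currently fall through to the default noun tag, such as "bist" and "seid" in the output above.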