explosion / spaCy

💫 Industrial-strength Natural Language Processing (NLP) in Python
https://spacy.io
MIT License

💫 Generic lemmatization & morphology #846

Closed honnibal closed 7 years ago

honnibal commented 7 years ago

Simple and generic lemmatization functionality can be implemented as a look-up table. The lemmatization lists here look good: http://www.lexiconista.com/datasets/lemmatization/

To make this work, we need a simple class that behaves similarly to the existing spacy.lemmatizer.Lemmatizer class. A language subclass can then create this lookup lemmatizer in its Language.Defaults.create_lemmatizer() method.

This will give us a low-effort start on lemmatization in lots of languages, that can be replaced by more sophisticated strategies on a case-by-case basis.
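A minimal sketch of what such a lookup class might look like. The class name `LookupLemmatizer`, the `from_lines` helper, and the tab-separated `lemma<TAB>form` input format are assumptions made for illustration; this is not spaCy's actual implementation, and the call signature is not meant to match the existing `Lemmatizer` exactly.

```python
from typing import Dict, Iterable, List, Optional


class LookupLemmatizer:
    """Hypothetical lookup-table lemmatizer (illustrative, not spaCy's API)."""

    def __init__(self, table: Dict[str, str]) -> None:
        # Maps a surface form to a single lemma.
        self.table = table

    @classmethod
    def from_lines(cls, lines: Iterable[str]) -> "LookupLemmatizer":
        # Assumed input format: one "lemma<TAB>form" pair per line.
        table: Dict[str, str] = {}
        for line in lines:
            line = line.strip()
            if "\t" not in line:
                continue  # skip blank or malformed lines
            lemma, form = line.split("\t", 1)
            # Keep the first lemma seen for a form; a plain table
            # cannot represent forms with several possible lemmas.
            table.setdefault(form, lemma)
        return cls(table)

    def __call__(self, string: str, pos: Optional[str] = None) -> List[str]:
        # `pos` is accepted but ignored by this lookup-only sketch.
        # Unknown forms are returned unchanged.
        return [self.table.get(string, string)]


lines = ["go\twent", "go\tgone", "mouse\tmice"]
lemmatizer = LookupLemmatizer.from_lines(lines)
print(lemmatizer("went"))   # ['go']
print(lemmatizer("spaCy"))  # ['spaCy']
```

Returning the input unchanged on a miss keeps the class usable for any language, at the cost of silently passing through forms the table does not cover.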

Wishlist extension

It would be very interesting to try a sequence-to-sequence model to generate the lemmas, using the existing lists as training. The sequence-to-sequence model would then be used when the lookup fails. I think this might perform quite well, especially if a POS tag can be supplied as a feature.

To be specific, the sequences are the characters, and the problem is analogous to neural machine translation. So, example NMT architectures would be the best place to start.
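As a rough sketch of how the data could be framed, assuming the lists provide (form, lemma) pairs: each pair becomes a source and a target character sequence, with the POS tag optionally prepended as an extra source token. The function names, the `<POS>` and `</s>` token conventions, and the `fallback` callable standing in for a trained model are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple


def to_char_pair(
    form: str, lemma: str, pos: Optional[str] = None
) -> Tuple[List[str], List[str]]:
    """Turn one (form, lemma) pair into character-level source/target sequences."""
    source = list(form)
    if pos is not None:
        # Assumed convention: supply the POS tag as a leading source token.
        source = ["<" + pos + ">"] + source
    # Assumed convention: mark the end of the target with "</s>".
    target = list(lemma) + ["</s>"]
    return source, target


def lemmatize(
    form: str,
    table: Dict[str, str],
    fallback: Callable[[str, Optional[str]], str],
    pos: Optional[str] = None,
) -> str:
    """Use the lookup table first; call the model only when the lookup fails."""
    if form in table:
        return table[form]
    return fallback(form, pos)


def identity_fallback(form: str, pos: Optional[str]) -> str:
    # Stand-in for a trained sequence-to-sequence model.
    return form


table = {"mice": "mouse"}
print(to_char_pair("mice", "mouse", "NOUN"))
print(lemmatize("mice", table, identity_fallback))
print(lemmatize("dogs", table, identity_fallback, pos="NOUN"))
```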


TerminalWitchcraft commented 7 years ago

Already started to implement!

ines commented 7 years ago

Copying over the current state from #390 and #974 (and closing those) to make this one the master issue 🎉

Comment by @Liebeck on #390:

Hi, I just wanted to point you to one of my older projects called IWNLP which produces a list of form -> lemma (e.g., Schwimmbäder -> Schwimmbad) for German words based on Wiktionary. Check out https://github.com/Liebeck/IWNLP and www.iwnlp.com or query me for more information. The produced mappings are in the form of (form, POS) -> lemma, are under a CC license, and I'd love to see them implemented ;) I might be able to dump another format (or a generic format, if specified) if that's desired? The latest evaluation results are listed here: http://www.iwnlp.com/iwnlp_results.html

Response by @honnibal on #390:

This is really nice! We'd definitely like to see this implemented. Sorry I missed this comment before!

This should be quite easy actually. If you check out the morph_rules.py file, you can see that we already have a mechanism that does (form, POS) -> lemma mapping. This is how we do lemmatization for many English words.
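For illustration only, a (form, POS) -> lemma mapping can be sketched as a dictionary keyed by tuples. The table name, the `lemma_for` helper, and the `"NOUN"` tag label are assumptions; this does not reproduce the actual layout of morph_rules.py.

```python
from typing import Dict, Tuple

# Hypothetical table keyed by (form, POS); the entry is the example from this thread.
POS_LEMMAS: Dict[Tuple[str, str], str] = {
    ("Schwimmbäder", "NOUN"): "Schwimmbad",
}


def lemma_for(form: str, pos: str, table: Dict[Tuple[str, str], str] = POS_LEMMAS) -> str:
    """Return the lemma for (form, pos), or the form itself if there is no entry."""
    return table.get((form, pos), form)


print(lemma_for("Schwimmbäder", "NOUN"))  # Schwimmbad
print(lemma_for("Schwimmbäder", "VERB"))  # Schwimmbäder
```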

Gregory-Howard commented 7 years ago

Hi, I want to finish this. I don't really get what you want in terms of code. Should I add a simple class similar to spacy.lemmatizer.Lemmatizer, such as spacy.lemmatizer.LemmatizerLookUp, which just has a call function? Or should I modify spacy.lemmatizer.Lemmatizer to allow it to be simplified?

Liebeck commented 7 years ago

I'm also not sure how to proceed. Do you want me to include the mapping from IWNLP directly into the code of spaCy? Won't there be a clash between your MIT License and the CC license (which I'm forced to use since the Wiktionary data is CC licensed)? I guess an easy way to circumvent this would be to create a file format/model which can then be downloaded by spaCy?

@Gregory-Howard Do you want to include the lemmatization mapping from IWNLP (which would be totally fine, I'm rather busy at the moment) or were you talking about something else? ;)

oroszgy commented 7 years ago

@honnibal I would argue against using such a simplistic lemmatization. There are many cases even in English where the lemma depends on the PoS of the wordform (e.g. "rung"). This problem is much more notable for morphologically complex languages. My suggestion would be to implement something like this: https://cst.dk/online/lemmatiser/uk/
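To make the "rung" case concrete, here is a small sketch that scans (form, POS, lemma) triples and reports the forms a form-only table cannot represent. The sample triples and the `ambiguous_forms` helper are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Made-up sample data: "rung" is a noun (ladder rung) or a participle of "ring".
triples = [
    ("rung", "NOUN", "rung"),
    ("rung", "VERB", "ring"),
    ("mice", "NOUN", "mouse"),
]


def ambiguous_forms(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Return the forms that map to more than one lemma, with their sorted lemmas."""
    lemmas_by_form: Dict[str, Set[str]] = defaultdict(set)
    for form, _pos, lemma in rows:
        lemmas_by_form[form].add(lemma)
    return {
        form: sorted(lemmas)
        for form, lemmas in lemmas_by_form.items()
        if len(lemmas) > 1
    }


print(ambiguous_forms(triples))  # {'rung': ['ring', 'rung']}
```

Counting such forms in a given word list would give a rough idea of how much a plain word => lemma table loses for that language.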

Gregory-Howard commented 7 years ago

@Liebeck Actually my plan was to follow what @honnibal said, with word => lemma (really limited, but it's a beginning). @oroszgy In French we have this: https://pastebin.com/NjvtQvHG (500 000 words defined), columns: word, TAG, lemma, sing/plur/m/f

ines commented 7 years ago

Closing this – PR #1024 will be implemented in v2.0. (This is obviously only a start and we'll also have a more detailed lemmatization model, now that the parser will also predict morphological analysis, see #1057.)
