Proposal: Cadmium::Lemmatizer

rmarronnier commented 5 years ago

Preface

Cadmium has a stemmer which is used downstream in several other modules. Its usefulness is not to be questioned.

However relying only on a stemmer will limit Cadmium in different ways :

Stemming a word takes out the grammatical meaning of it, which renders POS tagging impossible.
Only a handful of languages stemming algorithms are implemented. Some languages are by their nature very difficult to stem.

Lemmatization in its implementation is essentially binding a lookup table (or dictionnary) to lemmas and applying additional rules depending on the token found. i18n lemmas lookup tables are freely available and MIT compatible.

Details

Create a new lemmatizer repository in cadmiumcr
Create in it a Cadmium::Lemmatizer module inspired in its form by Cadmium::Util::StopWords
Its data folder will only contain the english json file of lemmas (file size of several Mb)
The name of the shard will be cadmium_lemmatizer

The real difficulty is, IMO, how to deal with data for other languages.

Here are several realistic possibilities :

Create a cadmium_i18n_data shard containing all languages data (might weight tens of Mo of JSON)
Create a cadmium_XX_data shard for each language containing its own data. (Can we regroup repos in a folder in Github ?)
Host somewhere the data and ask developers to download it according to their needs.
Don't provide the data at all and point developers to possible sources.

That's what I could come up with as solutions but if you have other ideas, do tell !

References

Spacy has a good implementation of lemmatizers.

You can check their github repository to have an idea of what the data is like : spanish language for example

watzon commented 5 years ago

So, from what I've seen in other repositories it seems like the standard way to do things is to have language data stored in one place where it's easily accessible, but make it so that it has to be downloaded and included manually. Tesseract does this and I know a lot of other libraries do as well. I would propose that we have a repo, not a shard, where we store a single JSON (or possibly plain text) file for each language. Then we can include in the instructions for using the Lemmatizer "go to x location and download y file for your language".

rmarronnier commented 5 years ago

a single JSON (or possibly plain text) file for each language

Exactly ! This way, we might use this unique repo to store language data even for future Cadmium tools (POS tag maps, word embeddings, etc.). It would not be limited to lemmatizer.

How do you want to proceed with the implementation ?

Should I go and start working on it in rmarronnier/lemmatizer, and we keep this issue opened until you're ok with the result and we can move it in cadmiumcr ?

watzon commented 5 years ago

Yeah let's go ahead with that. I'll created the languages repo. Let's do things in pretty much the same way Spacy is so that we don't have to fiddle with the data too much, but rather than having .json and.cr files let's do everything as json. Then we can write a script to automatically create a release .zip file for each language folder.

watzon commented 5 years ago

Ok having looked more closely at Spacy's languages it looks like a 1 for 1 might not be very feasible.

rmarronnier commented 5 years ago

Thanks for creating the languages repo.

I see the languages repo more like a pure data repository, I'm not convinced we should put code (.cr files) in there, even if specific to a language. For example, I have no problem with localized pragmatic tokenizers being present in cadmium_tokenizer. Maybe I'll change my mind when we have 10-15+ supported languages but for now, it looks overkill. And when we'll have that many (I hope soon:-)), we might be better off creating specific language shard as a convenience for the developers.

Anyway, I'll post here questions that arise while developing cadmium_lemmatizer.

watzon commented 5 years ago

Yeah I feel like everything they're doing in Spacy can be done better. Just plain json files will be fine. We just need to make everything consistent.

rmarronnier commented 5 years ago

Ok I thought stupidly I could make a lemmatizer without needing POS token info, well it's gonna be pretty limited :-p So, first, I'm designing a Cadmium::Token class with mostly POS, morphology attributes. For now, the token.cr is in the lemmatizer module, but maybe it should be elsewhere like Cadmium::Util where it could be used by the future POS Tagger. To summarize, I'll finish the lemmatizer and its needed Token class, but without a POS tagger, no token object can be created and so the lemmatizer can just use the lookup table with a raw string.

The POS tagger is going to be a huge beast to slay and I've not done enough research to make a clear and solid proposal. But feel free to flesh one out if you're more confident :-)

rmarronnier commented 5 years ago

I have a working lemmatizer here. I'll still need to work on it, especially when the POS_Tagger will be done, but the main logic is here. The Token struct will obviously grow and move elsewhere.

Are you ok with me creating the lemmatizer repo and move my code there ?

watzon commented 5 years ago

Go ahead :+1: Let's start getting the languages lib up and running soon and get the language data for the lemmatizer moved over there though.

cadmiumcr / cadmium