adbar / simplemma

Simple multilingual lemmatizer for Python, especially useful for speed and efficiency
https://adrien.barbaresi.eu/blog/simple-multilingual-lemmatizer-python.html
MIT License

Use custom dictionaries #118

Closed 1over137 closed 1 month ago

1over137 commented 5 months ago

It would be nice if the API provided a way of loading a custom dictionary without resorting to patching the data in the module. In some languages, the lemmatizer's coverage can be rather poor, and other languages are not supported at all. If this is welcome and we can agree on what the API should look like, I can implement it and open a PR. My idea would be either passing a dict argument to simplemma.lemmatize, or keeping a global state that records which extra dicts to use for each language, plus a few functions to manipulate it.

adbar commented 5 months ago

I'd prefer to work towards releasing version 1 first and see from there. That includes documenting how the sources are compiled, which I'm working on.

The API is not completely stable right now as a few things are still broken after an intensive refactoring. I'd suggest you wait with your PR until things have stabilized a bit. Using the new classes to load external dictionaries seems like a good approach.

adbar commented 3 months ago

@1over137 You can start working on a PR if you want; the API for the dictionary lookup strategy is stable. I also added info on additional dictionaries to the training readme.

juanjoDiaz commented 3 months ago

Hi guys,

Such an API is already there. You just need to implement the DictionaryFactory protocol and use it to load your custom dictionaries.
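As a rough illustration of the idea, here is a minimal, self-contained sketch that does not import simplemma. The method name get_dictionary, the str-to-str mapping type, and the helpers CustomDictionaryFactory and lookup_lemma are assumptions made up for this example; check the installed version of the library for the real protocol signature and for how a factory is passed to the lemmatizer.

```python
from typing import Dict, Mapping, Protocol


class DictionaryFactoryLike(Protocol):
    """Hypothetical stand-in for the protocol; the real signature may differ."""

    def get_dictionary(self, lang: str) -> Mapping[str, str]:
        ...


class CustomDictionaryFactory:
    """Serves user-supplied word -> lemma mappings, keyed by language code."""

    def __init__(self, dictionaries: Dict[str, Dict[str, str]]) -> None:
        self._dictionaries = dictionaries

    def get_dictionary(self, lang: str) -> Mapping[str, str]:
        # Unknown languages yield an empty mapping instead of raising.
        return self._dictionaries.get(lang, {})


def lookup_lemma(token: str, lang: str, factory: DictionaryFactoryLike) -> str:
    """Return the lemma for `token`, or the token itself if it is not listed."""
    return factory.get_dictionary(lang).get(token, token)


# Illustrative data only; a real dictionary would be loaded from a file.
factory = CustomDictionaryFactory({"en": {"mice": "mouse", "geese": "goose"}})
```

Because the factory only has to satisfy the protocol structurally, the custom data stays outside the module and nothing needs to be patched in place.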

adbar commented 2 months ago

@1over137 Did that solve your problem or do we need to work on the documentation?

juanjoDiaz commented 1 month ago

Closing as this was answered. Feel free to reopen if there are more questions.