MAIF / melusine

📧 Melusine: Use python to automatize your email processing workflow
https://maif.github.io/melusine

Modernize the Melusine tokenizer #102

Closed hugo-quantmetry closed 2 years ago

hugo-quantmetry commented 3 years ago

Description of Problem: Depending on the context, tokenization may cover different functionalities. For example:

Tokenization in Melusine is currently a hybrid which covers the following functionalities:

It seems to me that the full NLP tokenization pipeline is spread across several parts of the Melusine package (prepare_data.cleaning, nlp_tools.tokenizer and even the prepare_data method of models.train.NeuralModel).

This issue can be split into a few questions:

Overview of the Solution: I suggest creating a revamped MelusineTokenizer class with its own load and save methods. The class should neatly package many functionalities commonly found in a "Full NLP Tokenization pipeline" such as:

The tokenizer could be saved to and loaded from a human-readable JSON file.

Examples:

tokenizer = MelusineTokenizer(tokenizer_regex, stopwords, flags)
tokens = tokenizer.tokenize("Hello John how are you")
tokenizer.save("tokenizer.json")
tokenizer_reloaded = MelusineTokenizer.load("tokenizer.json")
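
To make the proposal more concrete, here is a minimal sketch of what such a class might look like. It is only an illustration under stated assumptions, not the actual Melusine implementation: the default regex, the case-insensitive stopword matching, the JSON key names and the storage of regex flags as an integer are all hypothetical choices made for this example.

```python
import json
import os
import re
import tempfile
from typing import Iterable, List, Optional


class MelusineTokenizer:
    """Hypothetical sketch of the proposed class (not Melusine's real API)."""

    def __init__(
        self,
        tokenizer_regex: str = r"\w+",
        stopwords: Optional[Iterable[str]] = None,
        flags: int = 0,
    ) -> None:
        self.tokenizer_regex = tokenizer_regex
        # Assumption: stopwords are lowercased so matching ignores case.
        self.stopwords = sorted({w.lower() for w in (stopwords or [])})
        # Assumption: re flags are kept as a plain int so they serialize to JSON.
        self.flags = int(flags)
        self._pattern = re.compile(self.tokenizer_regex, self.flags)
        self._stopword_set = set(self.stopwords)

    def tokenize(self, text: str) -> List[str]:
        """Split the text with the regex, then drop stopwords."""
        matches = (m.group(0) for m in self._pattern.finditer(text))
        return [t for t in matches if t.lower() not in self._stopword_set]

    def save(self, path: str) -> None:
        """Write the configuration to a human-readable JSON file."""
        config = {
            "tokenizer_regex": self.tokenizer_regex,
            "stopwords": self.stopwords,
            "flags": self.flags,
        }
        with open(path, "w", encoding="utf-8") as f:
            json.dump(config, f, indent=2, ensure_ascii=False)

    @classmethod
    def load(cls, path: str) -> "MelusineTokenizer":
        """Rebuild a tokenizer from a JSON file written by save()."""
        with open(path, encoding="utf-8") as f:
            config = json.load(f)
        return cls(**config)


# Usage: tokenize, save to a temporary file, reload, tokenize again.
tokenizer = MelusineTokenizer(r"\w+", stopwords=["how", "are"], flags=re.IGNORECASE)
tokens = tokenizer.tokenize("Hello John how are you")

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "tokenizer.json")
    tokenizer.save(path)
    tokenizer_reloaded = MelusineTokenizer.load(path)

reloaded_tokens = tokenizer_reloaded.tokenize("Hello John how are you")
```

Storing only the regex string, the stopword list and the integer flags keeps the saved file easy to read and edit by hand. The compiled pattern is rebuilt on load rather than serialized.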

Definition of Done: The new tokenizer class works as expected. The tokenizer can be read from and saved to a human-readable config file. The tokenizer centralizes all tokenization functionalities in the broader sense.