Description of Problem:
Depending on the context, tokenization may cover different functionalities. For example:
Gensim (gensim.utils.tokenize): tokenization is limited to splitting the text into tokens.
HuggingFace tokenizers (encode methods): full NLP tokenization pipeline, including text normalization, pre-tokenization, the tokenizer model and post-processing.
Tokenization in Melusine is currently a hybrid that covers the following functionalities:
Splitting
Stopwords removal
Name flagging
It seems to me that the full NLP tokenization pipeline is somewhat spread across the Melusine package (prepare_data.cleaning, nlp_tools.tokenizer and even the prepare_data method of models.train.NeuralModel).
This issue can be split into a few questions:
How can we refactor the code to make the full tokenization pipeline stand out?
How can we easily configure the tokenization pipeline? (e.g. a user-friendly, readable tokenizer.json file)
How can we package the tokenizer to ensure reproducibility?
Overview of the Solution:
I suggest creating a revamped MelusineTokenizer class with its own load and save methods.
The class should neatly package many functionalities commonly found in a "Full NLP Tokenization pipeline" such as:
Text cleaning ?
Flagging (phone numbers, email addresses, etc)
Splitting
Stopwords removal
The tokenizer could be saved to and loaded from a human-readable JSON file.
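As an illustration, the tokenizer.json file could look like the following (all field names, regexes and flag tokens here are hypothetical, just to convey the idea of a human-readable config):

```json
{
  "tokenizer_regex": "\\w+(?:[?\\-'_]\\w+)*",
  "stopwords": ["le", "la", "les", "de", "un", "une"],
  "flags": {
    "\\b0[0-9](?:[ .-]?[0-9]{2}){4}\\b": "flag_phone_",
    "\\b[\\w.+-]+@[\\w-]+\\.[\\w.]+\\b": "flag_mail_"
  }
}
```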
Examples:
tokenizer = MelusineTokenizer(tokenizer_regex, stopwords, flags)
tokens = tokenizer.tokenize("Hello John how are you")
tokenizer.save("tokenizer.json")
tokenizer_reloaded = MelusineTokenizer.load("tokenizer.json")
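To make the proposal concrete, here is a minimal sketch of what such a class could look like. This is not the actual Melusine API: the constructor arguments, the flag format ({pattern: replacement_token}) and the JSON schema are all assumptions made for illustration.

```python
import json
import re


class MelusineTokenizer:
    """Hypothetical tokenizer packaging flagging, splitting and stopwords removal."""

    def __init__(self, tokenizer_regex, stopwords=None, flags=None):
        self.tokenizer_regex = tokenizer_regex
        self.stopwords = set(stopwords or [])
        # flags maps a regex pattern to a replacement token,
        # e.g. {r"\b0[0-9](?:[ .-]?[0-9]{2}){4}\b": "flag_phone_"}
        self.flags = flags or {}

    def tokenize(self, text):
        # 1. Flagging: replace entities (phone numbers, emails, names, ...) by flag tokens
        for pattern, replacement in self.flags.items():
            text = re.sub(pattern, replacement, text)
        # 2. Splitting: extract tokens with the tokenizer regex
        tokens = re.findall(self.tokenizer_regex, text.lower())
        # 3. Stopwords removal
        return [token for token in tokens if token not in self.stopwords]

    def save(self, path):
        # Persist all parameters to a human-readable JSON file
        with open(path, "w", encoding="utf-8") as f:
            json.dump(
                {
                    "tokenizer_regex": self.tokenizer_regex,
                    "stopwords": sorted(self.stopwords),
                    "flags": self.flags,
                },
                f,
                indent=2,
            )

    @classmethod
    def load(cls, path):
        # Rebuild an identical tokenizer from the JSON file (reproducibility)
        with open(path, encoding="utf-8") as f:
            params = json.load(f)
        return cls(params["tokenizer_regex"], params["stopwords"], params["flags"])
```

Since every parameter lives in the JSON file, save followed by load yields a tokenizer that produces the same tokens, which addresses the reproducibility question above.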
Definition of Done:
The new tokenizer class works as expected.
The tokenizer can be read from / saved to a human-readable config file.
The tokenizer centralizes all tokenization functionalities in the broader sense.