deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Add support for custom trained PunktTokenizer in PreProcessor #2780

Closed tstadel closed 2 years ago

tstadel commented 2 years ago

Discussed in https://github.com/deepset-ai/haystack/discussions/2773

Originally posted by **danielbichuetti** July 6, 2022:

Hi! Today, the PreProcessor makes use of the NLTK PunktTokenizer. The default model is great, except for some specific domains, like the legal one, where there are many abbreviations that it handles poorly. I would like to propose, and offer to implement, the possibility of setting a custom trained PunktTokenizer for any set of languages. For example, the user defines a directory containing model files named after the ISO language codes; Haystack would then search there for the requested language and, if nothing is found, fall back to the default model. What do you think about this feature? It wouldn't interfere with existing behaviour and would just improve specific cases (of which there are many in the NLP domain). Have a great day!

**Is your feature request related to a problem? Please describe.** Today, the PreProcessor makes use of the NLTK PunktTokenizer. The default model is great, except for some specific domains, like the legal one, where there are many abbreviations that it handles poorly.

**Describe the solution you'd like** Introduce a tokenizer_model_folder parameter on PreProcessor, pointing to a directory where custom models are stored under file names that follow the ISO language-code pattern.

If a model for that specific language is present in this folder, the PreProcessor would use it; if not, it would fall back to the default one. Since pre-processing is an NLP task closely tied to the domain of the text and the task at hand, a user could keep one folder for legal models, another for medical ones, and so on, and, when calling the PreProcessor, set a parameter model_folder= pointing to the chosen folder.
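
A minimal sketch of how that lookup could work, assuming the custom models are pickled PunktSentenceTokenizer objects stored as `<language>.pickle` (the helper name and file layout are illustrative, not the final implementation):

```python
import pickle
from pathlib import Path
from typing import Optional

import nltk


def load_sentence_tokenizer(tokenizer_model_folder: Optional[str], language: str = "english"):
    """Return a custom Punkt tokenizer for `language` if one is present, else NLTK's default."""
    if tokenizer_model_folder is not None:
        model_path = Path(tokenizer_model_folder) / f"{language}.pickle"
        if model_path.exists():
            with open(model_path, "rb") as file:
                return pickle.load(file)
    # Fall back to the pre-trained model shipped with NLTK (requires nltk.download("punkt"))
    return nltk.data.load(f"tokenizers/punkt/{language}.pickle")
```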

**Describe alternatives you've considered** Another idea would be to support different sentence tokenizers in general. This would, however, be more time-consuming, as passing anything other than primitives to a node's init is discouraged because it doesn't work with YAML definitions.

**Additional context** @danielbichuetti tested the default models of NLTK, spaCy, and Stanza. The best default model for Portuguese (my scenario) was Stanza's, but NLTK could be improved a lot by running PunktTrainer on a small corpus of legal documents containing some of these abbreviations. Such sentence-splitting errors probably occur in other domains that use many dots inside sentences as well.

When testing with GPT-3, which has a huge maximum token size, @danielbichuetti saw questions that were answered in the text go unanswered, simply because of the bad sentence splits. If a split breaks up the legal foundation (article) of a judicial decision, models cannot infer correctly; the same happens when it breaks up the judge's name, and so on. In the legal domain, these abbreviations often carry very important information.

danielbichuetti commented 2 years ago

I would like to add this text as a simple reference where the default sentence tokenizer model does a poor job:

Direito civil. Ação de rescisão contratual cumulada com restituição de
valores pagos e reparação de danos materiais. Prequestionamento.
Ausência. Súmula 282/STF. Contrato de compra e venda de imóvel.
Alienação fiduciária em garantia. Código de Defesa do Consumidor, art.
53. Não incidência. 1. Ação de rescisão contratual cumulada com
restituição. A Lei nº 9.514/1997, que instituiu a alienação fiduciária de
bens imóveis, é norma especial e posterior ao Código de Defesa do
Consumidor – CDC. Em tais circunstâncias, o inadimplemento do
devedor fiduciante enseja a aplicação da regra prevista nos arts. 26 e 27
da lei especial” (REsp 1.871.911/SP, rel. Min. Nancy Andrighi, DJe
25/8/2020).
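
A quick way to reproduce the behaviour is to run the excerpt through NLTK's default Portuguese Punkt model and inspect where it breaks on abbreviations such as `art.`, `rel.`, and `Min.` (this assumes the `punkt` data has been downloaded; the text below is shortened for brevity):

```python
import nltk

nltk.download("punkt", quiet=True)

# Paste the full excerpt quoted above into `text`; shortened here for brevity.
text = (
    "Direito civil. Ação de rescisão contratual cumulada com restituição de "
    "valores pagos e reparação de danos materiais. Prequestionamento. Ausência. "
    "Súmula 282/STF. Código de Defesa do Consumidor, art. 53. Não incidência."
)

for i, sentence in enumerate(nltk.sent_tokenize(text, language="portuguese"), start=1):
    print(i, repr(sentence))
```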
danielbichuetti commented 2 years ago

@tstadel While working on the first draft for this feature, I came up with an idea that might be useful. The current PR allows specifying a custom models folder. What if, on top of that, we added the possibility of training the PunktTokenizer on Haystack Documents? The user could add some abbreviations that are more domain-specific and just ask Haystack to train based on their Documents. At the moment this is just an idea, but I imagine something like PreProcessor.train(documents, common_abbreviations, language, output_folder). That output folder would then be the input for the PreProcessor's tokenizer_model_folder.

Does this make any sense?
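
A rough sketch of what such a helper could do under the hood, using NLTK's PunktTrainer; the function name, signature, and file layout follow the idea above and are purely hypothetical, not an existing Haystack API:

```python
import pickle
from pathlib import Path
from typing import List

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def train_sentence_tokenizer(
    document_texts: List[str],        # e.g. [doc.content for doc in documents]
    common_abbreviations: List[str],  # e.g. ["art", "arts", "rel", "min"]
    language: str,
    output_folder: str,
) -> Path:
    """Train a domain-specific Punkt model and store it where a
    tokenizer_model_folder-style lookup could later find it."""
    trainer = PunktTrainer()
    trainer.INCLUDE_ALL_COLLOCS = True
    trainer.train("\n".join(document_texts), finalize=False)

    # Punkt stores abbreviations lower-cased and without the trailing dot
    params = trainer.get_params()
    for abbreviation in common_abbreviations:
        params.abbrev_types.add(abbreviation.lower().rstrip("."))

    tokenizer = PunktSentenceTokenizer(params)
    output_path = Path(output_folder) / f"{language}.pickle"
    output_path.parent.mkdir(parents=True, exist_ok=True)
    with open(output_path, "wb") as file:
        pickle.dump(tokenizer, file)
    return output_path
```

The resulting pickle could then be picked up by the folder-based fallback sketched earlier in this thread.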

masci commented 2 years ago

Can we close this now that #2783 has been merged?