I would like to add this text as a simple reference case where the default sentence tokenizer model does a poor job:
> Direito civil. Ação de rescisão contratual cumulada com restituição de valores pagos e reparação de danos materiais. Prequestionamento. Ausência. Súmula 282/STF. Contrato de compra e venda de imóvel. Alienação fiduciária em garantia. Código de Defesa do Consumidor, art. 53. Não incidência. 1. Ação de rescisão contratual cumulada com restituição. "A Lei nº 9.514/1997, que instituiu a alienação fiduciária de bens imóveis, é norma especial e posterior ao Código de Defesa do Consumidor – CDC. Em tais circunstâncias, o inadimplemento do devedor fiduciante enseja a aplicação da regra prevista nos arts. 26 e 27 da lei especial" (REsp 1.871.911/SP, rel. Min. Nancy Andrighi, DJe 25/8/2020).
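For a quick way to see the mis-splits, here is a minimal sketch with the default NLTK model (shortened excerpt from the text above; exact split points may vary by NLTK version):

```python
# Minimal reproduction with the default Portuguese Punkt model.
import nltk
from nltk.tokenize import sent_tokenize

nltk.download("punkt", quiet=True)  # fetch the default Punkt models once

text = (
    "Em tais circunstâncias, o inadimplemento do devedor fiduciante enseja "
    "a aplicação da regra prevista nos arts. 26 e 27 da lei especial "
    "(REsp 1.871.911/SP, rel. Min. Nancy Andrighi, DJe 25/8/2020)."
)
for sentence in sent_tokenize(text, language="portuguese"):
    print(repr(sentence))
# Abbreviations such as "arts." and "Min." tend to be treated as sentence
# boundaries, fragmenting what is really a single sentence.
```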
@tstadel While doing the first draft for this feature, I came up with one idea that might be useful. The current PR allows specifying a custom models folder. What if, after this, we added the possibility of training the PunktTokenizer on Haystack Documents? The user could add some abbreviations that are more domain-specific, and just ask Haystack to train based on their Documents.
At the moment this is just an idea, but I imagine something like `PreProcessor.train(documents, common_abbreviations, language, output_folder)`. That output folder would then be the input for the PreProcessor's `tokenizer_model_folder`.

Does this make any sense?
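For illustration only, here is roughly what such a `train()` could wrap under the hood with NLTK's `PunktTrainer`. The helper name, its signature, and the `{language}.pickle` layout are assumptions for this sketch, not an existing Haystack API:

```python
import pickle
from pathlib import Path

from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer


def train_punkt(texts, common_abbreviations, output_folder, language="pt"):
    """Hypothetical helper mirroring the proposed PreProcessor.train()."""
    trainer = PunktTrainer()
    for text in texts:
        trainer.train(text, finalize=False)  # accumulate statistics over the corpus
    trainer.finalize_training()

    params = trainer.get_params()
    # Seed domain abbreviations explicitly (lowercase, without the trailing dot).
    params.abbrev_types.update(a.lower().rstrip(".") for a in common_abbreviations)

    model_path = Path(output_folder) / f"{language}.pickle"
    model_path.parent.mkdir(parents=True, exist_ok=True)
    with model_path.open("wb") as f:
        pickle.dump(PunktSentenceTokenizer(params), f)
    return model_path
```

The resulting folder could then be handed to the PreProcessor as `tokenizer_model_folder`.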
Can we close this now that #2783 has been merged?
Discussed in https://github.com/deepset-ai/haystack/discussions/2773
**Is your feature request related to a problem? Please describe.**
Today, the PreProcessor makes use of the NLTK PunktTokenizer. The default one is great, except in some specific domains, like the legal one, where there are many abbreviations that it mishandles.
**Describe the solution you'd like**
Introduce a parameter `tokenizer_model_folder` on PreProcessor representing a directory where custom models are stored per ISO language code. If a model for the given language is present in this folder, the PreProcessor would use it; if not, it would fall back to the default one. Since pre-processing is an NLP task closely tied to the domain of the text and the specific task, anyone could keep one folder for legal, another for medical, and so on, and when calling the PreProcessor set `model_folder=` accordingly.
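Hypothetical usage of that parameter, assuming the names discussed in this thread and one pickled Punkt model per ISO language code inside the folder:

```python
from haystack.nodes import PreProcessor
from haystack.schema import Document

# "models/legal" is a hypothetical folder holding e.g. pt.pickle;
# if no model exists for the configured language, the default one is used.
preprocessor = PreProcessor(
    split_by="sentence",
    split_length=3,
    language="pt",
    tokenizer_model_folder="models/legal",
)
docs = preprocessor.process([Document(content="Direito civil. ...")])
```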
**Describe alternatives you've considered**
Another idea might be to support different sentence tokenizers in general. This would however be more time-consuming, as passing anything other than primitives to a node's init is discouraged because it doesn't work with YAML definitions.
**Additional context**
@danielbichuetti tested the default models for NLTK, spaCy, and Stanza. The best default model for Portuguese (my scenario) was Stanza's. But NLTK could be improved a lot by running PunktTrainer on a small corpus of legal documents containing the relevant abbreviations. These sentence-splitting errors probably occur in other domains that use many periods inside sentences.
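For anyone who wants to rerun that comparison, a rough side-by-side sketch (each library needs its Portuguese model downloaded first, and exact splits will vary with package versions):

```python
# pip install nltk spacy stanza
# python -m spacy download pt_core_news_sm
import nltk
import spacy
import stanza

text = "... paste the decision excerpt from above ..."

nltk.download("punkt", quiet=True)
print(nltk.sent_tokenize(text, language="portuguese"))

nlp_spacy = spacy.load("pt_core_news_sm")
print([s.text for s in nlp_spacy(text).sents])

stanza.download("pt", verbose=False)
nlp_stanza = stanza.Pipeline(lang="pt", processors="tokenize", verbose=False)
print([s.text for s in nlp_stanza(text).sentences])
```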
In tests with GPT-3, which has a huge maximum token size, @danielbichuetti saw questions go unanswered even though the answers were present in the text, purely because of bad sentence splits. If a split breaks the legal grounds (the cited article) of a judicial decision, or the judge's name, models won't be able to infer correctly. In the legal domain, these abbreviations often carry very important information.