Open jhoetter opened 2 years ago
It would be possible to use NLTK for this.
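import nltk
text = "..."
# Requires the Punkt model data to be available locally.
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
# tokenize() returns a list of sentence strings; keep the list instead of joining it back together.
sentences = sent_detector.tokenize(text.strip())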
I think spaCy offers something similar. Since refinery uses spaCy under the hood, I'd recommend building a spaCy-based sentence tokenizer first :)
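A minimal sketch of what a spaCy-based version could look like, assuming spaCy v3 is installed. It uses the rule-based `sentencizer` component on a blank English pipeline, so no trained model has to be downloaded. The function name `split_sentences` is hypothetical and not part of refinery; a parser-based pipeline such as `en_core_web_sm` would likely handle complex sentences better than this punctuation rule.

```python
from typing import List

import spacy

# Blank English pipeline plus spaCy's rule-based sentence boundary component.
# Assumption: punctuation-based splitting is good enough for a first version.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def split_sentences(text: str) -> List[str]:
    """Return the sentences spaCy detects in `text` (hypothetical helper)."""
    doc = nlp(text.strip())
    # doc.sents yields Span objects; .text drops trailing whitespace.
    return [sent.text for sent in doc.sents if sent.text.strip()]


if __name__ == "__main__":
    for sentence in split_sentences("This is one sentence. Here is another! And a third?"):
        print(sentence)
```

Returning a list rather than a joined string keeps the sentence boundaries available to whatever consumes the output.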
There is also a BERT sentence detector, if I remember correctly: https://huggingface.co/sentence-transformers
In refinery 2.0/cognition, it will be really interesting to detect sentences even if they are rather complex, since this allows us to create better chunks for RAG (embedding lists)
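To illustrate how detected sentences could feed into chunking, here is a rough sketch that greedily packs sentences into chunks under a character budget. The function name `chunk_sentences` and the `max_chars` parameter are hypothetical, not part of refinery or cognition, and a real RAG setup might budget by tokens instead of characters.

```python
from typing import Iterable, List


def chunk_sentences(sentences: Iterable[str], max_chars: int = 500) -> List[str]:
    """Greedily group whole sentences into chunks of at most `max_chars` characters.

    A single sentence longer than `max_chars` is kept intact as its own chunk.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue
        # Account for the space that joins this sentence to the previous one.
        added = len(sentence) + (1 if current else 0)
        if current and current_len + added > max_chars:
            # Budget exceeded: close the current chunk and start a new one.
            chunks.append(" ".join(current))
            current, current_len = [], 0
            added = len(sentence)
        current.append(sentence)
        current_len += added
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    for chunk in chunk_sentences(["A b c.", "D e f.", "G h i."], max_chars=13):
        print(chunk)
```

Because chunks only ever break at sentence boundaries, each embedded chunk contains complete sentences, which is the benefit of running sentence detection before embedding.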
Please describe the module you would like to add to the content library
I have one large paragraph which contains multiple sentences, which I want to detect.
Do you already have an implementation?
-
Additional context
Use spaCy or something like detectormorse for this.