code-kern-ai / bricks

Open-source natural language enrichments at your fingertips.
Apache License 2.0
451 stars 23 forks

[MODULE] - Sentence extraction #54

Open jhoetter opened 2 years ago

jhoetter commented 2 years ago

Please describe the module you would like to add to the content library: I have one large paragraph that contains multiple sentences, and I want to detect the individual sentences.

Do you already have an implementation? -

Additional context: Use spaCy or something like detectormorse for this.

LeonardPuettmann commented 2 years ago

It would be possible to use NLTK for this.

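import nltk

# One-time download of the Punkt data; newer NLTK releases may need "punkt_tab" instead
nltk.download('punkt')

text = "..."

# Load the pre-trained English Punkt sentence tokenizer
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
# tokenize() returns a list of sentences; joining them again would undo the split
sentences = sent_detector.tokenize(text.strip())

jhoetter commented 2 years ago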

I think spaCy offers something similar. Since refinery uses spaCy under the hood, I'd recommend building a spaCy-based sentence tokenizer first :)
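
A minimal sketch of what a spaCy-based sentence splitter could look like, assuming spaCy v3 is installed. It uses a blank English pipeline with spaCy's rule-based `sentencizer` component, so no trained model is required. The function name `extract_sentences` is hypothetical and is not taken from the bricks or refinery code base.

```python
import spacy

# Blank English pipeline plus the rule-based "sentencizer" component,
# so no trained model has to be downloaded for this sketch.
nlp = spacy.blank("en")
nlp.add_pipe("sentencizer")


def extract_sentences(text: str) -> list[str]:
    """Return the sentences found in `text` as a list of strings.

    Hypothetical helper for illustration only.
    """
    doc = nlp(text.strip())
    return [sent.text.strip() for sent in doc.sents if sent.text.strip()]


paragraph = "This is a sentence. This is another one! Is this the third?"
for sentence in extract_sentences(paragraph):
    print(sentence)
```

The `sentencizer` splits on sentence-final punctuation only, so abbreviations such as "Dr." can produce false boundaries. A trained pipeline (for example `en_core_web_sm`, if it is installed) sets sentence boundaries from its parser instead and may handle such cases better, at the cost of a model download and slower processing.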

SvenjaKernAi commented 1 year ago

There is also a BERT sentence detector, if I remember correctly: https://huggingface.co/sentence-transformers

jhoetter commented 1 year ago

In refinery 2.0/cognition, it will be really interesting to detect sentences even if they are rather complex, since this allows us to create better chunks for RAG (embedding lists)
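
As a rough illustration of the chunking idea, the sketch below groups already-detected sentences into chunks that stay under a character budget, so that no chunk cuts a sentence in half. The function name `chunk_sentences` and the `max_chars` parameter are assumptions made for this example; they do not describe how refinery or cognition actually build their chunks.

```python
def chunk_sentences(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Greedily pack consecutive sentences into chunks of at most `max_chars`.

    Hypothetical helper: a sentence longer than `max_chars` is kept whole
    as its own chunk rather than being split.
    """
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for sentence in sentences:
        # The +1 accounts for the space that joins this sentence to the chunk.
        extra = len(sentence) + (1 if current else 0)
        if current and current_len + extra > max_chars:
            # The budget would be exceeded: close the chunk and start a new one.
            chunks.append(" ".join(current))
            current = [sentence]
            current_len = len(sentence)
        else:
            current.append(sentence)
            current_len += extra

    if current:
        chunks.append(" ".join(current))
    return chunks


sentences = ["A b c.", "D e f.", "G h i."]
print(chunk_sentences(sentences, max_chars=13))
```

Each resulting chunk could then be passed to an embedding model of your choice. The quality of the chunks depends directly on the sentence detector, since a missed or false boundary changes where chunks can begin and end.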