code-kern-ai / bricks

Open-source natural language enrichments at your fingertips.
Apache License 2.0

[MODULE] - Smalltalk Truncation #124

Open divyanshukatiyar opened 1 year ago

divyanshukatiyar commented 1 year ago

Description This module removes irrelevant text (smalltalk) from a passage or chat and returns the potentially relevant information.

Implementation

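import re
from nltk.corpus import stopwords

# Requires the NLTK stop word corpus (one-time: nltk.download('stopwords')).
# A set makes the membership checks below faster than a list.
sw = set(stopwords.words('english'))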
text = '''"Hello, Diana".
"Hello, Phill. how are you doing?"
"I am doing fine, and you?". 
"I am doing good as well!". 
"Listen, I wanted to talk to you about the divorce papers our lawyer set up. Basically, he needs our signatures on the documents set up.".
"The document says that you accept to pay me a monthly alimony of 2000 dollars and that I will have the custody of our child.".
"Have a nice day."'''
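# Each chat message is wrapped in double quotes; capture the text between them.
regex = re.compile(r'"(.*?)"')

for message in regex.findall(text):
    chat = message.split()
    # Lowercase and strip punctuation so that "Have" or "up." match the stop word list.
    stop_words = [token for token in chat if token.strip('.,!?;:').lower() in sw]
    new_text = [token for token in chat if token.strip('.,!?;:').lower() not in sw]
    # Keep a message of at least 3 tokens if more than half of them are content
    # words, or if it is long enough to contain more than 8 stop words.
    if len(chat) >= 3 and (len(new_text) > 0.5 * len(chat) or len(stop_words) > 8):
        print(" ".join(chat))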
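For testing, the same heuristic could be wrapped in a function that returns the kept messages instead of printing them. The following is only a minimal sketch: the names is_relevant and truncate_smalltalk are hypothetical, and STOP_WORDS is a small hand-picked stand-in for the NLTK stop word list so that the example runs without downloading any NLTK data. Tokens are lowercased and stripped of surrounding punctuation before the lookup.

```python
import re

# Hypothetical stand-in for nltk.corpus.stopwords.words('english'): a small
# hand-picked subset so the sketch runs without downloading NLTK data.
STOP_WORDS = {
    "i", "me", "my", "we", "our", "you", "your", "he", "she", "it", "they",
    "am", "is", "are", "was", "were", "be", "have", "has", "had", "do",
    "does", "did", "doing", "will", "a", "an", "the", "and", "but", "or",
    "as", "of", "at", "by", "for", "with", "about", "to", "from", "up",
    "in", "on", "that", "this", "how",
}


def is_relevant(message, stop_words=STOP_WORDS, min_tokens=3, max_stop_words=8):
    """Return True if a message looks like content rather than smalltalk."""
    tokens = message.split()
    if len(tokens) < min_tokens:
        return False
    # Lowercase and strip surrounding punctuation before the lookup.
    n_stop = sum(1 for t in tokens if t.strip(".,!?;:").lower() in stop_words)
    n_content = len(tokens) - n_stop
    # Mostly content words, or long enough to carry many stop words.
    return n_content > 0.5 * len(tokens) or n_stop > max_stop_words


def truncate_smalltalk(text):
    """Extract double-quoted messages and keep those judged relevant."""
    messages = re.findall(r'"(.*?)"', text)
    return [" ".join(m.split()) for m in messages if is_relevant(m)]
```

Calling truncate_smalltalk(text) on a chat string returns a list of the messages that pass the filter. The thresholds are exposed as keyword arguments of is_relevant so they can be tuned per dataset.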


LeonardPuettmann commented 1 year ago

The module uses a simple regex to extract sentences. It would be better to use NLTK for this.

For example:

import nltk

sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')
sent_detector.tokenize(text.strip())

Otherwise, sentences like "This is Dr. Smith and he lives in London." would get chopped up.
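The chopping problem can be reproduced with the standard library alone. The sketch below is an illustration, not how Punkt works internally: naive_split splits after every sentence-ending punctuation mark, and split_with_abbreviations re-joins pieces that end in a known abbreviation. Both function names and the ABBREVIATIONS set are hypothetical; NLTK's pretrained Punkt model handles abbreviations in a more general way than a hard-coded list.

```python
import re

# Hypothetical, illustrative abbreviation list; far from complete.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof."}


def naive_split(text):
    """Split after '.', '!' or '?' when followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


def split_with_abbreviations(text):
    """Like naive_split, but do not end a sentence on a known abbreviation."""
    sentences = []
    buffer = ""
    for piece in naive_split(text):
        if not piece:
            continue
        buffer = f"{buffer} {piece}".strip()
        # If the piece ends in an abbreviation, keep accumulating.
        if buffer.split()[-1].lower() in ABBREVIATIONS:
            continue
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


example = "This is Dr. Smith and he lives in London."
print(naive_split(example))               # two fragments, split after "Dr."
print(split_with_abbreviations(example))  # one sentence
```

A fixed list like this only covers the abbreviations it names, which is one reason to prefer a trained sentence tokenizer for real chat data.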

SvenjaKern commented 1 year ago

I found a small issue here:
→ refinery: change ATTRIBUTE = "text"
→ localhost: Error: Unprocessable Entity