This repository contains a very simple implementation of extractive text summarization. The summarizer partially implements the approach from this paper (without the boost factor): https://pdfs.semanticscholar.org/2df1/595bcbee37de1147784585a097f3a2819fdf.pdf
The code for the summarizer service can be found in the service folder. After creating the service, the project was hosted as a Flask API.
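As an illustration, a minimal Flask endpoint wrapping the summarizer might look like the sketch below. The route name, the JSON fields, and the `service.summarizer` import path are assumptions made for this example, not the actual service code.

```python
# Hypothetical Flask wrapper around the summarizer (illustrative sketch).
from flask import Flask, request, jsonify

# Assumption: summarize_text lives in a service.summarizer module;
# the actual layout of the service folder may differ.
from service.summarizer import summarize_text

app = Flask(__name__)

@app.route("/summarize", methods=["POST"])
def summarize():
    payload = request.get_json()
    text = payload["text"]                      # raw document to summarize
    num_sent = int(payload.get("num_sent", 3))  # number of sentences to return
    return jsonify({"summary": summarize_text(text, num_sent)})

if __name__ == "__main__":
    app.run()
```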
Each term is weighted by its relative frequency:

weight = (frequency of that term) / (total number of terms)

Each sentence then gets the sum of its words' weights as its score, and the n highest-ranked sentences make up the summary.
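For example, in a document with 200 non-stopword terms where the word "network" occurs 10 times, weight("network") = 10/200 = 0.05, so a sentence containing "network" twice gains 0.10 from that word alone (the numbers here are made up for illustration).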
Read a text document, or ask for input from the user. Here we create a function which takes the input text and returns the summarized text. The second argument to the function is the number of high-scoring sentences you want to extract.
```python
def summarize_text(text, num_sent):
    ...
    ...
    return summary
```
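Once the body is filled in with the steps below, the function could be called like this; the file name is a made-up example:

```python
# Hypothetical usage: read a document from disk and extract 3 sentences.
with open("article.txt") as f:  # "article.txt" is an example file name
    text = f.read()

summary = summarize_text(text, 3)
print(summary)
```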
The steps and code are shown below.
```python
import re

def preprocess(text):
    # Convert to lower case
    clean_text = text.lower()
    # Remove special characters
    clean_text = re.sub(r"\W", " ", clean_text)
    # Remove digits
    clean_text = re.sub(r"\d", " ", clean_text)
    # Collapse runs of whitespace into a single space
    clean_text = re.sub(r"\s+", " ", clean_text)
    # Return the clean text
    return clean_text
```
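A quick illustration of what preprocess does (the sample sentence is made up):

```python
print(preprocess("The 2nd quarter saw 15% growth!"))
# -> 'the nd quarter saw growth '
```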
We use sent_tokenize() provided by the nltk library to split the raw text into sentences. This runs on the original text rather than the cleaned one, since the tokenizer relies on the punctuation that preprocess() strips out:

```python
import nltk

sentences = nltk.sent_tokenize(text)
```
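If nltk has not been used before, the tokenizer models and stopword list may need to be downloaded once:

```python
import nltk

nltk.download("punkt")      # sentence/word tokenizer models
nltk.download("stopwords")  # stopword lists used in the next step
```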
Again, we use nltk, this time to filter out English stopwords and count how often each remaining word occurs in the cleaned text:

```python
stop_words = nltk.corpus.stopwords.words('english')

word_count_dict = {}
for word in nltk.word_tokenize(clean_text):
    if word not in stop_words:
        if word not in word_count_dict:
            word_count_dict[word] = 1
        else:
            word_count_dict[word] += 1
```
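Equivalently, the count could be written with collections.Counter; this is a stylistic alternative, not what the repository uses:

```python
from collections import Counter

# Count every non-stopword token in one pass; equivalent to the loop above.
word_count_dict = Counter(
    word for word in nltk.word_tokenize(clean_text) if word not in stop_words
)
```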
```python
# Find the total number of terms (not necessarily unique)
# = sum of the values in word_count_dict
total_terms = sum(word_count_dict.values())

# Normalize the word-frequency dictionary (weighted word count dictionary)
for key in word_count_dict:
    word_count_dict[key] = word_count_dict[key] / total_terms
```
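As a quick sanity check (not part of the original code), the normalized weights should now sum to roughly 1:

```python
# Every count was divided by the total, so the weights sum to 1.
assert abs(sum(word_count_dict.values()) - 1.0) < 1e-9
```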
Next, score each sentence by summing the weights of the words it contains:

```python
sentence_score_dict = {}
for sentence in sentences:
    for word in nltk.word_tokenize(sentence.lower()):
        if word in word_count_dict:
            # 25 was chosen arbitrarily, to skip very long sentences
            if len(sentence.split(' ')) < 25:
                if sentence not in sentence_score_dict:
                    sentence_score_dict[sentence] = word_count_dict[word]
                else:
                    sentence_score_dict[sentence] += word_count_dict[word]
```
Finally, use heapq to pick the n highest-ranked sentences:

```python
import heapq

best_sentences = heapq.nlargest(num_sent, sentence_score_dict, key=sentence_score_dict.get)
```
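Putting it all together, the whole pipeline reads as a single function. This is a sketch assembling the snippets above; the final join of the selected sentences into one string is an assumption about how summary is built, since the skeleton at the top only shows return summary:

```python
import heapq
import re

import nltk


def preprocess(text):
    # Lower-case, strip special characters and digits, collapse whitespace.
    clean_text = text.lower()
    clean_text = re.sub(r"\W", " ", clean_text)
    clean_text = re.sub(r"\d", " ", clean_text)
    clean_text = re.sub(r"\s+", " ", clean_text)
    return clean_text


def summarize_text(text, num_sent):
    # Split the raw text into sentences before any cleaning.
    sentences = nltk.sent_tokenize(text)

    # Build the normalized word-weight dictionary from the cleaned text.
    clean_text = preprocess(text)
    stop_words = nltk.corpus.stopwords.words('english')
    word_count_dict = {}
    for word in nltk.word_tokenize(clean_text):
        if word not in stop_words:
            word_count_dict[word] = word_count_dict.get(word, 0) + 1
    total_terms = sum(word_count_dict.values())
    for key in word_count_dict:
        word_count_dict[key] = word_count_dict[key] / total_terms

    # Score sentences by the summed weights of their words.
    sentence_score_dict = {}
    for sentence in sentences:
        for word in nltk.word_tokenize(sentence.lower()):
            if word in word_count_dict:
                if len(sentence.split(' ')) < 25:
                    sentence_score_dict[sentence] = (
                        sentence_score_dict.get(sentence, 0) + word_count_dict[word]
                    )

    # Keep the num_sent highest-scoring sentences.
    best_sentences = heapq.nlargest(
        num_sent, sentence_score_dict, key=sentence_score_dict.get
    )

    # Assumption: the summary is the selected sentences joined with spaces.
    summary = ' '.join(best_sentences)
    return summary
```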