adamfabish / Reduction

Reduction is a Python script that automatically summarizes a text by extracting the sentences deemed most important.

Reduction Question! #3

Open CleezyITP opened 9 years ago

CleezyITP commented 9 years ago

Hi there!

I think your reduction.py script is fabulous. I am using your code to help a blind woman summarize articles and make them readable by her screen reader. However, I am running into issues when I analyze/summarize large texts (4,000+ word articles). I can't figure out why this is happening. Is it the length or the unicode characters? Any help would be greatly appreciated.

thanks,

Claire

adamfabish commented 9 years ago

Hi Claire,

Thanks! I'm really pleased you've found such a great way to use the script.

The performance with large texts is due to the algorithm itself. It is an O(n^2) algorithm, if you're familiar with the computer science terminology, which means the running time grows quadratically with the input size: doubling the input roughly quadruples the time it takes.
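To make the O(n^2) point concrete, here's a rough illustration (not the script's actual code) of how the number of sentence pairs to compare grows:

```python
def pair_count(n):
    # a complete graph on n sentences has n * (n - 1) / 2 edges to weigh
    return n * (n - 1) // 2

print(pair_count(100))  # 4950 pairwise comparisons
print(pair_count(200))  # 19900: doubling the sentences roughly quadruples the work
```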

Having said that, I don't see any reason why it shouldn't be able to run on ~4000 word texts.

I'll spend some time on it this weekend and try to improve the performance for large texts, and let you know what I come up with.

Cheers, Adam

CleezyITP commented 9 years ago

Absolutely fantastic, Adam! I really appreciate you responding and looking into this. The .txt file that I tried to analyze is 4,647 words (25,386 characters); when I added a counter to "def buildGraph(sentences)", the run took over 5 minutes and a summary was never printed.

Your summarizing script is great compared to others that I have encountered so far. I am meeting with a friend today to talk about adding a "chunking" function to the script so that it only analyzes portions of the text at a time.

Let me know what you find this weekend!

Claire


Claire Kearney-Volpe MPS Candidate, ITP (NYU)

adamfabish commented 9 years ago

Hi Claire,

I've fixed a bug that was causing very poor performance, fixed some other bugs, and done some refactoring. Feel free to try the latest version and see if it works better for you now. Note that I've also changed how you use the code (see the README).

The way the algorithm works is by creating a graph where the vertices are the sentences in the text, and the edges connect every sentence to every other sentence and are weighted by the similarity of those two sentences. The sentences with the highest total weight are those that are deemed to be the most important, because they contain the most content in common with the rest of the text. It's important to know how a sentence relates to every other sentence in the text to work out how central it is to the text. For that reason I wouldn't recommend summarising a text in chunks - that would only tell you how important a sentence is in relation to the other sentences in that chunk.
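A minimal sketch of that idea (illustrative only; the tokenisation and word-overlap similarity measure here are simplified stand-ins, not the actual reduction.py implementation):

```python
from itertools import combinations

def similarity(s1, s2):
    # crude word-overlap similarity between two sentences (illustrative only)
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    if not w1 or not w2:
        return 0.0
    return len(w1 & w2) / (len(w1) + len(w2))

def rank_sentences(sentences):
    # build a complete graph: every sentence is compared with every other
    # sentence, which is where the O(n^2) cost comes from
    scores = {i: 0.0 for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        w = similarity(sentences[i], sentences[j])
        scores[i] += w
        scores[j] += w
    # sentences with the highest total edge weight are deemed most central
    return sorted(scores, key=scores.get, reverse=True)

sentences = [
    "The cat sat on the mat.",
    "A cat was sitting on a mat.",
    "Stock prices fell sharply today.",
]
order = rank_sentences(sentences)  # the off-topic sentence ranks last
```

Because every sentence's score depends on its similarity to all the others, chunking the input changes the scores: a sentence can only be judged central relative to the chunk it lands in, which is the drawback described above.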

If you have any issues with the new version just let me know.

Cheers, Adam

CleezyITP commented 9 years ago

Hi Adam,

Thanks for working on this! Embarrassingly, I am having trouble running the new script.

I put

    from reduction import *
    reduction = Reduction()
    text = open('test.txt').read()
    reduction_ratio = 0.5
    reduced_text = reduction.reduce(text, reduction_ratio)

at the top of the script and I am getting an error that Reduction() is not defined. Am I instantiating the class in the wrong way?

Thanks again,

Claire


adamfabish commented 9 years ago

Hi Claire,

Try putting the reduction.py file in the same folder as your script.
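For context, Python resolves `from reduction import *` by searching the directories on `sys.path`, and the first entry is the folder containing the script you ran. A quick standard-library check (illustrative):

```python
import os
import sys

# sys.path[0] is the directory of the running script; an empty string
# means the current working directory
search_dirs = [p or os.getcwd() for p in sys.path]

# report whether the import system would find reduction.py
found = any(os.path.isfile(os.path.join(d, "reduction.py")) for d in search_dirs)
print("reduction.py importable:", found)
```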

Cheers, Adam

CleezyITP commented 9 years ago

Got it! Thanks so much!
