Strumenta / SmartReader

SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla
https://smartreader.inre.me
Apache License 2.0
156 stars, 35 forks

Adding support for automatic summarization #17

Open gabriele-tomassetti opened 4 years ago

gabriele-tomassetti commented 4 years ago

The library can extract any manual excerpt contained in the article (i.e., the short summary that is usually shown on Facebook or Twitter). However, it would also be useful to generate an automatic summary for long articles. The issue is that there does not seem to be anything really effective and light on resources to do that, so the end result may vary in quality.

Mochitto commented 1 month ago

An LLM could likely help with this. A simple solution is to let users configure the reader with their own API key and call the OpenAI API to get summaries.

gabriele-tomassetti commented 1 month ago

We should probably implement a basic interface to let people choose how to obtain the summary, like we do for converting to plain text. That way, users can choose to use an LLM.

Just like the original library, this was designed for people who want a light, privacy-oriented solution to get an article free of clutter. So, I do not think that an LLM would be a good fit for integration in this library. To be fair, I never found a good way to do this algorithmically, which is why we should give users a simple way to do what they want.
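To make the "basic interface" idea concrete, here is a minimal sketch in Python (not the library's own language, and not its actual API). Every name in it (`Summarizer`, `first_sentences`, `Article`) is hypothetical and only illustrates the shape: the reader takes any function from text to summary, ships a trivial default, and lets users plug in whatever they prefer, including a call to an LLM.

```python
import re
from typing import Callable

# Hypothetical type: any function that maps article text to a summary.
Summarizer = Callable[[str], str]


def first_sentences(text: str, count: int = 2) -> str:
    """Naive default strategy: return the first `count` sentences."""
    # Split after '.', '!' or '?' followed by whitespace (Latin-script assumption).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:count])


class Article:
    """Hypothetical stand-in for an extracted article."""

    def __init__(self, text: str, summarizer: Summarizer = first_sentences):
        self.text = text
        self.summarizer = summarizer

    def summary(self) -> str:
        # Delegate to whichever strategy the user configured.
        return self.summarizer(self.text)
```

With this shape, opting in to an LLM is just passing a different function, for example `Article(text, summarizer=my_llm_summary)`, where `my_llm_summary` is user code that holds the API key. The library itself never needs to know about any remote service.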

Mochitto commented 1 month ago

A fair concern that I totally agree with. I was also thinking of it as opt-in and at the discretion of users (since they would be using their own API keys).

An extraction-based algorithm could be added, which ranks the sentences and picks out the most important parts of the text, but quality would probably vary a lot and complexity could spiral out of control, especially with i18n in mind.
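For reference, a rough sketch of what such a sentence-ranking approach could look like, in Python. This is a simple word-frequency heuristic chosen for illustration, not a proposal for the final algorithm; the function names are made up. It also shows where the i18n trouble starts: the sentence splitter assumes `.`, `!` and `?` as terminators, and there is no stop-word handling, which would need a list per language.

```python
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    # Assumption: sentences end with '.', '!' or '?' followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def extractive_summary(text: str, max_sentences: int = 2) -> str:
    """Keep the sentences whose words are, on average, most frequent in the text."""
    sentences = split_sentences(text)
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        tokens = re.findall(r"\w+", sentence.lower())
        if not tokens:
            return 0.0
        # Average frequency, so long sentences are not favoured automatically.
        return sum(freq[t] for t in tokens) / len(tokens)

    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    # Restore document order for the selected sentences.
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Without stop-word filtering, common function words dominate the scores on real articles, which is one reason the quality would vary so much.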

Maybe some nice text-analysis visualization tools could help readers skim through the text more quickly, instead of creating summaries (even a simple highlight of the longest sentences, or of field-specific terms based on frequency scores).
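As a small illustration of the frequency-based highlighting idea, here is a Python sketch. The names `top_terms` and `highlight`, the `**` marker, and the minimum word length used as a crude stand-in for stop-word filtering are all assumptions made for the example.

```python
import re
from collections import Counter


def top_terms(text: str, count: int = 3, min_length: int = 4) -> list[str]:
    """Return the `count` most frequent words with at least `min_length` characters."""
    words = [w for w in re.findall(r"\w+", text.lower()) if len(w) >= min_length]
    return [term for term, _ in Counter(words).most_common(count)]


def highlight(text: str, terms: list[str], marker: str = "**") -> str:
    """Wrap every whole-word occurrence of the given terms in `marker`."""
    if not terms:
        return text
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{marker}{m.group(0)}{marker}", text)
```

Usage would be something like `highlight(text, top_terms(text))`. A real version would emit HTML (for example `<mark>` elements) rather than text markers, and would need a better notion of "field-specific" than raw frequency.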

Abstractive summarization, with privacy in mind, could be implemented in a few years, if we get LLMs that run locally.