Anonyfox / meteor-scrape

Scrape any Website or RSS/Atom-Feed with ease.
GNU Lesser General Public License v3.0

Optimize Readability #10

Open Anonyfox opened 9 years ago

Anonyfox commented 9 years ago

The readability module currently in use often returns garbage for parsed HTML pages. This leads not only to unusable fulltext properties, but also to badly wrong matches in the tagging engine and, sometimes, poor summary texts.

A custom, optimized readability algorithm is needed: one that is more accurate than the current implementation and as fast as possible (<100 ms on ordinary hardware for common websites).
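
As a starting point for discussion, here is a minimal sketch of one possible approach, assuming a simple text-density and link-density heuristic. It is not the module currently in use and not a settled design. All names (`stripNoise`, `scoreBlock`, `extractMainText`, the `minScore` threshold) are hypothetical, and the regex-based tag handling is an assumption made to keep the example dependency-free; a real implementation would probably want a proper HTML parser and entity decoding.

```typescript
// Hypothetical sketch: none of these names exist in meteor-scrape.

interface Candidate {
  text: string;
  score: number;
}

// Remove comments and elements that rarely contain article text.
function stripNoise(html: string): string {
  return html
    .replace(/<!--[\s\S]*?-->/g, " ")
    .replace(/<(script|style|nav|header|footer|aside|form)\b[\s\S]*?<\/\1>/gi, " ");
}

// Drop all tags and collapse whitespace (entities are not decoded here).
function toText(fragment: string): string {
  return fragment.replace(/<[^>]+>/g, " ").replace(/\s+/g, " ").trim();
}

// Score a block: long, comma-rich text scores high, link-heavy text scores low.
function scoreBlock(block: string): Candidate {
  const text = toText(block);
  if (text.length === 0) {
    return { text, score: 0 };
  }
  const anchors: string[] = block.match(/<a\b[^>]*>[\s\S]*?<\/a>/gi) ?? [];
  const linkText = anchors.map(toText).join(" ");
  const linkDensity = Math.min(1, linkText.length / text.length);
  const commas = (text.match(/,/g) ?? []).length;
  const score = (text.length / 100 + commas) * (1 - linkDensity);
  return { text, score };
}

// Split at common block-level tags and keep blocks scoring at least minScore.
function extractMainText(html: string, minScore: number = 1): string {
  return stripNoise(html)
    .split(/<\/?(?:p|div|section|article|td|li|h[1-6]|br)\b[^>]*>/i)
    .map(scoreBlock)
    .filter((candidate) => candidate.score >= minScore)
    .map((candidate) => candidate.text)
    .join("\n\n");
}

// Made-up sample page, used only to exercise the sketch.
const sample = `
<html><body>
<nav><a href="/">Home</a> <a href="/about">About</a></nav>
<div class="sidebar"><a href="/a">Related story one</a> <a href="/b">Related story two</a></div>
<p>The scraper fetches a page, removes navigation and scripts, and then scores every remaining block of text, so that only the main article body survives.</p>
<p>Short.</p>
<script>var tracking = "ignored, ignored, ignored, ignored, ignored, ignored";</script>
</body></html>`;

// Rough timing, to compare against the <100 ms goal.
const started = Date.now();
const mainText = extractMainText(sample);
const elapsedMs = Date.now() - started;
console.log(mainText);
console.log(`extracted in ${elapsedMs} ms`);
```

The sketch makes a single pass over the markup with a handful of regular expressions, which keeps the cost roughly linear in page size. Accuracy depends heavily on the scoring weights and the threshold, so those would need tuning against real pages.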