Collection of content - Githubissues

SotirisFtiakas / Search-Engine-Creation

This is a fully functional search engine, with a crawler, inverted indexer, and query processor. It also supports user feedback, refreshing results based on the pages that the user found useful.


Collection of content #4

Open GregB712 opened 3 years ago

GregB712 commented 3 years ago

We need a better way to collect the main part of each page (preferably using the Beautiful Soup package). We currently use the trafilatura library.
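To make the request concrete, here is a minimal sketch of the kind of tag-based heuristic that "collecting the main part of a page" usually means. It is not the project's code and not how trafilatura works internally. To stay dependency-free it uses the standard library's `html.parser` instead of Beautiful Soup; the same idea carries over to Beautiful Soup's parse tree. The names `SKIP_TAGS`, `KEEP_TAGS`, and `extract_main_text` are hypothetical, and the choice of which tags count as boilerplate is an assumption.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is boilerplate and should be dropped.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside", "form"}
# Assumption: text inside these tags is the main content worth keeping.
KEEP_TAGS = {"p", "h1", "h2", "h3", "li"}


class _MainTextParser(HTMLParser):
    """Collects whitespace-normalized text blocks from KEEP_TAGS elements."""

    def __init__(self):
        super().__init__()
        self.blocks = []       # finished text blocks, in document order
        self._current = None   # pieces of the block being built, or None
        self._skip_depth = 0   # > 0 while inside any SKIP_TAGS element

    def flush(self):
        # Close the block in progress and store it if it has visible text.
        if self._current is not None:
            text = " ".join("".join(self._current).split())
            if text:
                self.blocks.append(text)
        self._current = None

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in KEEP_TAGS and self._skip_depth == 0:
            self.flush()
            self._current = []

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in KEEP_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip_depth == 0 and self._current is not None:
            self._current.append(data)


def extract_main_text(html: str) -> str:
    """Return the kept text blocks joined by newlines."""
    parser = _MainTextParser()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n".join(parser.blocks)
```

A heuristic like this is brittle on pages that put their content in `<div>` elements or leave tags unclosed, which is one reason dedicated extractors exist.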

adbar commented 3 years ago

Hi @GregB712, I just stumbled upon this page while looking for issues related to the library. I'm the developer of trafilatura, may I ask on which kind of pages the main texts were extracted properly? I'm curious to find bugs and trickier examples...