Open Uzay-G opened 2 years ago
It's not difficult to implement that way, but I'm afraid you won't get a big improvement in parsing time (typical article processing time is currently 0.1-0.4 s per page), nor is it reliable. To be more precise:
Oh I see. How could I use readability to check whether a webpage actually has interesting content?
That is, a check that an actual article passes and something like the Google homepage doesn't.
The main check should be whether there's something to read: text at least 300 chars long, ideally 500+ chars. You can check this after processing with readability: just convert the result to text and check the length.
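A minimal sketch of that length check, using only the Python standard library. It assumes you already have the HTML that readability produced for the page as a string; the names `has_enough_text` and `min_chars` are made up for illustration and are not part of any library.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace so markup indentation is not counted.
        return " ".join(" ".join(self._chunks).split())


def has_enough_text(article_html, min_chars=300):
    """Hypothetical helper: True if the extracted text has at least min_chars characters."""
    parser = _TextExtractor()
    parser.feed(article_html)
    parser.close()
    return len(parser.text()) >= min_chars
```

With `min_chars=500` the check is stricter, matching the "ideally 500+ chars" suggestion. A page like a search homepage, which is mostly form controls and scripts, would normally fall well below either threshold.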
How difficult would it be to implement
isProbablyReaderable(doc, options)
(from https://github.com/mozilla/readability#isprobablyreaderabledocument-options)? This would make it possible to check whether a webpage is actually interesting / relevant for scraping, and would save processing time.
Would this be hard to implement? I could also try working on it.