isProbablyReaderable - Githubissues

buriy / python-readability

fast python port of arc90's readability tool, updated to match latest readability.js!

https://github.com/buriy/python-readability

Apache License 2.0

2.67k stars, 349 forks

isProbablyReaderable #174

Open Uzay-G opened 2 years ago

Uzay-G commented 2 years ago

How difficult would it be to implement isProbablyReaderable(doc, options) (from https://github.com/mozilla/readability#isprobablyreaderabledocument-options)?

This would make it possible to check whether a webpage is actually interesting or relevant for scraping, and to save time on pages that aren't.

Would this be hard to implement? I could also try working on it.

buriy commented 2 years ago

It's not difficult to implement that way, but I'm afraid you won't get any big improvement in parsing time (typical article processing time is currently 0.1-0.4 s per page), nor would it be reliable. To be more precise:

If you use the minScore check, the readability algorithm is exactly the same, just without the cleaning phase, so it will take almost the same time.
If you only check the HTML, it's completely unreliable.

Uzay-G commented 2 years ago

Oh, I see. How could I use readability to check whether a webpage actually has interesting content?

An actual article should pass this check, and something like the Google homepage shouldn't.

buriy commented 2 years ago

The main check should be whether there's something to read: text at least 300 characters long, ideally 500+. You can check this after processing with readability: just convert the result to text and check the length.
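
A minimal sketch of that length check, under a few assumptions: the helper names (html_to_text, has_enough_text, looks_like_article) are hypothetical and not part of python-readability, the text conversion uses the standard library's html.parser rather than anything the library provides, and the 300-character default just mirrors the threshold mentioned above. Only Document(...).summary() comes from python-readability itself.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse whitespace so markup indentation does not inflate the count.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(fragment):
    """Hypothetical helper: strip tags from an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(fragment)
    parser.close()
    return parser.text()


def has_enough_text(article_html, min_chars=300):
    """True if the extracted text is at least min_chars long (assumed threshold)."""
    return len(html_to_text(article_html)) >= min_chars


def looks_like_article(page_html, min_chars=300):
    """Run readability first, then apply the length check to its output."""
    # Imported lazily so the helpers above work without the package installed.
    from readability import Document

    return has_enough_text(Document(page_html).summary(), min_chars)
```

With this, looks_like_article(page_html, min_chars=500) would apply the stricter 500-character variant. Since readability still runs in full, this only answers "is there something to read?" and does not save processing time.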