Open monajalal opened 6 years ago
Scraping websites that are not machine readable is an inexact science, and sometimes that means you'll get garbage. Newspaper is only as good as its ability to spot common patterns across pages. If you don't want .ico files, I suggest filtering them out by checking the file extension and choosing another image from Article.images. I agree that an .ico should probably not be the main image, so a check could be added to this part of the algorithm. Feel free to put in a pull request.
The relevant lines, in order:
Smart setting: https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/article.py#L443
Logic that actually determines which image is chosen (add your patch here): https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/images.py#L170
Dumb setting: https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/article.py#L449
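
Until such a check lands in the library, the extension filter suggested above can be done on the caller's side. Here is a minimal sketch, not the library's own logic: `is_icon`, `pick_main_image` and `ICON_EXTENSIONS` are hypothetical names made up for illustration, and the only assumption about Newspaper is that `top_image` is a URL string and `images` is an iterable of URL strings.

```python
from os.path import splitext
from urllib.parse import urlparse

# Hypothetical set of extensions to reject; extend it to suit your sites.
ICON_EXTENSIONS = {".ico"}


def is_icon(url):
    """Return True if the URL's path ends in a rejected extension."""
    # Use only the path so query strings like "?v=2" are ignored.
    path = urlparse(url).path.lower()
    return splitext(path)[1] in ICON_EXTENSIONS


def pick_main_image(top_image, images):
    """Keep top_image unless it is an icon; otherwise fall back to the
    first non-icon URL in images, or None if there is none."""
    if top_image and not is_icon(top_image):
        return top_image
    for url in images:
        if not is_icon(url):
            return url
    return None


# Intended use (assumes an Article that was already downloaded and parsed):
#   main = pick_main_image(article.top_image, article.images)
```

This only catches icons by extension. A logo served as a JPEG or PNG passes straight through, so it does not replace a fix in the image-selection logic itself. If `images` is an unordered collection, the fallback is an arbitrary non-icon image, not necessarily the best one.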
Why would the API return .ico files and other kinds of icons and logos as the main image? Is there a way to prevent this? Logos and icons are not always in .ico format; sometimes they are JPEG or PNG. What is your methodology for detecting the main image of an article, and where do you think this flaw stems from?