codelucas / newspaper

newspaper3k is a news, full-text, and article metadata extraction library for Python 3. Advanced docs:
https://goo.gl/VX41yK
MIT License
14.15k stars 2.12k forks

Why .ico and other types of icon are downloaded as main image? #543

Open monajalal opened 6 years ago

monajalal commented 6 years ago

Why would the API return .ico files and other types of icons and logos as the main image? Is there a way to prevent this? Logos and icons are not always in .ico format; sometimes they are JPEG or PNG. What is your method for detecting the main image of an article, and where do you think this flaw comes from?

awiebe commented 6 years ago

Scraping websites that are not machine-readable is an inexact science, and sometimes that means you'll get garbage. Newspaper is only as good as its ability to find common patterns across pages. If you don't want .ico files, I suggest filtering them out by checking the file extension and choosing another image from Article.images. I agree that an .ico should probably not be the main image, so maybe a check should be added to this part of the algorithm. Feel free to put in a pull request.

The relevant lines, in order:

Parsing https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/article.py#L275

Smart setting https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/article.py#L443

Logic that actually decides which image is chosen (add your patch here) https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/images.py#L170

Dumb setting https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/article.py#L449
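
As a rough illustration of the client-side workaround suggested above (filter by extension, then fall back to another image), here is a minimal sketch. It is not newspaper's own logic: `looks_like_icon`, `pick_main_image`, and the default extension and name-hint lists are hypothetical helpers and assumptions made for this example. It only relies on `article.top_image` being a URL string and `article.images` being an iterable of URL strings.

```python
import posixpath
from urllib.parse import urlparse

# Assumed defaults for illustration only; tune both for the sites you scrape.
ICON_EXTENSIONS = (".ico",)
ICON_NAME_HINTS = ("favicon", "logo")


def looks_like_icon(url, extensions=ICON_EXTENSIONS, name_hints=ICON_NAME_HINTS):
    """Return True if the URL path has an icon extension or an icon-like file name."""
    # Use only the path so query strings such as "?v=2" do not hide the extension.
    path = urlparse(url).path.lower()
    filename = posixpath.basename(path)
    has_icon_ext = path.endswith(tuple(extensions))
    has_icon_name = any(hint in filename for hint in name_hints)
    return has_icon_ext or has_icon_name


def pick_main_image(top_image, images):
    """Keep top_image unless it looks like an icon; otherwise use the first
    candidate that does not look like one. Returns '' if nothing qualifies."""
    if top_image and not looks_like_icon(top_image):
        return top_image
    for candidate in images:
        if candidate and not looks_like_icon(candidate):
            return candidate
    return ""


# Usage with newspaper (requires network access, so left as a comment):
#   from newspaper import Article
#   article = Article(url)
#   article.download()
#   article.parse()
#   main_image = pick_main_image(article.top_image, article.images)
```

The name hints are a crude substring heuristic for logos served as JPEG or PNG, so they can misfire on legitimate file names. If `article.images` is a set, the fallback order is not deterministic.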