Why are .ico files and other icon types downloaded as the main image?

Scraping websites that aren't machine readable is an inexact science, and sometimes that means you'll get garbage. Newspaper is only as good as its ability to find common patterns across different pages. If you don't want .ico files, I suggest you filter them out by checking the file extension and choosing another image from Article.images. I agree that an .ico should probably not be the main image, so maybe a check should be added to this part of the algorithm. Feel free to put in a pull request.
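For the workaround on your side, here is a minimal sketch of the extension check. It is not part of Newspaper: `ICON_EXTENSIONS`, `is_icon`, and `pick_main_image` are hypothetical names made up for illustration, and the fallback simply takes the first non-icon candidate in sorted order, which says nothing about image quality.

```python
import os
from urllib.parse import urlparse

# Hypothetical list of extensions to treat as icons; adjust as needed.
ICON_EXTENSIONS = {".ico", ".cur"}


def is_icon(url):
    """Return True if the URL's path ends in one of the icon extensions."""
    path = urlparse(url).path  # drops any query string such as ?v=2
    ext = os.path.splitext(path)[1].lower()
    return ext in ICON_EXTENSIONS


def pick_main_image(top_image, images):
    """Keep top_image unless it is an icon, else fall back to another image.

    Returns an empty string when no non-icon candidate exists.
    """
    if top_image and not is_icon(top_image):
        return top_image
    # Sort so the result is deterministic even if images is a set.
    for url in sorted(images):
        if not is_icon(url):
            return url
    return ""
```

After `article.download()` and `article.parse()`, you would call something like `pick_main_image(article.top_image, article.images)` and use the result in place of `article.top_image`.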

The relevant lines, in order:

Parsing: https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/article.py#L275

Smart setting: https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/article.py#L443

Logic that actually determines which image is chosen (add your patch here): https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/images.py#L170

Dumb setting: https://github.com/codelucas/newspaper/blob/c521057b20bb3d4cd27d8b0ee6efd64d1d3a488f/newspaper/article.py#L449

codelucas/newspaper, issue #543