ComputerGhost / FaviconFetcher

Scan a webpage for favicons, or just easily download the one you want.
MIT License

Add information regarding importance of setting UserAgent to documentation #27

Open kiddailey opened 5 months ago

kiddailey commented 5 months ago

Description

A surprising number of sites return HTTP errors when you attempt to scan or fetch. Twitter.com (X.com) in particular returns a 400 Bad Request. Some other sites return a 403 Forbidden error, quite often when they are served through third-party providers like Cloudflare that have automation protection enabled, or when the request goes through a VPN.

Issue

The result of these types of requests is that only the default favicon.ico is queued, regardless of whether alternatives are available. There are cases where the favicon.ico exists but not in the standard location, and because the request was blocked, we can't know the real location.

Many of these issues simply boil down to one thing: filtering against the UserAgent.

As previously noted in issue #12, some sites simply block "Fetch" in the UserAgent. Others, like Twitter, have more complicated rules. Twitter seems to require "Mozilla", the operating system, and some combination of browser specifiers whose exact rules are unknown. For example:

Doesn't work with Twitter:

Works with Twitter:

When "Gecko" is specified alone, we can simply add FaviconFetcher's user agent to the end. If "AppleWebKit" is specified, though, we must include the Chrome specifier "(KHTML, like Gecko) Chrome/114.0.0.0". It appears that a valid Chrome version must be used as well (i.e. 1.0.0.0 won't work). There are other odd rules too (e.g. on Windows, the revision number seems to be required for Gecko but not for AppleWebKit). Interestingly, the Safari user agents I tested with don't work at all.
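To make the rules above concrete, here is a minimal Python sketch that composes a browser-like UserAgent string from the parts described. It is not FaviconFetcher code: `build_user_agent` is a hypothetical helper, the platform and engine tokens are supplied by the caller, and appending the library token last in the AppleWebKit case is an assumption. None of the resulting strings have been verified against Twitter here.

```python
# Values quoted in this issue.
LIBRARY_TOKEN = "FaviconF3tcher/1.2"
CHROME_SPEC = "(KHTML, like Gecko) Chrome/114.0.0.0"


def build_user_agent(platform: str, engine: str,
                     library_token: str = LIBRARY_TOKEN) -> str:
    """Compose a browser-like User-Agent string (hypothetical helper).

    platform: OS token placed inside the parentheses, chosen by the caller.
    engine:   engine token such as "Gecko/..." or "AppleWebKit/...".
    """
    parts = ["Mozilla/5.0", f"({platform})", engine]
    if engine.startswith("AppleWebKit"):
        # Per the observation above, AppleWebKit also needs the Chrome specifier.
        parts.append(CHROME_SPEC)
    # Assumption: the library token goes last in both cases.
    parts.append(library_token)
    return " ".join(parts)
```

For example, `build_user_agent("Windows NT 10.0; rv:109.0", "Gecko/20100101")` yields a Gecko-style string with the library token at the end. The platform and revision values are illustrative only.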

Solution

I don't think the library needs to do any more than it already does -- allow the caller to define the UserAgent and leave the UserAgent as just "FaviconF3tcher/1.2".

However, it would be nice if the documentation included some general information about this. The problem isn't obvious unless you step through the scanner/fetcher code itself and watch the HTTP responses directly. In other words, from a calling app it's impossible to deduce why you are getting only the default favicon.ico.
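As a rough illustration of the kind of check the documentation could suggest, the following Python sketch requests a page with a chosen UserAgent and reports the HTTP status, so a 400 or 403 shows up without stepping through the library. It uses only the Python standard library and is independent of FaviconFetcher; `make_request`, `probe`, and `explain_status` are hypothetical names, and the mapping from status code to cause is a heuristic based on the behaviour described in this issue.

```python
import urllib.error
import urllib.request


def make_request(url: str, user_agent: str) -> urllib.request.Request:
    """Build a request that carries the given User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def probe(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code the site sends for this User-Agent."""
    try:
        with urllib.request.urlopen(make_request(url, user_agent),
                                    timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx; the status is in err.code.
        return err.code


def explain_status(status: int) -> str:
    """Heuristic hint only; a 400/403 can have other causes as well."""
    if status in (400, 403):
        return "possibly blocked by UserAgent filtering; try a browser-like UserAgent"
    if 200 <= status < 300:
        return "ok"
    return "other HTTP error"
```

Calling `probe` once with the library default and once with a browser-like string, then comparing the two statuses, is a quick way to tell whether UserAgent filtering is the likely cause.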

kiddailey commented 5 months ago

I'll work something up for this and submit a PR when I can.