gaffling / PHP-Grab-Favicon

🖼 Saves the favicon of the given URL and returns the image path.
http://suchmaschine.biz
MIT License

Roadmap: More Enhancements in Development #9

Open LeeThompson opened 1 year ago

LeeThompson commented 1 year ago

Status:

June 23rd 2023: I haven't been able to do much work this week due to some unexpected household emergencies; I should be back at it next week.

202306161401

202306121848:

Some notes on this:

Having our own image identification is important in case the PHP installation is limited (for whatever reason); going by file extension is still the last resort.

The method used for this is looking for the "signature" of the image file. Most image formats have a header with signature data for use by software trying to open the file (this is also called a "magic number"). The new code knows the PNG, GIF, JPEG, WEBP, BMP and ICO formats.

Some image formats are easier to identify than others. For example, the PNG format's "magic" is \x89PNG\r\n\x1A\n, which is long and distinctive. BMP and ICO have very simple identifiers, so false positives are much more likely, which is why I've been adding a "certainty" rating. Eventually you'll be able to set a minimum acceptable "certainty" and reject possibly invalid files. (You can currently set it but nothing looks at it.)
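
To illustrate the signature idea, here is a minimal sketch of how such a check could look. It is not the actual getMIMETypeFromBinary code: the function name sniffImageMime, the return shape and the confidence labels other than "certain" are assumptions made for this example.

```php
<?php
// Hypothetical sketch: identify an image type from its leading bytes.
// The function name, return shape and confidence labels are assumptions.
function sniffImageMime(string $bytes): ?array
{
    // Long, distinctive signatures: a false positive is very unlikely.
    if (strncmp($bytes, "\x89PNG\r\n\x1A\n", 8) === 0) {
        return ['mime' => 'image/png', 'confidence' => 'certain'];
    }
    if (strncmp($bytes, 'GIF87a', 6) === 0 || strncmp($bytes, 'GIF89a', 6) === 0) {
        return ['mime' => 'image/gif', 'confidence' => 'certain'];
    }
    // WEBP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if (strncmp($bytes, 'RIFF', 4) === 0 && substr($bytes, 8, 4) === 'WEBP') {
        return ['mime' => 'image/webp', 'confidence' => 'certain'];
    }
    // JPEG starts with the SOI marker FF D8, followed by another FF marker byte.
    if (strncmp($bytes, "\xFF\xD8\xFF", 3) === 0) {
        return ['mime' => 'image/jpeg', 'confidence' => 'high'];
    }
    // Very short signatures: unrelated data can easily start the same way.
    if (strncmp($bytes, 'BM', 2) === 0) {
        return ['mime' => 'image/bmp', 'confidence' => 'low'];
    }
    if (strncmp($bytes, "\x00\x00\x01\x00", 4) === 0) {
        return ['mime' => 'image/x-icon', 'confidence' => 'low'];
    }
    return null; // unknown signature
}
```

A caller could compare the returned confidence against a configured minimum and reject anything below it, which is the direction described above.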

Here's some sample trace logging showing this in action:

2023-06-12 18:47:21 [TRACE] [grap_favicon(20):listIcons:getMIMETypeFromFile] pathname='icons/whatsapp.png', content_type=image/png, confidence=certain, method=signature

Ideally, if everything is available to get-fav.php, the following methods are used, in order:

  1. The content-type returned by the server (remote only)
  2. FileInfo
  3. mime_content_type (local files only)
  4. exif_imagetype (and image_type_to_mime_type if available)
  5. getMIMETypeFromBinary (the new fallback function using "magic")
  6. file extension
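
The ordering above can be sketched as a list of detectors tried one after another. This is only an illustration for local files under stated assumptions: the function name detectLocalImageMime is hypothetical, step 1 (the server's content-type) is left out because it only applies to remote downloads, the signature step is trimmed to PNG and GIF, and the rule "accept only image/* answers" is my own simplification.

```php
<?php
// Hypothetical sketch of an ordered detection chain for a local file.
function detectLocalImageMime(string $path): ?array
{
    $detectors = [
        'fileinfo' => function (string $p): ?string {
            if (!class_exists('finfo')) {
                return null; // FileInfo extension not installed
            }
            $mime = (new finfo(FILEINFO_MIME_TYPE))->file($p);
            return is_string($mime) ? $mime : null;
        },
        'mime_content_type' => function (string $p): ?string {
            if (!function_exists('mime_content_type')) {
                return null;
            }
            $mime = mime_content_type($p);
            return is_string($mime) ? $mime : null;
        },
        'exif' => function (string $p): ?string {
            if (!function_exists('exif_imagetype') || !function_exists('image_type_to_mime_type')) {
                return null;
            }
            $type = @exif_imagetype($p);
            return $type === false ? null : image_type_to_mime_type($type);
        },
        'signature' => function (string $p): ?string {
            // Trimmed to two formats to keep the sketch short.
            $head = (string) file_get_contents($p, false, null, 0, 8);
            if (strncmp($head, "\x89PNG\r\n\x1A\n", 8) === 0) {
                return 'image/png';
            }
            if (strncmp($head, 'GIF8', 4) === 0) {
                return 'image/gif';
            }
            return null;
        },
        'extension' => function (string $p): ?string {
            $map = [
                'png' => 'image/png', 'gif' => 'image/gif', 'jpg' => 'image/jpeg',
                'jpeg' => 'image/jpeg', 'webp' => 'image/webp', 'bmp' => 'image/bmp',
                'ico' => 'image/x-icon', 'svg' => 'image/svg+xml',
            ];
            $ext = strtolower(pathinfo($p, PATHINFO_EXTENSION));
            return $map[$ext] ?? null;
        },
    ];

    foreach ($detectors as $method => $detect) {
        $mime = $detect($path);
        // Assumption: only an image/* answer ends the search.
        if ($mime !== null && strncmp($mime, 'image/', 6) === 0) {
            return ['mime' => $mime, 'method' => $method];
        }
    }
    return null;
}
```

Returning the method alongside the type mirrors the method=signature field in the trace line above and makes it easy to see which step produced the answer.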

202306071311:

It will likely be a few days before I do another git push as the next one is a big one:

202306062230:

202306042323:

202306021445:

202306011529:

202305312016:

202305281757:

202305251719:

202305242106:

202305241803:

202305241420:

202305221634:

202305231619:

Stuff being worked on:

(I'm keeping my GitHub fork up to date as I work on stuff, assuming it's not throwing horrible errors.)

Issues:

Before pull request:

Other Tasks:

Notes:

LeeThompson commented 1 year ago

--help output as of 202306011529

Usage: get-fav.php (Switches)

Available APIs: faviconkit, favicongrabber, google, iconhorse (get-fav-api.ini)
Lists can be separated with space, comma or semi-colon.

--configfile=FILE           Pathname to read for configuration.
--list=FILE/LIST            Pathname or a delimited list of URLs to check.
--blocklist=FILE/LIST       Pathname or a delimited list of MD5 hashes to block.
--validtypes=FILE/LIST      Valid icon types (default is gif,webp,png,ico,bmp,svg,jpg)
--logfile=FILE              Pathname for log file (default is get-fav.log)
--path=PATH                 Location to store icons (default is ./)
--size=NUMBER               Try to get icon size (default is 16)

--tryhomepage               Try homepage first, then APIs. (default is true)
--onlyuseapis               Only use APIs.
--disableapis               Don't use APIs.
--enableblocklist           Enable blocklist. (default is true)
--disableblocklist          Disable blocklist.
--store                     Store favicons locally. (default is true)
--nostore                   Do not store favicons locally.
--overwrite                 Overwrite local favicons. (default is false)
--skip                      Skip local favicons.
--removetld                 Remove top level domain from filename. (default is false)
--noremovetld               Don't remove top level domain from filename.
--tenacious                 Try all enabled APIs until success. (default is false)
--notenacious               Try a random API.
--allowoctetstream          Allow MimeType 'application/octet-stream'. (default is false)
--disallowoctetstream       Block MimeType 'application/octet-stream' for icons.
--consolemode               Force console output.
--noconsolemode             Force HTML output.
--debug                     Enable debug mode.
--help                      This listing and exit.
--version                   Show version and exit.

Advanced:
--user-agent=AGENT_STRING   Customize the user agent.
--nocurl                    Disable cURL.
--bufferhttp                Buffer HTTP page loading. (default is true)
--nobufferhttp              Disable HTTP page load buffering.
--curl-verbose              Enable cURL verbose.
--curl-progress             Enable cURL progress bar.
--enableapis=FILE/LIST      Filename or a delimited list of APIs to enable.
--disableapis=FILE/LIST     Filename or a delimited list of APIs to disable.
--http-timeout=SECONDS      Set HTTP timeout. (default is 60).
--connect-timeout=SECONDS   Set HTTP connect timeout. (default is 30).
--dns-timeout=SECONDS       Set DNS lookup timeout. (default is 120).

Logging:
--log                       Enable debug logging. (default is false)
--nolog                     Disable debug logging.
--append                    Append debug log. (default is true)
--noappend                  Always overwrite debug log.
--timestamp                 Enable debug log timestamps. (default is true)
--notimestamp               Do not show timestamps in debug log.
--loglevel=NUMBER           Set debug logging level. (default is 255)

Console:
--level=NUMBER              Set debug logging level. (default is 31)
--showtimestamp             Enable debug log timestamps. (default is false)
--hidetimestamp             Do not show timestamps in debug log.

Notes:

LeeThompson commented 1 year ago

Configuration Files

Configuration files use the INI file format. Each value is optional. Comments can be added with a leading ";". Complex strings need to be quoted (see the useragent entry below).

[files]
overwrite=true
store=true
local_path=./

[http]
try_homepage=true
http_timeout=60
useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0"

[curl]
enabled=true

[global]
debug=true
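
As a rough illustration of "each value is optional", the sketch below parses a shortened version of the sample with PHP's parse_ini_string and falls back to a default when a key is missing. The helper name configValue is hypothetical, and this is not necessarily how get-fav.php reads its configuration.

```php
<?php
// Sketch: parse an INI configuration and fall back to defaults for missing
// values. The helper name is hypothetical.
$ini = <<<'INI'
[files]
overwrite=true
local_path=./

[http]
; quoted because the value contains parentheses and semicolons
useragent="Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/113.0"
http_timeout=60
INI;

function configValue(array $config, string $section, string $key, $default)
{
    // A missing section or key simply yields the default.
    return $config[$section][$key] ?? $default;
}

// INI_SCANNER_TYPED turns "true" into a boolean and "60" into an integer.
$config = parse_ini_string($ini, true, INI_SCANNER_TYPED);

$overwrite   = configValue($config, 'files', 'overwrite', false);
$curlEnabled = configValue($config, 'curl', 'enabled', true); // [curl] is absent here
```

Because the [curl] section is missing from this shortened sample, $curlEnabled takes the default passed by the caller.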

LeeThompson commented 1 year ago

Notes on the blocklist concept:

get-fav already does this for the Google API and its default icon. The blocklist simply extends that to a list of MD5 hashes of other icons for the program to ignore.
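
A minimal sketch of the idea, assuming the blocklist is an array of MD5 hex strings and the downloaded icon is held in memory. The function name isBlockedIcon and the sample data are made up for the example.

```php
<?php
// Hypothetical sketch of an MD5 blocklist check; names are illustrative only.
function isBlockedIcon(string $iconBytes, array $blockedHashes): bool
{
    // Normalise case so hashes pasted in upper case still match.
    $blocked = array_map('strtolower', $blockedHashes);
    return in_array(md5($iconBytes), $blocked, true);
}

// Stand-in data: a real blocklist would hold the hashes of known default icons.
$placeholder = 'pretend these are the bytes of a generic default icon';
$blocklist   = [strtoupper(md5($placeholder))];

$skipPlaceholder = isBlockedIcon($placeholder, $blocklist);
$skipRealIcon    = isBlockedIcon('bytes of a real site icon', $blocklist);
```

Hashing the file contents rather than comparing URLs means the same placeholder image is caught no matter which API returned it.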

LeeThompson commented 1 year ago

get-fav-api.ini format

Field     Description
name      ID of the definition (used for enable/disable)
display   Cosmetic Display Name (defaults to name)
url       API URL (if it contains = and certain other characters it needs to be quoted)
json      Does API return json format?
apikey    Does the API require a key? (not tested)
enabled   Is this definition enabled?

If a JSON structure is used, it is defined with json_structure[field] = "item" entries in the section, for example:

json_structure[icons] = "icons"
json_structure[link] = "src"
json_structure[sizeWxH] = "sizes"
json_structure[mime] = "type"
json_structure[error] = "error"
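
To show how such a mapping might be applied, the sketch below parses the entries above with parse_ini_string (which turns the bracket syntax into a PHP array) and uses the result to pick fields out of a decoded response. The response body and the function name firstIconFromJson are invented for the example; the real API output and the real lookup code may differ.

```php
<?php
// Sketch: use a json_structure mapping to read fields from an API response.
$sectionText = <<<'INI'
json_structure[icons] = "icons"
json_structure[link] = "src"
json_structure[sizeWxH] = "sizes"
json_structure[mime] = "type"
json_structure[error] = "error"
INI;

$section   = parse_ini_string($sectionText);
$structure = $section['json_structure']; // ['icons' => 'icons', 'link' => 'src', ...]

// Made-up response shaped to match the mapping above.
$response = '{"icons":[{"src":"https://example.com/favicon.png","sizes":"16x16","type":"image/png"}]}';

function firstIconFromJson(string $json, array $structure): ?array
{
    $data = json_decode($json, true);
    // Give up on invalid JSON or when the API reported an error.
    if (!is_array($data) || isset($data[$structure['error']])) {
        return null;
    }
    $icons = $data[$structure['icons']] ?? [];
    if (!is_array($icons) || !isset($icons[0][$structure['link']])) {
        return null;
    }
    $icon = $icons[0];
    return [
        'link' => $icon[$structure['link']],
        'size' => $icon[$structure['sizeWxH']] ?? null,
        'mime' => $icon[$structure['mime']] ?? null,
    ];
}

$icon = firstIconFromJson($response, $structure);
```

Keeping the field names in the .ini file means a new JSON API with different key names can be described without touching the PHP code.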

Supported Fields are (so far):

Sample:

;
; PHP-Grab-Favicon
; APIs
;

[faviconkit]
display=FavIconKit
name=faviconkit
url=https://api.faviconkit.com/<DOMAIN>/<SIZE>
json=false
enabled=true

[favicongrabber]
display=FavIconGrabber
name=favicongrabber
url=http://favicongrabber.com/api/grab/<DOMAIN>
json=true
enabled=true
json_structure[icons] = "icons"
json_structure[link] = "src"
json_structure[sizeWxH] = "sizes"
json_structure[mime] = "type"
json_structure[error] = "error"

[google]
display=Google
name=google
url="http://www.google.com/s2/favicons?domain=<DOMAIN>&sz=<SIZE>"
json=false
enabled=true

[iconhorse]
display=Icon Horse
name=iconhorse
url=https://icon.horse/icon/<DOMAIN>
json=false
enabled=true
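
The <DOMAIN> and <SIZE> placeholders in the url entries suggest a simple template substitution. A minimal sketch, assuming plain string replacement; the function name buildApiUrl is hypothetical:

```php
<?php
// Hypothetical sketch: fill the <DOMAIN> and <SIZE> placeholders of an API URL.
function buildApiUrl(string $template, string $domain, int $size): string
{
    // strtr replaces every placeholder in one pass; templates that lack
    // <SIZE> (iconhorse, favicongrabber) are left unchanged for that key.
    return strtr($template, [
        '<DOMAIN>' => rawurlencode($domain),
        '<SIZE>'   => (string) $size,
    ]);
}

$googleUrl = buildApiUrl(
    'http://www.google.com/s2/favicons?domain=<DOMAIN>&sz=<SIZE>',
    'example.com',
    16
);
$horseUrl = buildApiUrl('https://icon.horse/icon/<DOMAIN>', 'example.com', 16);
```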

LeeThompson commented 1 year ago

Debug Log File Information

Define          Value  Description
TYPE_ALL        1      Should always be output
TYPE_NOTICE     2      Important information
TYPE_WARNING    4      Potential issue
TYPE_VERBOSE    8      Extra information
TYPE_ERROR      16     Something has gone wrong
TYPE_DEBUGGING  32     Debug message, usually tops of functions
TYPE_TRACE      64     Extra debug messaging, usually sub/helper functions
TYPE_SPECIAL    128    Special debug messaging, usually sub/helper functions

The "shipping" default is 31, which is everything except the debugging, trace and special types (1 + 2 + 4 + 8 + 16).
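
Since the values are powers of two, a level is naturally a bitmask. The sketch below shows one way a message type could be tested against a configured level. The constant names come from the table above, but the function shouldLog and the plain bitwise-AND rule are assumptions; the real code may treat TYPE_ALL specially.

```php
<?php
// Message types from the table above, one bit each.
const TYPE_ALL       = 1;
const TYPE_NOTICE    = 2;
const TYPE_WARNING   = 4;
const TYPE_VERBOSE   = 8;
const TYPE_ERROR     = 16;
const TYPE_DEBUGGING = 32;
const TYPE_TRACE     = 64;
const TYPE_SPECIAL   = 128;

// Hypothetical helper: a message is written when its bit is set in the level.
function shouldLog(int $configuredLevel, int $messageType): bool
{
    return ($configuredLevel & $messageType) !== 0;
}

// The "shipping" default: the five lowest bits.
$shippingLevel = TYPE_ALL | TYPE_NOTICE | TYPE_WARNING | TYPE_VERBOSE | TYPE_ERROR;

$logsErrors = shouldLog($shippingLevel, TYPE_ERROR);
$logsTrace  = shouldLog($shippingLevel, TYPE_TRACE);
```

With this scheme 255 sets all eight bits, which matches the advice to use 255 when you want everything.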

By default the timestamp uses the format Y-m-d H:i:s, which looks like 2023-05-25 17:27:39. There isn't a switch to change it, but it can be changed in the .ini file.

The default log separator, used when appending to an existing log file, is 80 *'s. It cannot be changed via a switch either, but it can also be changed in the .ini file:

[logging]
timestampformat="Y-m-d H:i:s"
separator=(whatever)
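
For reference, the two defaults can be reproduced in a few lines of PHP. This is only a sketch: the function name formatLogLine is hypothetical, and the "timestamp [LEVEL] message" layout is inferred from the sample trace line earlier in this thread.

```php
<?php
// Defaults described above: the timestamp format and an 80-asterisk separator.
$timestampFormat = 'Y-m-d H:i:s';
$separator       = str_repeat('*', 80);

// Hypothetical helper producing "timestamp [LEVEL] message".
function formatLogLine(DateTimeInterface $when, string $format, string $level, string $message): string
{
    return sprintf('%s [%s] %s', $when->format($format), $level, $message);
}

// A fixed time in an explicit time zone keeps the example reproducible.
$when = new DateTimeImmutable('2023-05-25 17:27:39', new DateTimeZone('UTC'));
$line = formatLogLine($when, $timestampFormat, 'TRACE', 'example message');
```

The format string uses the same letters as PHP's date(), so any format accepted there should work for timestampformat, assuming the value is passed through unchanged.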

Switches:

Files:

Switch                             Description
--loglevel=NUMBER                  Log level to use; for everything, you generally want 255
--logfile=FILE                     Pathname for log file (default is get-fav.log)
--log / --nolog                    Enable/disable the log file
--append / --noappend              Enable/disable appending to the log file
--timestamp / --notimestamp        Use timestamps in the log file or not

Console:

Switch                             Description
--level=NUMBER                     Log level to use; for everything, you generally want 255
--showtimestamp / --hidetimestamp  Use timestamps on the console

Configuration Options:

[logging]
enabled=true/false
append=true/false
level=value
pathname=filename or full path
separator=separator to use when appending
timestamp=true/false
timestampformat="Y-m-d H:i:s"

[console]
enabled=true/false
level=value
timestamp=true/false
timestampformat="Y-m-d H:i:s"

Notes:

LeeThompson commented 1 year ago

Proposed Web Variables

Variable     Internal/INI File  Switch   Comments
GETFAVDEBUG  debug              --debug  Enables special debug mode