fossar / selfoss

multipurpose rss reader, live stream, mashup, aggregation web application
https://selfoss.aditu.de
GNU General Public License v3.0
2.37k stars 343 forks source link

link rel="SHORTCUT ICON" not recognized #1122

Closed Themroc closed 4 years ago

Themroc commented 5 years ago

Some feeds lack a <channel> <image> tag, but do provide a <channel><link>, where usually a favicon can be found. Would be nice if that'd be used. Now, unfortunately, many pages still use the traditional .ico format, which i guess would cause lots of trouble. So, lets convert it to .png. Here

https://www.phpclasses.org/package/3906-PHP-Read-and-write-images-from-ICO-files.html

is a class to do that, but it comes with a special license. I'd do the coding, but would you accept such a PR? The license reads:


  • Original floIcon copyright (C) 2007 by Joshua Hatfield. *
  • All Rights Reserved *


                       floIcon License
    
                  Program & Concept created by

Joshua Hatfield 2368 Franklin St. Augusta, GA 30906 USA (email flobi@flobi.com)

This document contains the rules by which you can use, alter or publish parts of floIcon. floIcon has been created by Joshua Hatfield in his spare time. You are legally bound to follow the rules described in this document.

Rules:

!! floIcon is NOT Public Domain, shareware, careware or the like !!

You may under no circumstances make profit on ANY part of floIcon in any possible way. You may under no circumstances charge money for distributing any part of floIcon - this includes the usual $5 charge for "sending the disk" or "just for the disk" etc. By breaking these rules you violate this agreement, and hence will be sued.

You may not remove any copyright notices from any of the documents or sources given to you.

This license must always be included "as is" if you copy or give away any part of floIcon (which is to be done as described in this document).

If you publish any part of floIcon, I, as creator, must appear in the article, and the article must be clearly copyrighted subject to this license. Before publishing you must first send me a message, by snail-mail or e-mail, and inform me what, where and when you are publishing (remember to include your address, name etc.)

Any running version of floIcon must include my name in the login and/or intitial connection sequence. Furthermore the if there is a credit section or command in any application using floIcon, it shall always contain my name, addresses, and a notice which states I have created floIcon.

You are allowed to alter floIcon, source and documentation as long as you do not violate any of the above stated rules.

Regards,

Joshua Hatfield

Note:

I hope you will enjoy floIcon, and encourage you to send me any reports on bugs (when you find 'it'). Remember that I am using my spare time to write and improve floIcon, bugs, etc. - and changes will take their time. I have so far put a number of programming hours into this project. If you make any major improvements on floIcon I would be happy to hear from you. As you will naturally honor the above rules, you will receive new updates and improvements made to the class.

jtojnar commented 5 years ago

Some feeds lack a <channel> <image> tag, but do provide a <channel><link>, where usually a favicon can be found. Would be nice if that'd be used.

I believe we actually prefer the favicon of <link> to <image>:

https://github.com/SSilence/selfoss/blob/19a173afce43a6bc8e400c8a82607ac5a560cef9/spouts/rss/feed.php#L230-L233

Now, unfortunately, many pages still use the traditional .ico format, which i guess would cause lots of trouble. So, lets convert it to .png.

We are already doing so using https://github.com/lordelph/icofileloader:

https://github.com/SSilence/selfoss/blob/19a173afce43a6bc8e400c8a82607ac5a560cef9/helpers/Image.php#L127-L156

Here

https://www.phpclasses.org/package/3906-PHP-Read-and-write-images-from-ICO-files.html

is a class to do that, but it comes with a special license.

We actually used floIcon before but had to switch to icofileloader due to floIcon’s non-free license and unavailability on composer.

Do you have a specific feed that does not have an icon displayed as expected?

Themroc commented 5 years ago

Wow! Great, that that should have been solved already. However, this

https://www.photovoltaik.eu/gentner.dll/RSS?MID=30021&RSS_STYLEID=574806

feed shows up like this:

Untitled

Themroc commented 5 years ago

How to get output from

\F3::get('logger')->debug()

?

jtojnar commented 5 years ago

You can set the logger_level in config.ini. You can also change DEBUG to 1 in common.php to record HTTP request bodies.

Themroc commented 5 years ago

Thanx! Gonna find out, where it gets stuck...

Themroc commented 5 years ago

Ok, their homepage says

< link rel="SHORTCUT ICON" href="https://www.photovoltaik.eu/gentner/photovoltaik/images/favicon2013.ico" / >

but the log says

2019-08-06 16:58:12] selfoss.ERROR: failed to retrieve image http://www.photovoltaik.eu/favicon.ico, {"exception":"[object] (GuzzleHttp\Exception\ClientException(code: 404): Cl (truncated...)

Makes me guess the code doesn't actually read the homepage. A file_get_contents() and preg_match() or 2 should fix that...

Themroc commented 5 years ago

Nope, all done already! Double wow! Only it doesn't ignore upper/lower case...

jtojnar commented 5 years ago

You are right, I must have broken it in 6e0494c229803674499c634a3e3b2a4906add78f

Themroc commented 5 years ago

:) Looks like the fewer php extensions you use, the better off you are...

Now, i found this https://stackoverflow.com/questions/24183701/xpath-lowercase-is-there-xpath-function-to-do-this but dunno how to implement it in php... You like to do it, or should i try?

jtojnar commented 5 years ago

I found https://stackoverflow.com/questions/2893551/case-insensitive-matching-in-xpath and even more interesting https://medium.com/simplepie-ng/php-domdocument-xpath-1-0-case-insensitivity-and-performance-ad962b98e71c

If you could take a jab at it, I would be thankful.

jtojnar commented 5 years ago

By the way, I also noticed that the feed displays umlauts incorrectly. Apparently, https://github.com/fossar/guzzle-transcoder which we use does not handle */*+xml MIME types. I am working on a fix for that.

Themroc commented 5 years ago

Thanks for the links, really interesting. However, it seems to me as if performance doesn't really matter anymore once you use xpath for this. I've done some tests with the old parseShortcutIcon() (the one before 6e0494c), the current xpath-one and this

    function parseShortcutIcon ($html) {
        if (preg_match('!<link\b[^>]*\brel=("|\')(apple-touch-icon|shortcut icon|icon)\1[^>]*>!iU', $html, $links) < 1)
            return false;
        if (preg_match('!\shref=("|\')(.+)\1!iU', $links[0], $href) < 1)
            return false;

        return $href[2];
    }

new one. With 200 runs on 15 pages i get:

old parseShortcutIcon(): Time: 0.971s 12 favicons ok

current parseShortcutIcon(): Time: 45.901s 14 favicons ok

new parseShortcutIcon(): Time: 0.144s 15 favicons ok

Pages were

$urls= [
    'https://taz.de',
    'https://www.sueddeutsche.de',
    'https://www.neues-deutschland.de',
    'https://www.klimareporter.de',
    'https://fridaysforfuture.de',
    'https://www.spiegel.de',
    'https://medium.com',
    'https://www.php.net',
    'https://stackoverflow.com',
    'https://github.com',
    'http://www.chris-floyd.com',
    'https://www.antiwar.com',
    'https://www.wsws.org',
    'https://caitlinjohnstone.com',
    'https://www.photovoltaik.eu',
];

What do you think?

Themroc commented 5 years ago

That was with PHP 7.0.33. PHP 5.6.30 gave the same outcome, but a bit different times:

Time: 5.673s Time: 45.919s Time: 0.341s

jtojnar commented 5 years ago

Hmm, that is quite significant performance regression, I guess we have to revert to the regex parsing :disappointed:

I am not sure about the single regex, though, as that approach will use the first matched (possibly a shortcut icon), instead of the apple-touch-icon that is likely a higher resolution. Maybe a compromise with one regex for apple-touch-icon icon and a second one for (shortcut )?icon would work?

jtojnar commented 5 years ago

For completeness, here are my results of the benchmark:

string(18) "parseShortcutIcon1"
float(0.80524301528931)
string(18) "parseShortcutIcon2"
float(22.608789920807)
string(18) "parseShortcutIcon3"
float(0.12911605834961)
Source code ```php ]*\brel=("|\')(apple-touch-icon|shortcut icon|icon)\1[^>]*>!iU', $html, $links) < 1) return false; if (preg_match('!\shref=("|\')(.+)\1!iU', $links[0], $href) < 1) return false; return $href[2]; } function parseShortcutIcon2($html) { $dom = new \DomDocument(); if (@$dom->loadHTML($html) !== true) { return null; } $xpath = new \DOMXPath($dom); $elems = $xpath->query("//link[@rel='apple-touch-icon']"); if ($elems->length === 0) { $elems = $xpath->query("//link[@rel='shortcut icon']"); } if ($elems->length === 0) { $elems = $xpath->query("//link[@rel='icon']"); } if ($elems->length > 0) { $icon = $elems->item(0); if ($icon->hasAttribute('href')) { return $icon->getAttribute('href'); } } return null; } function parseShortcutIcon1($html) { $result = preg_match('//Ui', $html, $match1); if ($result == 0) { $result = preg_match('/]*rel=("|\')shortcut icon\1.*>/Ui', $html, $match1); } if ($result == 0) { $result = preg_match('/]*rel=("|\')icon\1.*>/Ui', $html, $match1); } if ($result > 0) { $result = preg_match('/href=("|\')(.+)\1/Ui', $match1[0], $match2); if ($result > 0) { return $match2[2]; } } return false; } for ($i=1; $i <= 3; $i++) { $function = "parseShortcutIcon$i"; $start = microtime(true); for ($t=0; $t < 200; $t++) { foreach ($texts as $text) { $function($text); } } $end = microtime(true); var_dump($function, $end - $start); } ```
Themroc commented 5 years ago

AFAICT, those icons are displayed with only 16x16 Pixels anyway, so the 1st one should be sufficient. But it would be easy to use preg_match_all(). This one

    function parseShortcutIcon2($html) {
        $end= strripos($html, '</head>');
        if ($end > 1)
            $html= substr($html, 0, $end);

        if (preg_match_all('!<link\b[^>]*\brel=("|\')(apple-touch-icon|shortcut icon|icon)\1[^>]*>!iU', $html, $links) < 1)
            return null;

        //TODO: sort array descending by sizes-attribute

        foreach ($links[0] as $l) {
            if (preg_match('!\shref=("|\')(.+)\1!iU', $l, $href))
                return $href[2];
        }

        return null;
    }

takes 2.373s (old one: 0.975s). Still way faster than xpath.

Themroc commented 5 years ago

With php 5 it is

parseShortcutIcon0: Time: 5.651s parseShortcutIcon2: Time: 7.298s

Same league, i'd say.

Themroc commented 5 years ago

A more elegant (and fully working) variant of the old one:

    function parseShortcutIcon4($html) {
        if ((preg_match('/<link\b[^>]*\brel=("|\')apple-touch-icon\1[^>]*>/Ui', $html, $match)
            || preg_match('/<link\b[^>]*\brel=("|\')shortcut icon\1[^>]*>/Ui', $html, $match)
            || preg_match('/<link\b[^>]*\brel=("|\')icon\1[^>]*>/Ui', $html, $match)
            )
            && preg_match('/href=("|\')(.+)\1/Ui', $match[0], $result)
        )
            return $result[2];

        return null;
    }

gives:

** parseShortcutIcon4: Time: 0.540s Result for 'https://taz.de': '/apple-icon-57x57.png'. Result for 'https://www.sueddeutsche.de': 'https://www.sueddeutsche.de/pagelayout/assets/img/favicon/apple-touch-icon.png'. Result for 'https://www.neues-deutschland.de': '/apple-touch-icon.png'. Result for 'https://www.klimareporter.de': '/templates/klimareporter/favicon.ico'. Result for 'https://fridaysforfuture.de': 'https://fridaysforfuture.de/wp-content/uploads/2019/04/cropped-icon-32x32.png'. Result for 'https://www.spiegel.de': '/static/sys/v12/logo/favicon/touch-icon-iphone.png'. Result for 'https://medium.com': 'https://cdn-images-1.medium.com/fit/c/304/304/18I-HPL0bfoIzGied-dzOvA.png'. Result for 'https://www.php.net': 'https://www.php.net/favicon.ico'. Result for 'https://stackoverflow.com': 'https://cdn.sstatic.net/Sites/stackoverflow/img/favicon.ico?v=4f32ecc8f43d'. Result for 'https://github.com': 'https://github.githubassets.com/favicon.ico'. Result for 'http://www.chris-floyd.com': '/favicon.ico'. Result for 'https://www.antiwar.com': 'https://dgxhtav2e25a8.cloudfront.net/favicon.ico'. Result for 'https://www.wsws.org': '/img/apple-touch-icon.png'. Result for 'https://caitlinjohnstone.com': 'https://i1.wp.com/caitlinjohnstone.com/wp-content/uploads/2017/07/cropped-caitlinpic1.jpg?fit=32%2C32&#038;ssl=1'. Result for 'https://www.photovoltaik.eu': 'https://www.photovoltaik.eu/gentner/photovoltaik/images/favicon2013.ico'.

Themroc commented 5 years ago

Even better:

    function parseShortcutIcon5($html) {
        return (
            (preg_match('/<link\b[^>]*\brel=("|\')apple-touch-icon\1[^>]*>/Ui', $html, $match)
                || preg_match('/<link\b[^>]*\brel=("|\')shortcut icon\1[^>]*>/Ui', $html, $match)
                || preg_match('/<link\b[^>]*\brel=("|\')icon\1[^>]*>/Ui', $html, $match) )
            && preg_match('/href=("|\')(.+)\1/Ui', $match[0], $result)
            )
               ? $result[2]
               : null;
    }

Same performance as 4.

jtojnar commented 5 years ago

AFAICT, those icons are displayed with only 16x16 Pixels anyway

That depends on a client and a device. For example on HiDPI displays in web client (e.g. retina iPhone) 32×32 px would be used.

Same league, i'd say.

Yeah, I would say that in majority of cases, the you will have to search the whole page, as most of websites do not have all three variants.

jtojnar commented 5 years ago

Here are the stats for some extra variants, as well as the usage of each link type on the sample sites:

https://taz.de                     9    0   4
https://www.sueddeutsche.de        1    1   2
https://www.neues-deutschland.de   9    1   5
https://www.klimareporter.de       0    1   0
https://fridaysforfuture.de        0    0   2
https://www.spiegel.de             7    1   0
https://medium.com                 4    0   1
https://www.php.net                0    1   0
https://stackoverflow.com          0    1   0
https://github.com                 0    0   1
http://www.chris-floyd.com         0    1   1
https://www.antiwar.com            0    1   0
https://www.wsws.org               1    0   1
https://caitlinjohnstone.com       0    0   0
https://www.photovoltaik.eu        0    1   0
parseShortcutIcon1: 0.7996039390564 s
parseShortcutIcon2: 23.398230075836 s
parseShortcutIcon3: 0.12495994567871 s
parseShortcutIcon4: 0.82246804237366 s
parseShortcutIcon5: 22.572862863541 s
parseShortcutIcon6: 0.6842520236969 s
parseShortcutIcon7: 0.53298807144165 s
parseShortcutIcon8: 0.53591418266296 s

It looks like we will need to use preg_match_all after all, and select the best size manually:

<link rel="apple-touch-icon" sizes="180x180" href="https://www.sueddeutsche.de/pagelayout/assets/img/favicon/apple-touch-icon.png" />
<link rel="icon" type="image/png" sizes="32x32" href="https://www.sueddeutsche.de/pagelayout/assets/img/favicon/favicon-32x32.png" />
<link rel="icon" type="image/png" sizes="16x16" href="https://www.sueddeutsche.de/pagelayout/assets/img/favicon/favicon-16x16.png" />
<link rel="manifest" href="https://www.sueddeutsche.de/pagelayout/assets/manifest.json" />
<link rel="mask-icon" href="https://www.sueddeutsche.de/pagelayout/assets/img/favicon/safari-pinned-tab.svg" color="#ffffff" />
<link rel="shortcut icon" href="https://www.sueddeutsche.de/pagelayout/assets/img/favicon/favicon.ico" />
Source code ```php $text) { echo $url, str_repeat(' ', $maxUrlLength - strlen($url) + 3), preg_match_all('//Ui', $text), "\t", preg_match_all('/]*rel=("|\')shortcut icon\1.*>/Ui', $text), "\t", preg_match_all('/]*rel=("|\')icon\1.*>/Ui', $text) . PHP_EOL; } function parseShortcutIcon8($html) { return ( (preg_match('/]*\brel=("|\')apple-touch-icon\1[^>]*>/Ui', $html, $match) || preg_match('/]*\brel=("|\')shortcut icon\1[^>]*>/Ui', $html, $match) || preg_match('/]*\brel=("|\')icon\1[^>]*>/Ui', $html, $match) ) && preg_match('/href=("|\')(.+)\1/Ui', $match[0], $result) ) ? $result[2] : null; } function parseShortcutIcon7($html) { if ((preg_match('/]*\brel=("|\')apple-touch-icon\1[^>]*>/Ui', $html, $match) || preg_match('/]*\brel=("|\')shortcut icon\1[^>]*>/Ui', $html, $match) || preg_match('/]*\brel=("|\')icon\1[^>]*>/Ui', $html, $match) ) && preg_match('/href=("|\')(.+)\1/Ui', $match[0], $result) ) return $result[2]; return null; } function parseShortcutIcon6($html) { $result = preg_match('//Ui', $html, $match1); if ($result == 0) { $result = preg_match('/]*rel=("|\')(shortcut )?icon\1.*>/Ui', $html, $match1); } if ($result > 0) { $result = preg_match('/href=("|\')(.+)\1/Ui', $match1[0], $match2); if ($result > 0) { return $match2[2]; } } return false; } function parseShortcutIcon5($html) { $dom = new \DomDocument(); if (@$dom->loadHTML($html) !== true) { return null; } $xpath = new \DOMXPath($dom); $elems = $xpath->query("//head/link[@rel='apple-touch-icon']"); if ($elems->length === 0) { $elems = $xpath->query("//head/link[@rel='shortcut icon']"); } if ($elems->length === 0) { $elems = $xpath->query("//head/link[@rel='icon']"); } if ($elems->length > 0) { $icon = $elems->item(0); if ($icon->hasAttribute('href')) { return $icon->getAttribute('href'); } } return null; } function parseShortcutIcon4($html) { $end= strripos($html, ''); if ($end > 1) $html= substr($html, 0, $end); if (preg_match_all('!]*\brel=("|\')(apple-touch-icon|shortcut icon|icon)\1[^>]*>!iU', $html, $links) < 1) return null; //TODO: sort array descending by sizes-attribute foreach ($links[0] as $l) { if (preg_match('!\shref=("|\')(.+)\1!iU', $l, $href)) return $href[2]; } return null; } function parseShortcutIcon3($html) { if (preg_match('!]*\brel=("|\')(apple-touch-icon|shortcut icon|icon)\1[^>]*>!iU', $html, $links) < 1) return false; if (preg_match('!\shref=("|\')(.+)\1!iU', $links[0], $href) < 1) return false; return $href[2]; } function parseShortcutIcon2($html) { $dom = new \DomDocument(); if (@$dom->loadHTML($html) !== true) { return null; } $xpath = new \DOMXPath($dom); $elems = $xpath->query("//link[@rel='apple-touch-icon']"); if ($elems->length === 0) { $elems = $xpath->query("//link[@rel='shortcut icon']"); } if ($elems->length === 0) { $elems = $xpath->query("//link[@rel='icon']"); } if ($elems->length > 0) { $icon = $elems->item(0); if ($icon->hasAttribute('href')) { return $icon->getAttribute('href'); } } return null; } function parseShortcutIcon1($html) { $result = preg_match('//Ui', $html, $match1); if ($result == 0) { $result = preg_match('/]*rel=("|\')shortcut icon\1.*>/Ui', $html, $match1); } if ($result == 0) { $result = preg_match('/]*rel=("|\')icon\1.*>/Ui', $html, $match1); } if ($result > 0) { $result = preg_match('/href=("|\')(.+)\1/Ui', $match1[0], $match2); if ($result > 0) { return $match2[2]; } } return false; } for ($i=1; $i <= 8; $i++) { $function = "parseShortcutIcon$i"; $start = microtime(true); for ($t=0; $t < 200; $t++) { foreach ($texts as $text) { $function($text); } } $end = microtime(true); echo $function, ": ", ($end - $start), ' s', PHP_EOL; } ```
jtojnar commented 5 years ago

Probably an ideal solution from speed & correctness perspective would be SAX-like HTML parser.

$inHead = false;
$bestLink = null;

function handleSaxEvent($event) {
    global $inHead, $bestLink;
    if ($event->type === ELEMENT_START && $event->elemName === 'head') {
        $inHead = true;
    } else if ($inHead && $event->type === ELEMENT_START && $event->elemName === 'link') {
        if ($event->args better that $bestLink) {
            // update $bestLink
        }
    } else if ($event->type === ELEMENT_END && $event->elemName === 'head') {
        return false; // stop processing
    }
}

saxParser($body, 'handleSaxEvent');

Not sure if there is a fast PHP implementation, though. Does https://www.php.net/manual/en/class.xmlreader.php support HTML? Otherwise we would need to use https://packagist.org/packages/php-kit/html5-tools or https://packagist.org/packages/diggin/diggin-htmlsax.

For now, the preg_match_all variant should be sufficient, unless someone wants to investigate these options.

Themroc commented 5 years ago

I doubt that any real HTML parser (even a one in a c-library) will be much faster than parseShortcutIcon4. Also, there's a ton of old, weird and buggy pages out there, regex handles them all. Do those parsers? They only introduce unnecessary complexity, IMO.

So, i'll test some more websites with that preg_match_all(). Old and obscure ones preferred, with a few right-wingers and coal lobbyists for balance...

You want the biggest icon available, right?

jtojnar commented 5 years ago

My concern was primarily things like <!-- <link rel="icon"… and CDATA. You cannot handle it with pure regexes. But perhaps that is a corner case that we can get away with not handling.

Yeah, the biggest icon should be okay. We can come up with more advanced heuristics like “smallest icon ≥ 64×64, or biggest icon when no such is available” later, for now the parity with the current state is enough. We should probably filter out SVG icons ([@type=image/svg+xml]) as we cannot currently handle them.

Themroc commented 5 years ago

HTML comments should be easy, no? Something like $html= preg_replace('/<!--.*-->/sU', '', $html) only that the 's'-modifier in php never works...

Anyway, i've added a bunch of urls and made preg_match_all- version that picks the biggest icon if "sizes"-attribute is present. If not, the apple one is preferred. Assumption is that if "sizes" really contains what its name suggests, that the biggest one comes first. But luckily, until now, there was always just one size in there. Problems:

Sry, i'm too dumb to upload a php file here... So:

<?php

# Careful! This writes a lot of files to curdir

$urls= [
    'https://taz.de',
    'https://www.sueddeutsche.de',
    'https://www.neues-deutschland.de',
    'https://www.klimareporter.de',
    'https://fridaysforfuture.de',
    'https://www.spiegel.de',
    'https://medium.com',
    'https://www.php.net',
    'https://stackoverflow.com',
/*
 focin http item on https page again... too lazy to fix now...
    'http://www.chris-floyd.com',
    'http://en.people.cn',
    'http://www.xinhuanet.com',
    'http://www.daelim.co.kr',
    'https://www.aboveandbeyond.nu',
*/
/*
 read error
    'https://www.almasdarnews.com',
    'https://consortiumnews.com',
    'https://rassemblementnational.fr',
    'https://www.africanews.com',
    'https://en.mercopress.com',
    'https://www.news.com.au',
    'https://www.kcna.kp',
    'https://www.voynich.nu',
    'https://nigerposte.ne',
    'https://jiji.ng',
    'https://kimcartoon.yt',
*/
/*
 requires "continue to site" click or login
    'https://www.npr.org',
    'https://pfm.gifmis.gov.ng',
*/
    'https://www.antiwar.com',
    'https://www.wsws.org',
    'https://caitlinjohnstone.com',
    'https://www.photovoltaik.eu',
    'https://www.jungewelt.de',
    'https://orientalreview.org',
    'https://www.tomdispatch.com',
    'https://www.moonofalabama.org',
    'https://weltnetz.tv',
    'https://off-guardian.org',
    'https://fair.org',
    'https://www.dasmili.eu',
    'https://www.presstv.com',
    'https://www.rt.com',
    'https://blog.fefe.de',
    'https://uebermedien.de',
    'https://taibbi.substack.com',
    'https://theduran.com',
    'https://www.theamericanconservative.com',
    'https://thebrexitparty.com',
    'https://www.thenews.com.pk',
    'https://timesofindia.indiatimes.com',
    'https://www.themoscowtimes.com',
    'https://sana.sy',
    'https://www.hurriyetdailynews.com',
    'https://www.kurdistan24.net',
    'https://www.bbc.com',
    'https://www.aljazeera.com',
    'https://edition.cnn.com',
    'https://www.msn.com',
    'https://www.telesurtv.net',
    'https://www.nytimes.com',
    'https://www.scmp.com',
    'https://www.mexiconews.net',
    'https://www.iheartradio.ca',
    'https://www.globalresearch.ca',
    'https://rpp.pe',
    'https://trome.pe',
    'https://7news.com.au',
    'https://www.tvnz.co.nz',
    'https://www.mufg.jp',
    'https://www.tsrc.com.tw',
    'https://www.javanews.al',
    'https://www.legalinfo.mn',
    'https://www.newtimes.co.rw',
    'https://www.rica.rw',
    'https://www.dlb.lk',
    'https://svgcc.vc',
    'https://macaudailytimes.com.mo',
    'https://alytausgidas.lt',
    'https://www.duwun.com.mm',
    'https://www.rfs.edu.ps',
    'https://www.pau.ac.pg',
    'https://www.daraz.com.np',
    'https://en.emprego.mmo.co.mz',
    'https://www.lexpress.mu',
    'https://pixelmon.ms',
    'https://mailchi.mp',
    'https://www.annonces.nc',
    'https://www.nouvata.nc',
    'https://rabot.nc',
    'https://www.biz.nf',
    'https://www.phanunpich.webs.nf',
    'https://dailypost.ng',
    'https://www.legit.ng',
    'https://www.reaper.fm',
    'https://handzonradio.fm',
    'https://gradecalculator.mes.fm',
    'https://www.spl.yt',
];

$pre= "<html>\n<head>\n</head>\n<body>\n<table border='1'>\n";
$post= "</table>\n</body>\n</html>\n";
$fmt= "<tr><td>%s</td><td align='center'>%s</td><td>%s</td></tr>\n";

$flag= 0;
foreach ($urls as $u) {
    $f= preg_replace('![^a-z0-9]!i', '_', $u);
    if (file_exists($f))
        continue;
    printf("Downloading '%s'... ", $u);
    $flag++;
    $h= file_get_contents($u);
    printf("got %d bytes.\n", strlen($h));
    file_put_contents($f, $h);
}
if ($flag)
    die(sprintf("Init done, %d URLs. Now point your broser to this script.\n", count($urls)));

echo $pre;
foreach ($urls as $u) {
    $f= preg_replace('![^a-z0-9]!i', '_', $u);
    if (!file_exists($f))
        die("where's da focin file?\n");
    $html= file_get_contents($f);
    $ue= parse_url($u);
    $iurl= parseShortcutIcon($html);
    $itxt= '';
    if (empty($iurl)) {
        $ue['path']= '/favicon.ico';
        $itxt= 'GUESS: ';
    } else {
        $ie = parse_url($iurl);
        if (!empty($ie['host']))
            $ue['host']= $ie['host'];
        if (!empty($ie['scheme']))
            $ue['scheme']= $ie['scheme'];
        if (!empty($ie['path']))
            $ue['path']= preg_replace('!^/+!', '/', '/'.$ie['path']);
    }
    $iurl= $ue['scheme'] . '://' . $ue['host'] . $ue['path'];
    printf($fmt, $u, '<img src="'.$iurl.'" style="max-width:256px; max-height:256px"></img>', $itxt.$iurl);
}
echo $post;

################################################################################

function parseShortcutIcon($html) {
    $end= strripos($html, '</head>');
    if ($end > 1)
        $html= substr($html, 0, $end);

    if (preg_match_all('!<link\b[^>]*\brel=("|\')(apple-touch-icon|shortcut icon|icon)\1[^>]*>!iU', $html, $links, PREG_SET_ORDER) < 1)
        return null;

    $icons= [];
    foreach ($links as $link)
        if (preg_match('!\shref=("|\')(.+)\1!iU', $link[0], $href))
            $icons[]= [
                $href[2],
                preg_match('!\ssizes=("|\')([0-9\.]+).*\1!i', $link[0], $sizes)
                    ? $sizes[2]
                    : ( strtolower($link[2])=='apple-touch-icon' ? 64 : 0 ),
            ];
    if (!count($icons))
        return null;

    usort($icons, function ($a, $b) {
        return $a[1]==$b[1]
            ? 0
            : ( $a[1]<$b[1] ? 1 : -1 );
    });

    return $icons[0][0];
}
Themroc commented 5 years ago

Looks like Guzzle has some support for reporting back the redirect-url. Can't find any docu, but what else should RedirectMiddleware\modifyRequest() and Psr7\modify_request() be good for?

Themroc commented 5 years ago

Yes, it's easy. In \helpers\WebClient add something like

    public static function reqRedir($url, $agentInfo = null) {
        $http = self::getHttpClient();
        $eurl = $url;
        $opt = [
            'on_stats' => function (TransferStats $stats) use (&$eurl) {
                $eurl = $stats->getEffectiveUri();
            },
        ];
        $response = $http->get($url, $opt);

        $data = (string) $response->getBody();

        if ($response->getStatusCode() !== 200) {
            throw new Exception(substr($data, 0, 512));
        }

        return [ $data, $eurl ];
    }

Or expand the current request(). The $agentInfo is ignored anyway, so we can safely abuse the second parameter:

    public static function request($url, $opts = null) {
        $http = self::getHttpClient();
        $agent = null;
        $redir = 0;
        $eurl = $url;
        $opt = [];
        if (is_array($opts)) {
            if (@$opts['getRedirUrl']) {
                $opt['on_stats'] = function (TransferStats $stats) use (&$eurl) { $eurl = $stats->getEffectiveUri(); };
                $redir = 1;
            }
#TODO:      $agent = @$opts['agentInfo'];
        }
#        else if (!empty($opts))
#            $agent = $opts;

        $response = $http->get($url, $opt);

        $data = (string) $response->getBody();

        if ($response->getStatusCode() !== 200) {
            throw new Exception(substr($data, 0, 512));
        }

        return $redir ? [ $data, $eurl ] : $data;
    }

I find both ways equally ugly, but what can u do...

Themroc commented 5 years ago

SVG shouldn't be too hard to implement: https://stackoverflow.com/questions/10289686/rendering-an-svg-file-to-a-png-or-jpeg-in-php#10289762 But only If ImageMagick is available.

jtojnar commented 5 years ago

If I recall correctly, the getEffectiveUrl() is from old version Guzzle.

We handle the redirect history in the SimplePie Guzzle shim:

https://github.com/SSilence/selfoss/blob/158c90dbedacc48d1f69f3a9fc23c23a49d23325/helpers/SimplePieFileGuzzle.php#L32-L40

We probably want to copy it elsewhere still.

Themroc commented 5 years ago

Boy do i hate php! Simple testcase:

<?php

parseShortcutIcons(file_get_contents('scmp.com.html'));

function parseShortcutIcons($html) {
        printf("%7d bytes html original.\n", strlen($html));
        $html= preg_replace('!<script\b[^>]*>.*</script>!siU', '', $html);
        printf("%7d bytes after removing scripts.\n", strlen($html));
}

Works perfectly when file 'scmp.com.html' contains testdata like

bla bla bla

. But with the real homepage of the South China Morning Post, it outputs:

2049259 bytes html original. 0 bytes after removing scripts.

Tested with PHP 5.6.30-0+deb8u1 and PHP 7.0.33-0+deb9u3 on two different machines plus a hoster running PHP 5.2.12.

Same thing in perl looks like

use File::Slurp;

parseShortcutIcons(read_file('scmp.com.html'));

sub parseShortcutIcons {
        my ($html)= @_;

        printf("%7d bytes html original.\n", length($html));
        $html=~ s!<script\b[^>]*?>.*?</script>!!sig;
        printf("%7d bytes after removing scripts.\n", length($html));
}

It's the same regex, only perl has no "U"-modifier, one needs to append an "?" to make the ".*" ungreedy. And it needs a "g" to apply the regex globally, whereas preg_replace() always works globally. Output:

2049259 bytes html original. 331052 bytes after removing scripts.

Thus, a really functioning equivalent of perl's $html=~ s!<script\b[^>]*?>.*?</script>!!sig; in php has to look something like

    while (false !== $p1= stripos($html, '<script')) {
        $block= substr($html, $p1);
        $html= substr($html, 0, $p1);
        if (false !== $p2= stripos($block, '</script>'))
            $html.= substr($block, $p2+strlen('</script>'));
    }
Themroc commented 5 years ago

Had found it here: https://stackoverflow.com/questions/30682307/how-to-read-the-response-effective-url-in-guzzle-6-0/35443523#35443523. But the method from SimplePieFileGuzzle.php looks better, gonna try that one.

Themroc commented 5 years ago

Nope :( $response->getHeader(\GuzzleHttp\RedirectMiddleware::HISTORY_HEADER) gives me an empty array. And i guess we have 6.3, as composer.json says

"guzzlehttp/guzzle": "^6.3",

Themroc commented 5 years ago

But what i have now in helpers/WebClient.php doesn't look too bad, no?

    /**
     * get the final URL after a redirect took place (same as original URL if none happend)
     *
     * @return string URL
     */
    public static function getEffectiveUrl() {
        return self::$effectiveUrl;
    }

    /**
     * Retrieve content from url
     *
     * @param string $url
     * @param ?string $agentInfo Extra user agent info to use in the request
     *
     * @throws GuzzleHttp\Exception\RequestException When an error is encountered
     * @throws Exception Unless 200 0K response is received
     *
     * @return string request data
     */
    public static function request($url, $agentInfo = null) {
        $http = self::getHttpClient();
        $eurl = $url;
        $opts = [
            'on_stats' => function (TransferStats $stats) use (&$eurl) { $eurl = $stats->getEffectiveUri(); },
        ];

        $response = $http->get($url, $opts);
        $data = (string) $response->getBody();
        self::$effectiveUrl = $eurl;

        if ($response->getStatusCode() !== 200) {
            throw new Exception(substr($data, 0, 512));
        }

        return $data;
    }
deathbybandaid commented 4 years ago

I'm having trouble understanding the php code involved here, as well as what the end-goal is. (just not my kind of language)

I believe I've been troubleshooting Huginn for hours, for an issue that is selfoss, which I believe is related to the issue in this thread.

I can confirm that the current release of selfoss appears to completely ignore the icon I'm setting in my rss feed, and uses the favicon for the domain that I host Huginn.

My current project is a custom rss feed for youtube channels that (among other things) contain the channel thumbnail instead of the youtube favicon.

<atom:icon>https://yt3.ggpht.com/a/AATXAJzGUJdH8PJ5d34Uk6CYZmAMWtam2Cpk6tU_Qw=s512-c-k-c0xffffffff-no-rj-mo</atom:icon>

For testing purposes, I have LXC setup for ttrss, commafeed, and freshrss. (all of which I've tried in the past)

freshrss and ttrss grabbed the icon from link, and funnily enough (since I run multiple rss things on the same subdomain) commafeed pulled the selfoss icon for some reason.

So realistically, the other readers that are semi-comparable don't seem to use this parameter.

jtojnar commented 4 years ago

On master, the feed logo is now preferred over favicons. Thanks Themroc for reporting and the pull request.

Though some sites (e.g. https://rss.golem.de/rss.php?tp=av&feed=RSS2.0) use rectangular site logos (https://www.golem.de/staticrl/images/golem-rss.png) so we should probably only use the logo if it is close to square.