fossar / selfoss

multipurpose rss reader, live stream, mashup, aggregation web application
https://selfoss.aditu.de
GNU General Public License v3.0
2.38k stars 345 forks source link

Duplicate images #1230

Open davidoskky opened 4 years ago

davidoskky commented 4 years ago

I use the fulltext spout and in some feeds every image on the page is shown twice. This feed for reference: https://uxmovement.com/feed/

niol commented 4 years ago

I have the same and need to look into it.

jtojnar commented 4 years ago

Hmm, weird. I cannot reproduce this on master with the provided feed and fulltextrss spout. What version of selfoss are you running?

davidoskky commented 4 years ago

I am sorry, I wasn't on the latest release but on 16f5d68. I upgraded to c87e14a, removed that feed and added it back again to make sure it wasn't cached and the problem persists. Is there any test I might do that might help you determine the problem?

jtojnar commented 4 years ago

There were only translation and CI updates in the meanwhile so that should not have any effect.

Could you set logger_level=DEBUG in config.ini and possibly change zero to two in https://github.com/fossar/selfoss/blob/c87e14aaba42abe281ca3c859a0006fb292ff22c/src/common.php#L28 and then share the log when updating the single source?

davidoskky commented 4 years ago

Here it is: https://pastebin.com/7wubSz0V

jtojnar commented 4 years ago

This is what is on the web page:

<p><img data-lazyloaded=\"1\" src=\"data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSI1NzIiIGhlaWdodD0iNDA4IiB2aWV3Qm94PSIwIDAgNTcyIDQwOCI+PHJlY3Qgd2lkdGg9IjEwMCUiIGhlaWdodD0iMTAwJSIgZmlsbD0iI2NmZDRkYiIvPjwvc3ZnPg==\" class=\"aligncenter size-full wp-image-32075\" data-src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /><noscript><img class=\"aligncenter size-full wp-image-32075\" src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /></noscript></p>

This is what I see in the database for the Strong layout hierarchy article:

<p><img src="https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png" alt="" width="572" height="408" /></p>

Looks like for some reason filtering is not triggered for you. Could you please look into the database and see what it gets turned into?

niol commented 4 years ago

In my case the image is twice in the content field in the database. In the source site, the image is also present twice, but the second time it is wrapped in a noscript tag. This is the same as the Strong layout hierarchy article. The noscript tag is gone in the database probably because of sanitizeContent(). Whitelisting noscript should be the simple fix for this.

jtojnar commented 4 years ago

I would expect sanitizeContent to remove noscript completely, even with its content like it seems to do for me. It is super weird that there is a difference in behaviour, even though we presumably have the same set of composer libraries. I have PHP 7.4.3 with the following modules:

$ php -m
[PHP Modules]
bcmath
calendar
Core
ctype
curl
date
dom
exif
FFI
fileinfo
filter
ftp
gd
gettext
hash
iconv
intl
json
libxml
mbstring
mysqli
mysqlnd
openssl
pcntl
pcre
PDO
pdo_mysql
pdo_pgsql
pdo_sqlite
pgsql
Phar
posix
readline
Reflection
session
shmop
SimpleXML
soap
sockets
sodium
SPL
sqlite3
standard
sysvmsg
sysvsem
sysvshm
tidy
tokenizer
xml
xmlreader
xmlwriter
xsl
Zend OPcache
zip
zlib

[Zend Modules]
Zend OPcache
davidoskky commented 4 years ago

In my case that image is twice in the database but there is no noscript tag.

<p><img src="https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png" alt="" width="572" height="408" /></p>
<p><img src="https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png" alt="" width="572" height="408" /></p>
davidoskky commented 4 years ago

I have php 7.2.24 and I do not have FFI, intl, pdo_pgsql, pgsql and soap modules.

jtojnar commented 4 years ago

By running the following PHP script in the selfoss directory, I determined that the noscript elements are still present in Graby’s DOM after site config stripping phase but not in Body after cleanupHtml, before cleanupXss.

<?php

require __DIR__ . '/src/common.php';

// print logs just to terminal
$handler = new ErrorLogHandler(ErrorLogHandler::OPERATING_SYSTEM, 'debug');
$handler->setFormatter($formatter);
$log->popHandler();
$log->pushHandler($handler);

$spout = $dice->create(spouts\rss\fulltextrss::class);

$params = [
  'url' => 'https://uxmovement.com/feed/',
];
$spout->load($params);
foreach ($spout as $item) {
  $item->getContent();
  break;
}
davidoskky commented 4 years ago

I ran the script, but the noscript elements are still present even in the phase Body after cleanupHtml, before cleanupXss. Full log: https://pastebin.com/fmUYkic7

jtojnar commented 4 years ago

After playing a bit with cleanupHtml method, trying to reproduce it here, it does not seem to come from there.

<?php
require __DIR__ . '/src/common.php';
$doc = new DOMDocument();
$doc->loadHTML("<p><img data-lazyloaded=\"1\" src=\"data:image/svg+xml;base64,PHN2ZyB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciIHdpZHRoPSI1NzIiIGhlaWdodD0iNDA4IiB2aWV3Qm94PSIwIDAgNTcyIDQwOCI+PHJlY3Qgd2lkdGg9IjEwMCUiIGhlaWdodD0iMTAwJSIgZmlsbD0iI2NmZDRkYiIvPjwvc3ZnPg==\" class=\"aligncenter size-full wp-image-32075\" data-src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /><noscript><img class=\"aligncenter size-full wp-image-32075\" src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /></noscript></p>");
$contentBlock = $doc->documentElement->getElementsByTagName('p')->item(0);
$readability = $dice->create(Readability\Readability::class);
$readability->clean($contentBlock, 'select');
$contentBlock->normalize();
$html = $contentBlock->ownerDocument->saveXML($contentBlock);
echo $html;

So it must come from doFetchContent.

jtojnar commented 4 years ago

Right, the bad code is last seen at https://github.com/j0k3r/graby/blob/2.0.0-alpha.0/src/Extractor/ContentExtractor.php#L374 called by https://github.com/j0k3r/graby/blob/2.0.0-alpha.0/src/Graby.php#L325. And in fact 300 lines lower in the process monster method, there is a code that should remove the noscript elements:

https://github.com/j0k3r/graby/blob/2.0.0-alpha.0/src/Extractor/ContentExtractor.php#L618-L630

jtojnar commented 4 years ago

At this point I would start sprinkling var_dump('something'); calls around the code in vendor/j0k3r/graby/src/Extractor/ContentExtractor.php to narrow down why is not the noscript code run when running the following:

<?php

require __DIR__ . '/src/common.php';

use Graby\Graby;
use Http\Adapter\Guzzle6\Client as GuzzleAdapter;
use Monolog\Handler\ErrorLogHandler;

// print logs just to terminal
$handler = new ErrorLogHandler(ErrorLogHandler::OPERATING_SYSTEM, 'debug');
$handler->setFormatter($formatter);
$log->popHandler();
$log->pushHandler($handler);

$webClient = $dice->create(helpers\WebClient::class);
$url = "https://uxmovement.com/content/strong-layout-hierarchy-reduces-content-overload/";
$graby = new Graby([
    'extractor' => [
        'config_builder' => [
            'site_config' => [\F3::get('ftrss_custom_data_dir')],
        ],
    ],
], new GuzzleAdapter($webClient->getHttpClient()));
// $graby->setLogger($log);
echo $graby->fetchContent($url)['html'];
jtojnar commented 4 years ago

Reading your log more closely, DOM after site config stripping is still unprocessed but Body after cleanupHtml, before cleanupXss actually turns the lazy-loaded image into a real one, it is just that the image also becomes wrapped in p element for you between those two code locations so the noscript cleaner cannot remove it.

jtojnar commented 4 years ago

Actually, the wrapped images first appear in Body after Readability.

jtojnar commented 4 years ago

Unless there is difference in tidy settings, it must be in readability:

--- a/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
+++ b/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
@@ -198,7 +198,7 @@
             unset($count);
         }

-        $this->logger->debug('HTML after site config strings replacements', ['html' => $html]);
+        $this->logger->debug('HTML after site config strings replacements', ['html' => $html]); // here the images are unwrapped for you still

         // load and parse html
         $parser = $this->siteConfig->parser();
@@ -208,13 +208,16 @@
             $parser = $this->config['default_parser'];
         }

-        $this->logger->info('Attempting to parse HTML with {parser}', ['parser' => $parser]);
+        $this->logger->info('Attempting to parse HTML with {parser}', ['parser' => $parser]); // here you have same libxml as me

         $this->readability = $this->getReadability($html, $url, $parser, $this->siteConfig->tidy() && $smartTidy);
         $tidied = $this->readability->tidied;
+        var_dump(['sc-tidy' => $this->siteConfig->tidy()]); // true for me
+        var_dump(['smarttidy' => $smartTidy]); // true for me
+        var_dump(['tidied' => $tidied]); // false for me

         $this->logger->info('Body size after Readability: {length}', ['length' => \strlen($this->readability->dom->saveXML())]);
-        $this->logger->debug('Body after Readability', ['dom_saveXML' => $this->readability->dom->saveXML()]);
+        $this->logger->debug('Body after Readability', ['dom_saveXML' => $this->readability->dom->saveXML()]); // here the images are already wrapped for you!!!

         // we use xpath to find elements in the given HTML document
         $this->xpath = new \DOMXPath($this->readability->dom);
@@ -313,6 +316,7 @@
         }

         // strip elements (using xpath expressions)
+        var_dump(['strip' => $this->siteConfig->strip]);
         foreach ($this->siteConfig->strip as $pattern) {
             $this->logger->info('Trying {pattern} to strip element', ['pattern' => $pattern]);
             $elems = $this->xpath->query($pattern, $this->readability->dom);
@@ -342,6 +346,7 @@
         }

         // strip images (using src attribute values)
+        var_dump(['strip_image_src' => $this->siteConfig->strip_image_src]);
         foreach ($this->siteConfig->strip_image_src as $string) {
             $string = strtr($string, ["'" => '', '"' => '']);

@@ -559,7 +564,7 @@
         }

         if ($detectBody && $success) {
-            $this->logger->info('Detecting body');
+            $this->logger->info('Detecting body'); // this gets printed in your logs
             $this->body = $this->readability->getContent();

             if (1 === $this->body->childNodes->length && XML_ELEMENT_NODE === $this->body->firstChild->nodeType) {
@@ -573,6 +578,7 @@
             }
         }

+        // var_dump(['bodySet' => $this->body]);
         if (isset($this->body)) {
             // remove any h1-h6 elements that appear as first thing in the body
             // and which match our title
@@ -603,6 +609,7 @@
             }

             // remove image lazy loading
+            var_dump(['lazy_load_attributes' => $this->config['src_lazy_load_attributes']]);
             foreach ($this->body->getElementsByTagName('img') as $e) {
                 $hasAttribute = false;
                 foreach ($this->config['src_lazy_load_attributes'] as $attribute) {
@@ -611,6 +618,7 @@
                     }
                 }

+                var_dump(['hasAttribute' => $hasAttribute]);
                 if (false === $hasAttribute) {
                     continue;
                 }
@@ -620,6 +628,7 @@
                 // inside the data-lazy-src attribute. It also places the original image inside a noscript element
                 // next to the amended one.
                 // @see https://plugins.trac.wordpress.org/browser/lazy-load/trunk/lazy-load.php
+                var_dump(['nextSibling' => $e->nextSibling !== null ? $e->nextSibling->nodeName : 'null']);
                 if (null !== $e->nextSibling && 'noscript' === $e->nextSibling->nodeName) {
                     $newElem = $e->ownerDocument->createDocumentFragment();
                     $newElem->appendXML($e->nextSibling->innerHTML);
@@ -660,7 +669,7 @@
             );
         }

-        $this->logger->info('Success ? {is_success}', ['is_success' => $this->success]);
+        $this->logger->info('Success ? {is_success}', ['is_success' => $this->success]); // This gets prited as well.

         return $this->success;
     }
jtojnar commented 4 years ago

This code is particularly suspicious:

https://github.com/j0k3r/php-readability/blob/c6425cc28bba80967f13d8a87b97386838500c5a/src/Readability.php#L997-L1011

davidoskky commented 4 years ago

Here is the output after running that script: https://pastebin.com/zKj9VVbb I do not get anything in the logs though, should I also give you the logs after refreshing the source? In my case tidied is true.

jtojnar commented 4 years ago

I commented out the $graby->setLogger($log); line in the code above so that the output does not get lost in the noise.

It would be useful to uncomment it and try to enable logging for readability itself:

--- a/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
+++ b/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
@@ -1008,7 +1008,7 @@
      */
     private function getReadability($html, $url, $parser, $enableTidy)
     {
-        $readability = new Readability($html, $url, $parser, $enableTidy);
+        $readability = new Readability($html, $url, $parser, $enableTidy, $this->logger);

         if (isset($this->config['readability']['pre_filters']) && \is_array($this->config['readability']['pre_filters'])) {
             foreach ($this->config['readability']['pre_filters'] as $filter => $replacer) {
--- a/vendor/j0k3r/php-readability/src/Readability.php
+++ b/vendor/j0k3r/php-readability/src/Readability.php
@@ -177,14 +177,14 @@
      * @param string $parser   Which parser to use for turning raw HTML into a DOMDocument
      * @param bool   $use_tidy Use tidy
      */
-    public function __construct($html, $url = null, $parser = 'libxml', $use_tidy = true)
+    public function __construct($html, $url = null, $parser = 'libxml', $use_tidy = true, $logger = null)
     {
         $this->url = $url;
         $this->html = $html;
         $this->parser = $parser;
         $this->useTidy = $use_tidy && \function_exists('tidy_parse_string');

-        $this->logger = new NullLogger();
+        $this->logger = $logger->withName('php-readability') ?: new NullLogger();
         $this->loadHtml();
     }

For me there are not really many php-readability logs before selfoss.DEBUG: Body after Readability:

[2020-11-12 19:05:02] php-readability.DEBUG: Parsing URL: https://uxmovement.com/content/strong-layout-hierarchy-reduces-content-overload/  

[2020-11-12 19:05:02] php-readability.DEBUG: Tidying document  
jtojnar commented 4 years ago

Also turns out the tidied === true is not it either, I installed php-tidy in my work directory and it makes no difference (I had it on the web server before anyway).

davidoskky commented 4 years ago

Same in my case, not many readability logs before that one: https://pastebin.com/htYqhYhC

jtojnar commented 4 years ago

Looks like https://uxmovement.com/wp-content/uploads/2020/02/wireframe-tout.png is actually not wrapped in p but https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png is – the latter is not followed by </p> in HTML after site config strings replacements but is followed by it in Body after Readability.

We are running out of options, must be one of these options:

--- a/vendor/j0k3r/php-readability/src/Readability.php
+++ b/vendor/j0k3r/php-readability/src/Readability.php
@@ -1391,6 +1391,7 @@
         mb_regex_encoding('UTF-8');

         // HACK: dirty cleanup to replace some stuff; shouldn't use regexps with HTML but well...
+        var_dump(['pre_filters' => $this->pre_filters]);
         if (!$this->flagIsActive(self::FLAG_DISABLE_PREFILTER)) {
             foreach ($this->pre_filters as $search => $replace) {
                 $this->html = preg_replace($search, $replace, $this->html);
@@ -1410,7 +1411,10 @@
         if ($this->useTidy) {
             $this->logger->debug('Tidying document');

+            var_dump(['tidy_config' => $this->tidy_config]);
+            var_dump(['pre-tidy' => $this->html]);
             $tidy = tidy_repair_string($this->html, $this->tidy_config, 'UTF8');
+            var_dump(['post-tidy' => $tidy]);
             if (false !== $tidy && $this->html !== $tidy) {
                 $this->tidied = true;
                 $this->html = $tidy;
@@ -1431,6 +1435,7 @@
             $this->dom = new \DOMDocument();
             $this->dom->preserveWhiteSpace = false;
             $this->dom->loadHTML($this->html, LIBXML_NOBLANKS | LIBXML_COMPACT | LIBXML_NOERROR);
+            var_dump(['post-load' => $this->dom->saveHTML()]);

             libxml_use_internal_errors(false);
         }

I would bet on the but it is weird that it would behave differently for us. Even though it is consistent for me on Nix with tidy 5.6.0 and Ubuntu. Ubuntu has that same version.

davidoskky commented 4 years ago

Here is the log https://pastebin.com/EKcxR5sB The problem does in fact appear to happen after tidy. I have installed the package php7.2-tidy from Ubuntu repository.

jtojnar commented 4 years ago

Looking at your config it is actually only the post-load dump that adds the </p> (look at the first instance of https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png in each text).

In fact diffing our dumps for these three steps show that only loading is different – tidy works the same for us.

jtojnar commented 4 years ago

Looking at the post-tidy HTML in validator, it seems that your libxml is probably behaving correctly.

<noscript> is supposed to have transparent content model, that is, only elements that would be allowed if the <noscript> element was not there are allowed. Since the <noscript> is inside <p>, it cannot contain <p> so when it occurs in there, the previous </p> is closed (which must be before <noscript> to avoid opening and closing tag crossing).

My parser seems to be unaware that paragraphs cannot contain paragraphs so it parses as if it were well-formed XML document.

Could you try to echo LIBXML_DOTTED_VERSION; in the script? I have 2.9.10.

davidoskky commented 4 years ago

I get 2.9.4

jtojnar commented 4 years ago

So we actually have two bugs:

jtojnar commented 4 years ago

Seems to be something specific to our config, I can reproduce the tidy bug with:

<?php
$tidy_config = [
    'tidy-mark' => false,
    'vertical-space' => false,
    'doctype' => 'omit',
    'numeric-entities' => false,
    // 'preserve-entities' => true,
    'break-before-br' => false,
    'clean' => true,
    'output-xhtml' => true,
    'logical-emphasis' => true,
    'show-body-only' => false,
    'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog figcaption figure footer header hgroup main menu menuitem nav section source summary template track video',
    'new-empty-tags' => 'command embed keygen source track wbr',
    'new-inline-tags' => 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr',
    'wrap' => 0,
    'drop-empty-paras' => true,
    'drop-proprietary-attributes' => false,
    'enclose-text' => true,
    'enclose-block-text' => true,
    'merge-divs' => true,
    // 'merge-spans' => true,
    'input-encoding' => '????',
    'output-encoding' => 'utf8',
    'hide-comments' => true,
];
$html = '<p><img src="foo.png" alt="" /><noscript><img src="foo.png" alt="" /></noscript></p>';
var_dump(tidy_repair_string($html, $tidy_config, 'UTF8'));

If I use empty config, it does not add the incorrect paragraph.

Edit: Looks like it is the enclose-block-text.

davidoskky commented 4 years ago

Setting enclose-block-text to false does make a difference: the noscript is not enclosed in p anymore. However it is still present and the image still gets displayed twice. log: https://pastebin.com/bA1rtNEU

jtojnar commented 4 years ago

That looks like your libxml also does not want <noscript> inside a paragraph so it closes it prematurely. This behaviour is not correct according to my reading of https://developer.mozilla.org/en-US/docs/Web/HTML/Element/noscript.

Geez, one would expect these ancient libraries to be without bugs.

jtojnar commented 4 years ago

Tried the tidy repro with tidy from the git repo and it still shown the same bug:

--- a/flake.nix
+++ b/flake.nix
@@ -15,12 +15,27 @@
   outputs = { self, flake-compat, nixpkgs, utils }:
     utils.lib.eachDefaultSystem (system:
       let
-        pkgs = nixpkgs.legacyPackages.${system};
+        pkgs = import nixpkgs.outPath {
+          inherit system;
+          overlays = [
+            (final: prev: {
+              html-tidy = prev.html-tidy.overrideAttrs (attrs: {
+                src = prev.fetchFromGitHub {
+                  owner = "htacg";
+                  repo = "tidy-html5";
+                  rev = "188988022da4a64d80bc4a2eba21c33d57eb5152";
+                  sha256 = "GRs5qF7OBv0CX/b8kLnnQsENZDjBRAn0uXJvWfJDQkg=";
+                };
+              });
+            })
+          ];
+        };
       in {
         devShell =
           let
             php = pkgs.php.withExtensions ({ enabled, all }: with all; enabled ++ [
               imagick
+              tidy
             ]);
           in
             pkgs.mkShell {

Opened a bug against tidy: https://github.com/htacg/tidy-html5/issues/905

jtojnar commented 4 years ago

At this point I see these possible solutions: