Open davidoskky opened 4 years ago
I have the same and need to look into it.
Hmm, weird. I cannot reproduce this on master with the provided feed and fulltextrss spout. What version of selfoss are you running?
I am sorry, I wasn't on the latest release but on 16f5d68. I upgraded to c87e14a, removed that feed and added it back again to make sure it wasn't cached and the problem persists. Is there any test I might do that might help you determine the problem?
There were only translation and CI updates in the meanwhile so that should not have any effect.
Could you set logger_level=DEBUG
in config.ini
and possibly change zero to two in https://github.com/fossar/selfoss/blob/c87e14aaba42abe281ca3c859a0006fb292ff22c/src/common.php#L28 and then share the log when updating the single source?
Here it is: https://pastebin.com/7wubSz0V
This is what is on the web page:
<p><img data-lazyloaded=\"1\" src=\"\" class=\"aligncenter size-full wp-image-32075\" data-src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /><noscript><img class=\"aligncenter size-full wp-image-32075\" src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /></noscript></p>
This is what I see in the database for the Strong layout hierarchy article:
<p><img src="https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png" alt="" width="572" height="408" /></p>
Looks like for some reason filtering is not triggered for you. Could you please look into the database and see what it gets turned into?
In my case the image is twice in the content
field in the database.
In the source site, the image is also present twice, but the second time it is wrapped in a noscript
tag.
This is the same as the Strong layout hierarchy article.
The noscript
tag is gone in the database probably because of sanitizeContent()
. Whitelisting noscript
should be the simple fix for this.
I would expect sanitizeContent
to remove noscript
completely, even with its content like it seems to do for me. It is super weird that there is a difference in behaviour, even though we presumably have the same set of composer libraries. I have PHP 7.4.3 with the following modules:
$ php -m
[PHP Modules]
bcmath
calendar
Core
ctype
curl
date
dom
exif
FFI
fileinfo
filter
ftp
gd
gettext
hash
iconv
intl
json
libxml
mbstring
mysqli
mysqlnd
openssl
pcntl
pcre
PDO
pdo_mysql
pdo_pgsql
pdo_sqlite
pgsql
Phar
posix
readline
Reflection
session
shmop
SimpleXML
soap
sockets
sodium
SPL
sqlite3
standard
sysvmsg
sysvsem
sysvshm
tidy
tokenizer
xml
xmlreader
xmlwriter
xsl
Zend OPcache
zip
zlib
[Zend Modules]
Zend OPcache
In my case that image is twice in the database but there is no noscript tag.
<p><img src="https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png" alt="" width="572" height="408" /></p>
<p><img src="https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png" alt="" width="572" height="408" /></p>
I have php 7.2.24 and I do not have FFI, intl, pdo_pgsql, pgsql and soap modules.
By running the following PHP script in the selfoss directory, I determined that the noscript
elements are still present in Graby’s DOM after site config stripping
phase but not in Body after cleanupHtml, before cleanupXss
.
<?php
require __DIR__ . '/src/common.php';
// print logs just to terminal
$handler = new ErrorLogHandler(ErrorLogHandler::OPERATING_SYSTEM, 'debug');
$handler->setFormatter($formatter);
$log->popHandler();
$log->pushHandler($handler);
$spout = $dice->create(spouts\rss\fulltextrss::class);
$params = [
'url' => 'https://uxmovement.com/feed/',
];
$spout->load($params);
foreach ($spout as $item) {
$item->getContent();
break;
}
I ran the script, but the noscript elements are still present even in the phase Body after cleanupHtml, before cleanupXss
.
Full log: https://pastebin.com/fmUYkic7
After playing a bit with cleanupHtml
method, trying to reproduce it here, it does not seem to come from there.
<?php
require __DIR__ . '/src/common.php';
$doc = new DOMDocument();
$doc->loadHTML("<p><img data-lazyloaded=\"1\" src=\"\" class=\"aligncenter size-full wp-image-32075\" data-src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /><noscript><img class=\"aligncenter size-full wp-image-32075\" src=\"https://uxmovement.com/wp-content/uploads/2020/11/layout-beforeafter.png\" alt=\"\" width=\"572\" height=\"408\" /></noscript></p>");
$contentBlock = $doc->documentElement->getElementsByTagName('p')->item(0);
$readability = $dice->create(Readability\Readability::class);
$readability->clean($contentBlock, 'select');
$contentBlock->normalize();
$html = $contentBlock->ownerDocument->saveXML($contentBlock);
echo $html;
So it must come from doFetchContent
.
Right, the bad code is last seen at https://github.com/j0k3r/graby/blob/2.0.0-alpha.0/src/Extractor/ContentExtractor.php#L374 called by https://github.com/j0k3r/graby/blob/2.0.0-alpha.0/src/Graby.php#L325. And in fact 300 lines lower in the process
monster method, there is a code that should remove the noscript
elements:
https://github.com/j0k3r/graby/blob/2.0.0-alpha.0/src/Extractor/ContentExtractor.php#L618-L630
At this point I would start sprinkling var_dump('something');
calls around the code in vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
to narrow down why is not the noscript
code run when running the following:
<?php
require __DIR__ . '/src/common.php';
use Graby\Graby;
use Http\Adapter\Guzzle6\Client as GuzzleAdapter;
use Monolog\Handler\ErrorLogHandler;
// print logs just to terminal
$handler = new ErrorLogHandler(ErrorLogHandler::OPERATING_SYSTEM, 'debug');
$handler->setFormatter($formatter);
$log->popHandler();
$log->pushHandler($handler);
$webClient = $dice->create(helpers\WebClient::class);
$url = "https://uxmovement.com/content/strong-layout-hierarchy-reduces-content-overload/";
$graby = new Graby([
'extractor' => [
'config_builder' => [
'site_config' => [\F3::get('ftrss_custom_data_dir')],
],
],
], new GuzzleAdapter($webClient->getHttpClient()));
// $graby->setLogger($log);
echo $graby->fetchContent($url)['html'];
Reading your log more closely, DOM after site config stripping
is still unprocessed but Body after cleanupHtml, before cleanupXss
actually turns the lazy-loaded image into a real one, it is just that the image also becomes wrapped in p
element for you between those two code locations so the noscript
cleaner cannot remove it.
Actually, the wrapped images first appear in Body after Readability
.
Unless there is difference in tidy
settings, it must be in readability:
--- a/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
+++ b/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
@@ -198,7 +198,7 @@
unset($count);
}
- $this->logger->debug('HTML after site config strings replacements', ['html' => $html]);
+ $this->logger->debug('HTML after site config strings replacements', ['html' => $html]); // here the images are unwrapped for you still
// load and parse html
$parser = $this->siteConfig->parser();
@@ -208,13 +208,16 @@
$parser = $this->config['default_parser'];
}
- $this->logger->info('Attempting to parse HTML with {parser}', ['parser' => $parser]);
+ $this->logger->info('Attempting to parse HTML with {parser}', ['parser' => $parser]); // here you have same libxml as me
$this->readability = $this->getReadability($html, $url, $parser, $this->siteConfig->tidy() && $smartTidy);
$tidied = $this->readability->tidied;
+ var_dump(['sc-tidy' => $this->siteConfig->tidy()]); // true for me
+ var_dump(['smarttidy' => $smartTidy]); // true for me
+ var_dump(['tidied' => $tidied]); // false for me
$this->logger->info('Body size after Readability: {length}', ['length' => \strlen($this->readability->dom->saveXML())]);
- $this->logger->debug('Body after Readability', ['dom_saveXML' => $this->readability->dom->saveXML()]);
+ $this->logger->debug('Body after Readability', ['dom_saveXML' => $this->readability->dom->saveXML()]); // here the images are already wrapped for you!!!
// we use xpath to find elements in the given HTML document
$this->xpath = new \DOMXPath($this->readability->dom);
@@ -313,6 +316,7 @@
}
// strip elements (using xpath expressions)
+ var_dump(['strip' => $this->siteConfig->strip]);
foreach ($this->siteConfig->strip as $pattern) {
$this->logger->info('Trying {pattern} to strip element', ['pattern' => $pattern]);
$elems = $this->xpath->query($pattern, $this->readability->dom);
@@ -342,6 +346,7 @@
}
// strip images (using src attribute values)
+ var_dump(['strip_image_src' => $this->siteConfig->strip_image_src]);
foreach ($this->siteConfig->strip_image_src as $string) {
$string = strtr($string, ["'" => '', '"' => '']);
@@ -559,7 +564,7 @@
}
if ($detectBody && $success) {
- $this->logger->info('Detecting body');
+ $this->logger->info('Detecting body'); // this gets printed in your logs
$this->body = $this->readability->getContent();
if (1 === $this->body->childNodes->length && XML_ELEMENT_NODE === $this->body->firstChild->nodeType) {
@@ -573,6 +578,7 @@
}
}
+ // var_dump(['bodySet' => $this->body]);
if (isset($this->body)) {
// remove any h1-h6 elements that appear as first thing in the body
// and which match our title
@@ -603,6 +609,7 @@
}
// remove image lazy loading
+ var_dump(['lazy_load_attributes' => $this->config['src_lazy_load_attributes']]);
foreach ($this->body->getElementsByTagName('img') as $e) {
$hasAttribute = false;
foreach ($this->config['src_lazy_load_attributes'] as $attribute) {
@@ -611,6 +618,7 @@
}
}
+ var_dump(['hasAttribute' => $hasAttribute]);
if (false === $hasAttribute) {
continue;
}
@@ -620,6 +628,7 @@
// inside the data-lazy-src attribute. It also places the original image inside a noscript element
// next to the amended one.
// @see https://plugins.trac.wordpress.org/browser/lazy-load/trunk/lazy-load.php
+ var_dump(['nextSibling' => $e->nextSibling !== null ? $e->nextSibling->nodeName : 'null']);
if (null !== $e->nextSibling && 'noscript' === $e->nextSibling->nodeName) {
$newElem = $e->ownerDocument->createDocumentFragment();
$newElem->appendXML($e->nextSibling->innerHTML);
@@ -660,7 +669,7 @@
);
}
- $this->logger->info('Success ? {is_success}', ['is_success' => $this->success]);
+ $this->logger->info('Success ? {is_success}', ['is_success' => $this->success]); // This gets prited as well.
return $this->success;
}
This code is particularly suspicious:
Here is the output after running that script: https://pastebin.com/zKj9VVbb I do not get anything in the logs though, should I also give you the logs after refreshing the source? In my case tidied is true.
I commented out the $graby->setLogger($log);
line in the code above so that the output does not get lost in the noise.
It would be useful to uncomment it and try to enable logging for readability itself:
--- a/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
+++ b/vendor/j0k3r/graby/src/Extractor/ContentExtractor.php
@@ -1008,7 +1008,7 @@
*/
private function getReadability($html, $url, $parser, $enableTidy)
{
- $readability = new Readability($html, $url, $parser, $enableTidy);
+ $readability = new Readability($html, $url, $parser, $enableTidy, $this->logger);
if (isset($this->config['readability']['pre_filters']) && \is_array($this->config['readability']['pre_filters'])) {
foreach ($this->config['readability']['pre_filters'] as $filter => $replacer) {
--- a/vendor/j0k3r/php-readability/src/Readability.php
+++ b/vendor/j0k3r/php-readability/src/Readability.php
@@ -177,14 +177,14 @@
* @param string $parser Which parser to use for turning raw HTML into a DOMDocument
* @param bool $use_tidy Use tidy
*/
- public function __construct($html, $url = null, $parser = 'libxml', $use_tidy = true)
+ public function __construct($html, $url = null, $parser = 'libxml', $use_tidy = true, $logger = null)
{
$this->url = $url;
$this->html = $html;
$this->parser = $parser;
$this->useTidy = $use_tidy && \function_exists('tidy_parse_string');
- $this->logger = new NullLogger();
+ $this->logger = $logger->withName('php-readability') ?: new NullLogger();
$this->loadHtml();
}
For me there are not really many php-readability logs before selfoss.DEBUG: Body after Readability
:
[2020-11-12 19:05:02] php-readability.DEBUG: Parsing URL: https://uxmovement.com/content/strong-layout-hierarchy-reduces-content-overload/
[2020-11-12 19:05:02] php-readability.DEBUG: Tidying document
Also turns out the tidied === true
is not it either, I installed php-tidy
in my work directory and it makes no difference (I had it on the web server before anyway).
Same in my case, not many readability logs before that one: https://pastebin.com/htYqhYhC
Looks like https://uxmovement.com/wp-content/uploads/2020/02/wireframe-tout.png is actually not wrapped in p
but https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png is – the latter is not followed by </p>
in HTML after site config strings replacements
but is followed by it in Body after Readability
.
We are running out of options, must be one of these options:
--- a/vendor/j0k3r/php-readability/src/Readability.php
+++ b/vendor/j0k3r/php-readability/src/Readability.php
@@ -1391,6 +1391,7 @@
mb_regex_encoding('UTF-8');
// HACK: dirty cleanup to replace some stuff; shouldn't use regexps with HTML but well...
+ var_dump(['pre_filters' => $this->pre_filters]);
if (!$this->flagIsActive(self::FLAG_DISABLE_PREFILTER)) {
foreach ($this->pre_filters as $search => $replace) {
$this->html = preg_replace($search, $replace, $this->html);
@@ -1410,7 +1411,10 @@
if ($this->useTidy) {
$this->logger->debug('Tidying document');
+ var_dump(['tidy_config' => $this->tidy_config]);
+ var_dump(['pre-tidy' => $this->html]);
$tidy = tidy_repair_string($this->html, $this->tidy_config, 'UTF8');
+ var_dump(['post-tidy' => $tidy]);
if (false !== $tidy && $this->html !== $tidy) {
$this->tidied = true;
$this->html = $tidy;
@@ -1431,6 +1435,7 @@
$this->dom = new \DOMDocument();
$this->dom->preserveWhiteSpace = false;
$this->dom->loadHTML($this->html, LIBXML_NOBLANKS | LIBXML_COMPACT | LIBXML_NOERROR);
+ var_dump(['post-load' => $this->dom->saveHTML()]);
libxml_use_internal_errors(false);
}
I would bet on the but it is weird that it would behave differently for us. Even though it is consistent for me on Nix with tidy 5.6.0 and Ubuntu. Ubuntu has that same version.
Here is the log https://pastebin.com/EKcxR5sB The problem does in fact appear to happen after tidy. I have installed the package php7.2-tidy from Ubuntu repository.
Looking at your config it is actually only the post-load
dump that adds the </p>
(look at the first instance of https://uxmovement.com/wp-content/uploads/2020/11/layout-scalebadge.png
in each text).
In fact diffing our dumps for these three steps show that only loading is different – tidy works the same for us.
Looking at the post-tidy
HTML in validator, it seems that your libxml is probably behaving correctly.
<noscript>
is supposed to have transparent content model, that is, only elements that would be allowed if the <noscript>
element was not there are allowed. Since the <noscript>
is inside <p>
, it cannot contain <p>
so when it occurs in there, the previous </p>
is closed (which must be before <noscript>
to avoid opening and closing tag crossing).
My parser seems to be unaware that paragraphs cannot contain paragraphs so it parses as if it were well-formed XML document.
Could you try to echo LIBXML_DOTTED_VERSION;
in the script? I have 2.9.10.
I get 2.9.4
So we actually have two bugs:
p > noscript > img
producing incorrect p > noscript > p > img
. We both have this bug.Seems to be something specific to our config, I can reproduce the tidy bug with:
<?php
$tidy_config = [
'tidy-mark' => false,
'vertical-space' => false,
'doctype' => 'omit',
'numeric-entities' => false,
// 'preserve-entities' => true,
'break-before-br' => false,
'clean' => true,
'output-xhtml' => true,
'logical-emphasis' => true,
'show-body-only' => false,
'new-blocklevel-tags' => 'article aside audio bdi canvas details dialog figcaption figure footer header hgroup main menu menuitem nav section source summary template track video',
'new-empty-tags' => 'command embed keygen source track wbr',
'new-inline-tags' => 'audio command datalist embed keygen mark menuitem meter output progress source time video wbr',
'wrap' => 0,
'drop-empty-paras' => true,
'drop-proprietary-attributes' => false,
'enclose-text' => true,
'enclose-block-text' => true,
'merge-divs' => true,
// 'merge-spans' => true,
'input-encoding' => '????',
'output-encoding' => 'utf8',
'hide-comments' => true,
];
$html = '<p><img src="foo.png" alt="" /><noscript><img src="foo.png" alt="" /></noscript></p>';
var_dump(tidy_repair_string($html, $tidy_config, 'UTF8'));
If I use empty config, it does not add the incorrect paragraph.
Edit: Looks like it is the enclose-block-text
.
Setting enclose-block-text to false does make a difference: the noscript is not enclosed in p anymore. However it is still present and the image still gets displayed twice. log: https://pastebin.com/bA1rtNEU
That looks like your libxml also does not want <noscript>
inside a paragraph so it closes it prematurely. This behaviour is not correct according to my reading of https://developer.mozilla.org/en-US/docs/Web/HTML/Element/noscript.
Geez, one would expect these ancient libraries to be without bugs.
Tried the tidy repro with tidy from the git repo and it still shown the same bug:
--- a/flake.nix
+++ b/flake.nix
@@ -15,12 +15,27 @@
outputs = { self, flake-compat, nixpkgs, utils }:
utils.lib.eachDefaultSystem (system:
let
- pkgs = nixpkgs.legacyPackages.${system};
+ pkgs = import nixpkgs.outPath {
+ inherit system;
+ overlays = [
+ (final: prev: {
+ html-tidy = prev.html-tidy.overrideAttrs (attrs: {
+ src = prev.fetchFromGitHub {
+ owner = "htacg";
+ repo = "tidy-html5";
+ rev = "188988022da4a64d80bc4a2eba21c33d57eb5152";
+ sha256 = "GRs5qF7OBv0CX/b8kLnnQsENZDjBRAn0uXJvWfJDQkg=";
+ };
+ });
+ })
+ ];
+ };
in {
devShell =
let
php = pkgs.php.withExtensions ({ enabled, all }: with all; enabled ++ [
imagick
+ tidy
]);
in
pkgs.mkShell {
Opened a bug against tidy: https://github.com/htacg/tidy-html5/issues/905
At this point I see these possible solutions:
html5lib
instead of libxml
, possibly with disabling enclose-block-text
.noscript
inside p
correctly.ContentExtractor
look for noscript
to parent node’s sibling as well.
I use the fulltext spout and in some feeds every image on the page is shown twice. This feed for reference: https://uxmovement.com/feed/