FreshRSS / Extensions

A repository containing all the official FreshRSS extensions
GNU Affero General Public License v3.0
378 stars 59 forks source link

ImageProxy: avoid oversight of wrapping an entire HTML document around image #212

Closed Frenzie closed 8 months ago

Frenzie commented 8 months ago

In reply to https://github.com/FreshRSS/Extensions/issues/206#issuecomment-1976649551.

With special thanks to this comment: https://www.php.net/manual/en/domdocument.savehtml.php#119767

Frenzie commented 8 months ago

Hm, I was just reading more and realized this may not be ideal, though at first glance the output looked okay:

For an example:

<h1>Foo</h1><p>bar</p>

will end up as:

<h1>Foo<p>bar</p></h1>

Easiest workaround is adding root tag yourself and stripping it later:

$html->loadHTML('<html>' . $content .'</html>', LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$content = str_replace(array('<html>','</html>') , '' , $html->saveHTML());

Alkarex commented 8 months ago

I have not looked precisely, but a typical way is to saveHTML($node). Try to search our codebase for ->saveHTML( for some examples

Frenzie commented 8 months ago

That's the first thing I tried, as $dom->getElementsByTagName('body')->item(0)->saveHTML().

But I see what you're referring to is actually a for loop appending to a string. I had considered that, but it felt hacky.

Alkarex commented 8 months ago

I think it makes a difference in behaviour to pass the node inside the saveHTML($node) function

Alkarex commented 8 months ago

And no, without loop. See e.g.

https://github.com/FreshRSS/FreshRSS/blob/da43fff43713daadbf13d05557a2ba14d5166b11/lib/SimplePie/SimplePie/Sanitize.php#L468

Frenzie commented 8 months ago

Ah, in that case presumably it should simply be something like:

$output = $dom->getElementsByTagName('body')->item(0);
return $dom->saveHTML($output);

I'll test that tomorrow (if I find the time).

PS Here are the other results. I hadn't notice the third was slightly different, whoops. https://github.com/FreshRSS/FreshRSS/blob/da43fff43713daadbf13d05557a2ba14d5166b11/app/Models/Feed.php#L751-L756 https://github.com/FreshRSS/FreshRSS/blob/da43fff43713daadbf13d05557a2ba14d5166b11/app/Models/Entry.php#L727-L743

Frenzie commented 8 months ago

For reference, this is the idea I played with first yesterday:

        $body = $doc->getElementsByTagName('body')->item(0);

        $output = '';
        foreach ($body->childNodes as $node) {
                $output .= $doc->saveHTML($node);
        }

        return $output;

The sample uses regex to effect the desired result. I will do so as well to limit new string creation. https://github.com/FreshRSS/FreshRSS/blob/da43fff43713daadbf13d05557a2ba14d5166b11/lib/SimplePie/SimplePie/Sanitize.php#L470-L474