ezyang / htmlpurifier

Standards compliant HTML filter written in PHP
http://htmlpurifier.org
GNU Lesser General Public License v2.1
3.07k stars, 327 forks

The library purified the img tags, no matter what... #255

totumfaktum88 closed 4 years ago

totumfaktum88 commented 4 years ago

Hi there,

I started using this library via Composer (version constraint ^4.12) on PHP 7.3.

I tried to clean multiple HTML files (generated with LibreOffice) that contain images with external http/https URLs and data URIs. All of these images are removed. Can you help me?

I tried several configs, but none of them worked.

Here is the latest config:

<?php
$purifierConfig = \HTMLPurifier_Config::createDefault();
$purifierConfig->set('HTML.Doctype', 'HTML 4.01 Transitional');
$purifierConfig->set('HTML.Allowed', 'a[href|title],img[src],em,strong,cite,blockquote,code,ul,ol,li,dl,dt,dd,p,br,h1,h2,h3,h4,h5,h6,span,*[style]');
$purifierConfig->set('HTML.Trusted', true);
$purifierConfig->set('URI.AllowedSchemes', [
    "data" => true,
    "http" => true,
    "https" => true,
    "file" => true,
]);

What's missing?

The img tags in the generated source look like this: `

`

Thx for your help!

ezyang commented 4 years ago

You need to allow alt as an attribute on img, as it is a required attribute.

totumfaktum88 commented 4 years ago

Same result from this source:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>

</head>
<body>
<table>
    <tr>
        <td><p>
            <b>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam nunc justo, blandit a velit at, volutpat malesuada lectus. Sed non nunc non augue iaculis facilisis et et sem. </b></p>
        </td>
    </tr>
    <tr>
        <td><p>
            <img alt="lorem ipsum" src="https://image.shutterstock.com/image-vector/sample-stamp-grunge-texture-vector-260nw-1389188336.jpg"/>
</p>
        </td>
    </tr>
    <tr>
        <td>
            <p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam nunc justo, blandit a velit at, volutpat malesuada lectus. Sed non nunc non augue iaculis facilisis et et sem. 
            </p>
        </td>
    </tr>
</table>
</p>
</body>
</html>

Purified output:

<p>Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam nunc justo, blandit a velit at, volutpat malesuada lectus. Sed non nunc non augue iaculis facilisis et et sem.</p>
<p></p>
<p>
Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam nunc justo, blandit a velit at, volutpat malesuada lectus. Sed non nunc non augue iaculis facilisis et et sem.
</p>

ezyang commented 4 years ago

What is your updated config?

totumfaktum88 commented 4 years ago

> What is your updated config?

<?php
$purifierConfig = \HTMLPurifier_Config::createDefault();
$purifierConfig->set('HTML.Doctype', 'HTML 4.01 Transitional');
$purifierConfig->set('HTML.Allowed', 'a[href|title],img[src|alt],em,strong,cite,blockquote,code,ul,ol,li,dl,dt,dd,p,br,h1,h2,h3,h4,h5,h6,span,*[style]');
$purifierConfig->set('URI.AllowedSchemes', [
    "data" => true,
    "http" => true,
    "https" => true,
    "file" => true,
]);

ezyang commented 4 years ago

If you print your input HTML from PHP before purifying it what does it say?
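A minimal sketch of that check, assuming the input is held in a variable named `$html`. The helper `countImgTags` and the example.com URL are hypothetical and only serve the illustration; the idea is to look at the exact string handed to the purifier and confirm that it still contains img tags.

```php
<?php
// Hypothetical helper: counts <img> tags in a string with a simple regex.
function countImgTags(string $html): int
{
    return (int) preg_match_all('/<img\b[^>]*>/i', $html);
}

// Stand-in for the real input; in practice this is whatever string
// is about to be passed to the purifier.
$html = '<p><img alt="lorem ipsum" src="https://example.com/sample.jpg"/></p>';

// Dump the raw input and report how many img tags it contains.
var_dump($html);
echo countImgTags($html), " img tag(s) in the input\n";
```

If the count is already zero at this point, the images are being lost before the purifier ever sees them.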

totumfaktum88 commented 4 years ago

Ugh, god. I found a regex replace that preprocesses the source before the purifier runs. After I removed that section, the library worked well.
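For anyone hitting something similar, here is a rough sketch of how a preprocessing step can drop images before purification. The regex and the function names `preprocess` and `imgCount` are invented for illustration; the actual regex from my code is not reproduced here.

```php
<?php
// Hypothetical preprocessing step, invented for illustration only.
function preprocess(string $html): string
{
    return (string) preg_replace('/<img\b[^>]*>/i', '', $html);
}

// Hypothetical helper: number of <img> tags in a string.
function imgCount(string $html): int
{
    return (int) preg_match_all('/<img\b[^>]*>/i', $html);
}

$raw = '<p><img alt="lorem ipsum" src="https://example.com/sample.jpg"/></p>';
$prepared = preprocess($raw);

// Comparing the counts shows whether images are lost before purification.
printf("before: %d, after: %d\n", imgCount($raw), imgCount($prepared));
```

A drop in the count between the raw and the preprocessed string points at the preprocessing code rather than at the purifier config.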

Thanks for your help!