ezyang / htmlpurifier

Standards compliant HTML filter written in PHP
http://htmlpurifier.org
GNU Lesser General Public License v2.1

[Enduser Customize] - Cannot allow heading elements (h1, h2, h3, h4, h5, h6) as Block elements #390

Closed lenhatthanh20 closed 10 months ago

lenhatthanh20 commented 10 months ago

Dear team,

Following the end-user customization docs at http://htmlpurifier.org/docs/enduser-customize.html, I'm trying to customize the heading elements (h1, h2, h3, h4, h5, h6) with the following code:

$config = HTMLPurifier_Config::createDefault();
$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
$config->set('HTML.DefinitionRev', 1);
$config->set('Cache.DefinitionImpl', null); // remove this later!
if ($def = $config->maybeGetRawHTMLDefinition()) {
    $def->addElement('h1', 'Block', 'Flow', 'Common');
    $def->addElement('h2', 'Block', 'Flow', 'Common');
    $def->addElement('h3', 'Block', 'Flow', 'Common');
    $def->addElement('h4', 'Block', 'Flow', 'Common');
    $def->addElement('h5', 'Block', 'Flow', 'Common');
    $def->addElement('h6', 'Block', 'Flow', 'Common');
    $def->addElement('table', 'Block', 'Flow', 'Common');
    $def->addElement('tbody', 'Block', 'Flow', 'Common');
    $def->addElement('tr', 'Block', 'Flow', 'Common');
    $def->addElement('td', 'Block', 'Flow', 'Common');
    $def->addElement('span', 'Block', 'Flow', 'Common');
    $def->addElement('a', 'Block', 'Flow', 'Common');
}

$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html); // keep the purified result

I wrote a unit test with the following input (dirty HTML):

<h1>
  <h2>
    <h3>
      <h4>
        <h5>
          <h6>
            <table>
              <tbody>
                <tr>
                  <td>
                    <span>
                      <a>Testing</a>
                    </span>
                  </td>
                </tr>
              </tbody>
            </table>
          </h6>
        </h5>
      </h4>
    </h3>
  </h2>
</h1>

I expected the heading elements to allow table, tbody, tr, td, span, and a elements as their children, but the actual result does not match my expectation.

Expected result:

<h1>
  <h2>
    <h3>
      <h4>
        <h5>
          <h6>
            <table>
              <tbody>
                <tr>
                  <td>
                    <span>
                      <a>Testing</a>
                    </span>
                  </td>
                </tr>
              </tbody>
            </table>
          </h6>
        </h5>
      </h4>
    </h3>
  </h2>
</h1>

But the actual result is:

<h1>
  <h2>
    <h3>
      <h4>
        <h5>
          <h6>
            </h6></h5></h4></h3></h2></h1><table>
              <tbody>
                <tr>
                  <td>
                    <span>
                      <a>Testing</a>
                    </span>
                  </td>
                </tr>
              </tbody>
            </table>

There is also something very strange: when I enable Core.CollectErrors, my test passes, and I don't know why. $config->set('Core.CollectErrors', true);

Could you please take a look at this issue? Thank you so much for your support.

lenhatthanh20 commented 10 months ago

More info: when I enable Core.CollectErrors, the lexer implementation is DirectLex (auto-detected) at this line: https://github.com/ezyang/htmlpurifier/blob/master/library/HTMLPurifier/Lexer.php#L99-L100

When Core.CollectErrors is disabled, the lexer implementation in my application is DOMLex.
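
The auto-detection described above can be checked directly. The following is a minimal sketch, assuming HTML Purifier is installed via Composer (the autoload path is an assumption; adjust it for your setup). It asks the library which lexer it would pick for a given config. The result depends on the PHP environment, so no output is shown here.

```php
<?php
// Assumption: Composer autoloader; replace with your own include if needed.
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();

// Toggle this line to compare the lexer selected with and without error collection.
$config->set('Core.CollectErrors', true);

// HTMLPurifier_Lexer::create() is the factory that performs the auto-detection.
$lexer = HTMLPurifier_Lexer::create($config);

// Print the class name of the selected lexer implementation.
echo get_class($lexer), PHP_EOL;
```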

I think I can fix this issue by forcing the DirectLex implementation through the Core.LexerImpl config. Do you have any recommendations, though? I ask because DOMLex seems to be better than DirectLex.

Thank you.

ezyang commented 10 months ago

Not much you can do about it. DOMLex uses libxml, so you get whatever the C code does, it's not really customizable. To have a speedy lexer someone has to write a C extension for it. Maybe in 2023 someone has written it and I just don't know about it lol.

lenhatthanh20 commented 10 months ago

Thank you for your response. What do you think about hard-coding Core.LexerImpl so that it always uses DirectLex?

ezyang commented 10 months ago

Yeah, you can just change the config directly.
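
For reference, the workaround discussed in this thread might look like the sketch below. It pins the lexer to DirectLex through Core.LexerImpl instead of relying on auto-detection. The autoload path and the sample input are assumptions for illustration, and only the h1 element from the original report is shown.

```php
<?php
// Assumption: Composer autoloader; replace with your own include if needed.
require_once __DIR__ . '/vendor/autoload.php';

$config = HTMLPurifier_Config::createDefault();

// Pin the lexer so the result no longer depends on Core.CollectErrors.
// Set directives before requesting the raw definition below.
$config->set('Core.LexerImpl', 'DirectLex');

$config->set('HTML.DefinitionID', 'enduser-customize.html tutorial');
$config->set('HTML.DefinitionRev', 1);
$config->set('Cache.DefinitionImpl', null); // disables caching; remove once the definition is stable

if ($def = $config->maybeGetRawHTMLDefinition()) {
    // Same customization as in the original report, abbreviated to one element.
    $def->addElement('h1', 'Block', 'Flow', 'Common');
}

// Hypothetical sample input for illustration only.
$dirty_html = '<h1><table><tbody><tr><td>Testing</td></tr></tbody></table></h1>';

$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

echo $clean_html, PHP_EOL;
```

As noted earlier in the thread, DirectLex is the slower, pure-PHP lexer, so this trades some speed for parsing behavior that does not depend on libxml.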