apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance
MIT License
3.79k stars 353 forks source link

Trailing text is not preserved #506

Closed IvanPizhenko closed 2 years ago

IvanPizhenko commented 2 years ago

To Reproduce

  1. Get sample project that demonstrates bug (2 files) from here: https://gist.github.com/IvanPizhenko/100c92f523a994fd425ff8f16420f12d
  2. Run npm install
  3. Run node index.js

Expected behavior

I expect trailing text Above is preserved, i.e. sanitized output should be:

Below
<span>[image: image.png]</span>
Above

Describe the bug

Trailing text Above is lost, i.e. I am getting:

Below
<span>[image: image.png]</span>

Details

Version of Node.js: v12.22.6

Server Operating System: Linux

Additional context: I am trying to upgrade sanitize-html from version 1.21.1 to latest 2.5.2 (at the moment of submitting this issue) and getting the above described issue. According to my research (I have tried multiple versions) bug has appeared in version 1.27.3.

tomholub commented 2 years ago

Seems to be a regression. Maintainers: could you point us in the right direction as far as any guesses on where to look? We may attempt a fix.

boutell commented 2 years ago

It sounds like it might be an upstream issue with the htmlparser2 module. Might be — haven't delved into it.

alex-rantos commented 2 years ago

Stumble upon the issue myself when upgraded to latest from 1.22.

It's a bug on transformTags on text property which replaces content. (fyi: @tomholub)

IvanPizhenko commented 2 years ago

@alex-rantos Do you know how to fix this?

alex-rantos commented 2 years ago

Not really I just narrowed down the repro steps a bit further and shared it here. However, I will try to give it a go once I find some free time but I cannot promise anything - not familiar with transformTags flow.

alex-rantos commented 2 years ago

It's a regression from this commit https://github.com/alex-rantos/sanitize-html/commit/b218a722a3706de54b1ddcca0377e131102ec8fd

@abea the following condition fails to append the trailing text as addedText = true in this case.

else if (!addedText) {
  result += escaped;
}

In case you take a look this is the test case that fails:

  it('should preserve trailing text when replacing the tagName and adding new text via transforming function', function () {
    assert.equal(sanitizeHtml('<p>text before <br> text after</p>', {
      transformTags: {
        br: function (_tagName, _attribs) {
          return {
            tagName: 'span',
            text: ' '
          };
        }
      }
    }), '<p>text before <span> </span> text after</p>');
  });
IvanPizhenko commented 2 years ago

Works correctly in the version 2.6.1. Thank you for the fix!