apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance
MIT License
3.83k stars 353 forks source link

sanitizeHtml throws TypeError on '&' symbol #606

Open matejfalat opened 1 year ago

matejfalat commented 1 year ago

To Reproduce

Step by step instructions to reproduce the behavior:

sanitizeHtml('<p>&</p>')
// or
sanitizeHtml('<p>&nbsp</p>')

Expected behavior

Not to crash.

Describe the bug

When the html text contains the ampersand symbol, the sanitizeHtml() is failing with:

Uncaught TypeError: Cannot read properties of undefined (reading '0')
    at Tokenizer.stateBeforeEntity (Tokenizer.js?6fbd:582:1)
    at Tokenizer.parse (Tokenizer.js?6fbd:818:1)
    at Tokenizer.write (Tokenizer.js?6fbd:158:1)
    at Parser.write (Parser.js?5804:459:1)
    at sanitizeHtml (index.js?5e22:578:1)
    at MaterialPreviewPage (MaterialPreviewPage.tsx?d2f2:41:55)

Details

React: 18.2.0, Webpack: 5.75.0

Version of Node.js: v18.13.0

Server Operating System: Windows 11, WSL2, and Docker

Screenshots

error1

error2

BoDonkey commented 1 year ago

Hi @matejfalat, I tested this on a mac using the repo tests and could not reproduce the error. Is this occurring only in the browser? I wonder if this is an htmlparser2 issue, rather than a sanitize-html issue since Tokenizer is part of that package. I guess we would need some minimal project set-up to replicate this error. Cheers

boutell commented 1 year ago

Yes, what we would ask is that you contribute a PR with a failing unit test to this repo so we can see how this is possible in the context of this project and avoid any confusion with issues that might only exist in a larger project with parts that aren't actually dependencies of the module etc. htmlparser2 is a dependency so browser or no, a bug coming from that should be reproducible in a test.

victorbojica commented 8 months ago

probably because of missing ";" at the end

NewEraCracker commented 3 months ago

On Node.js CLI:

> const sanitizeHtml = require('sanitize-html');
undefined
> sanitizeHtml('<p>&</p>')
'<p>&amp;</p>'
> sanitizeHtml('<p>&nbsp</p>')
'<p> </p>'
>

Just tested this and it works good on both of these:

Might be worth adding a test case. My two cents. Thank you.

PS: Found these related: https://github.com/fb55/htmlparser2/issues/1426 & https://github.com/fb55/htmlparser2/pull/1460