apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance.
MIT License
3.68k stars, 349 forks

Failing to parse large base64 encoded image url in an img element's src #619

Open hmaskat17 opened 1 year ago

hmaskat17 commented 1 year ago

To Reproduce

Step by step instructions to reproduce the behavior:

  1. Sanitize a string containing an img element whose src is a very large base64-encoded data URL
  2. Allow the img tag and its src attribute in the options
  3. The returned string contains an img element with no src value, i.e. the image is gone
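
The steps above can be sketched as a small script. This is a minimal sketch, not the reporter's actual code: `buildImgHtml` and `srcSurvives` are hypothetical helper names, the payload is filler base64 rather than a real image, and the sanitizer is passed in as a parameter so the check does not depend on how sanitize-html is installed. The options use `allowedTags` and `allowedAttributes`, matching step 2.

```javascript
// Hypothetical helper: build an <img> whose src is a data URL carrying
// `length` characters of filler base64 (not a real image).
function buildImgHtml(length) {
  const payload = 'A'.repeat(length);
  return `<img src="data:image/png;base64,${payload}" />`;
}

// Options mirroring step 2: allow the img tag and its src attribute.
const reproOptions = {
  allowedTags: ['img'],
  allowedAttributes: { img: ['src'] }
};

// Hypothetical helper: run the given sanitizer and report whether the
// full data URL is still present in the output.
function srcSurvives(sanitize, length) {
  const input = buildImgHtml(length);
  const output = sanitize(input, reproOptions);
  return output.includes('data:image/png;base64,' + 'A'.repeat(length));
}
```

With the package installed, comparing `srcSurvives(require('sanitize-html'), 16)` against the same call with a much larger length would help separate causes: `true` for the small payload and `false` for the large one would point to a size limit, while `false` for both would point to the options rather than the size.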

Expected behavior

The encoded image url value should be contained in the returned string

Describe the bug

The sanitizeHtml function seems to discard long base64 encoded data when sanitizing and therefore returns a string with missing attribute values for the img, i.e. it returns an empty img element

Details

Version of Node.js: 18

Server Operating System: Windows

Additional context: No error object is returned when it fails to parse the image URL

boutell commented 1 year ago

How large is large?

This could be an upstream limitation of htmlparser2, but I'm not casting blame, as I'm not 100% sure why there would be any limit there either. There is definitely no "if bytes more than X, reject it" policy in sanitize-html.

hmaskat17 commented 1 year ago

> How large is large?

The encoded data is about 172 KB when pasted into Notepad and contains 177,112 characters. I don't know whether that is large for a raw base64 image, but it is a long run of characters inside an HTML element. @boutell

boutell commented 1 year ago

It doesn't seem unreasonable to me. Can you create a PR adding a failing unit test?

hmaskat17 commented 1 year ago

Just a heads-up: I will have to come back to this at a later date because of time constraints.

jzellis commented 1 year ago

I'm seeing the same issue -- even after making sure img is an allowed tag and src is an allowed attribute for img, it still removes the src entirely when I sanitize it.
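
One thing worth ruling out here, since nothing in this thread confirms the cause: as far as the sanitize-html README describes, the default `allowedSchemes` list does not include `data`, so a `data:` URL in `src` would be dropped regardless of its length unless that scheme is allowed. Below is a minimal sketch of options that allow it for img only, assuming the documented `allowedSchemesByTag` option. The `withDataImages` helper name is hypothetical.

```javascript
// Hypothetical helper: extend caller-supplied options so that img
// elements may use data: URLs, without mutating the original object.
function withDataImages(options = {}) {
  const byTag = options.allowedSchemesByTag || {};
  const imgSchemes = byTag.img || [];
  return {
    ...options,
    allowedSchemesByTag: {
      ...byTag,
      // Add 'data' only if it is not already listed for img.
      img: imgSchemes.includes('data') ? imgSchemes : [...imgSchemes, 'data']
    }
  };
}

const imgOptions = withDataImages({
  allowedTags: ['img'],
  allowedAttributes: { img: ['src'] }
});
// imgOptions can then be passed as the second argument to sanitizeHtml.
```

If the src still disappears with these options, the scheme is not the cause and size remains the suspect. A per-tag entry appears to take the place of the global scheme list for that tag, so `http` and `https` would need to be listed for img as well if remote images should also pass.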

boutell commented 1 year ago

Please provide a failing unit test in test/test.js so we can be sure we are talking about the same thing.

You can try out your tests with:

npm install
npm test