apostrophecms / sanitize-html

Clean up user-submitted HTML, preserving whitelisted elements and whitelisted attributes on a per-element basis. Built on htmlparser2 for speed and tolerance
MIT License
3.68k stars 349 forks source link

Image alt attribute with empty string should not be removed #633

Closed compulim closed 9 months ago

compulim commented 9 months ago

PLEASE NOTE: make sure the bug exists in the latest patch level of the project. For instance, if you are running a 2.x version of Apostrophe, you should use the latest in that major version to confirm the bug.

To Reproduce

Step by step instructions to reproduce the behavior:

npm install sanitize-html@2.11.0
import sanitizeHtml from 'sanitize-html';

// EXPECT: <img alt="" src="https://example.com/" />
// ACTUAL: <img src="https://example.com/" />
console.log(
  sanitizeHtml('<img alt="" src="https://example.com/" />', {
    allowedAttributes: { img: ['alt', 'src'] },
    allowedTags: ['img']
  })
);

// EXPECT: <img alt="" src="https://example.com/" />
// ACUTAL: <img alt src="https://example.com/" />
console.log(
  sanitizeHtml('<img alt="" src="https://example.com/" />', {
    allowedAttributes: { img: ['alt', 'src'] },
    allowedTags: ['img'],
    nonBooleanAttributes: []
  })
);

Expected behavior

alt="" should be left as-is.

Describe the bug

Empty alt attribute means the alt text of the image is intentionally left blank. For example, border graphics may not need alt attribute if they are unimportant decoration.

When alt is an empty string, screen reader will skip the element and read nothing.

When alt is not set, screen reader will read "image".

So, alt="" is not same as alt or unset.

To workaround the issue, use <img role="presentation" src="https://example.com/" />.

Details

Version of Node.js: 18.16.0 PLEASE NOTE: Only stable LTS versions (10.x and 12.x) are fully supported but we will do our best with newer versions.

Server Operating System: Ubuntu 20.04 on WSL The server (which might be your dev laptop) on which Apostrophe is running. Linux? MacOS X? Windows? Is Docker involved?

Additional context: Seems regression of #529.

Add any other context about the problem here. If the problem is specific to a browser, OS or mobile device, specify which.

Screenshots If applicable, add screenshots to help explain your problem.

$ cat index.mjs
import sanitizeHtml from 'sanitize-html';

// EXPECT: <img alt="" src="https://example.com/" />
// ACTUAL: <img src="https://example.com/" />
console.log(
  sanitizeHtml('<img alt="" src="https://example.com/" />', {
    allowedAttributes: { img: ['alt', 'src'] },
    allowedTags: ['img']
  })
);

// EXPECT: <img alt="" src="https://example.com/" />
// ACUTAL: <img alt src="https://example.com/" />
console.log(
  sanitizeHtml('<img alt="" src="https://example.com/" />', {
    allowedAttributes: { img: ['alt', 'src'] },
    allowedTags: ['img'],
    nonBooleanAttributes: []
  })
);
$ node ./index.mjs
<img src="https://example.com/" />
<img alt src="https://example.com/" />
BoDonkey commented 9 months ago

How to work on this issue

Code suggestions for this issue

  1. Start by writing a test that should pass once your code changes are working. This will give you back info on how it is failing. You can use the examples that the person who opened this issue provided.
  2. Next, focus on the section(s) of code that are important for attributes. The code base is a little large, so don't get lost in it.
    • Should alt be a non-boolean attribute? Does it matter for this code change?
    • What line of code actually deletes the =""?
      1. Rather than letting any attribute have a value of ="", create a new option (potentially with default values) that lets the end user define what tags can have this value - like the allowedAttributes option.
      2. Make sure to write additional tests that show this option works!

Finally, please make use of the Discord to ask questions. Try to answer the questions using internet resources, but don't be afraid to ask questions on the Discord about anything. We are here to help!

zhna123 commented 9 months ago

Hi. I am new here and just had a look at this issue. The 2nd case above seems correct to me - alt is implicitly alt="". Is that right? (unless we actually want to specify alt="")

// EXPECT: <img alt="" src="https://example.com/" />
// ACUTAL: <img alt src="https://example.com/" />
console.log(
  sanitizeHtml('<img alt="" src="https://example.com/" />', {
    allowedAttributes: { img: ['alt', 'src'] },
    allowedTags: ['img'],
    nonBooleanAttributes: []
  })
);
BoDonkey commented 9 months ago

@zhna123 - There are cases where the image being added is purely decorative. In that case, we don't want explicit alt text. However, for your HTML to be in spec and pass validation, it needs an alt attribute. Leaving that attribute as a boolean without string will cause some screen readers to output "no alt text" and others will output the filename of the image. Both of these are disrupting to the UX of the page. So, the OP is correct, the alt tag should be allowed to be set to an empty string.

boutell commented 9 months ago

I agree it would make sense to fix this one specifically for the alt attribute.

On Tue, Sep 26, 2023 at 6:27 AM Robert Means @.***> wrote:

@zhna123 https://github.com/zhna123 - There are cases where the image being added is purely decorative. In that case, we don't want explicit alt text. However, for your HTML to be in spec and pass validation, it needs an alt attribute. Leaving that attribute as a boolean without string will cause some screen readers to output "no alt text" and others will output the filename of the image. Both of these are disrupting to the UX of the page. So, the OP is correct, the alt tag should be allowed to be set to an empty string.

— Reply to this email directly, view it on GitHub https://github.com/apostrophecms/sanitize-html/issues/633#issuecomment-1735261269, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAAH27OAQN37VYC6CLHLMETX4KUXZANCNFSM6AAAAAA4RB5HGI . You are receiving this because you are subscribed to this thread.Message ID: @.***>

--

THOMAS BOUTELL | CHIEF TECHNOLOGY OFFICER APOSTROPHECMS | apostrophecms.com | he/him/his

zhna123 commented 9 months ago

Okay. I used VoiceOver to test alt without explicitly specify ="", it didn't announce anything, but removing it will be "unlabeled image". But it's true that I can't guarantee all the screen reader will behave the same way.. I created a pr to fix this.