Escape HTML characters in list of suggestions

dontcallmedom commented 5 years ago

fix #16932

Generate manual DOM subtree for each marked instance of the queried string instead of generating an unsafe HTML string

dontcallmedom commented 5 years ago

I have hopefully adjusted the code to the coding style.

I may be missing something for the replace function - while it would work fine to generate the mark elements, I don't see how to use it succinctly to escape the HTML characters that would be outside of the matched parts of the string.

LeaVerou commented 5 years ago

I have hopefully adjusted the code to the coding style.

Thanks!

I may be missing something for the replace function - while it would work fine to generate the mark elements, I don't see how to use it succinctly to escape the HTML characters that would be outside of the matched parts of the string.

That made me realize that as currently written, this is backwards incompatible: HTML in options works right now, whereas if this is merged, it would stop working and possibly break numerous use cases. Ideally we want something that keeps HTML that was there from the beginning, but doesn't convert special characters to HTML when highlighting them. I.e. something that's XSS-safe, but doesn't restrict what you can do in the initial options. Does that make sense?

dontcallmedom commented 5 years ago

re backwards-compatibility, I don't think it can be preserved:

right now, the API is ambiguous about the type of content in label - when rendered in the list options, it is rendered as HTML, but when rendered in the <input> field, it is rendered as plain text (since it goes through the value attribute).
- it's not really a matter of highlighting the content of the label with <mark> - the XSS risks exists even when no highlighting happens - if the list of label options contains a malicious <script> (e.g. coming from a third party ajax service, or from an http query parameters), that script will be loaded as soon as the list of labels is displayed

I guess we could make the new behavior opt-in - but given the XSS risks and the existing ambiguity, it seems to me making it the default would be better. That being said, I could imagine making it possible to opt-out of the new behavior (e.g. through a new option) - this would imply adapting the way the labels gets displayed in the <input> to get only the textContent equivalent.

LeaVerou commented 5 years ago

right now, the API is ambiguous about the type of content in label - when rendered in the list options, it is rendered as HTML, but when rendered in the <input> field, it is rendered as plain text (since it goes through the value attribute).

I believe that's fine, e.g. many use cases are about displaying graphics next to the suggestions.

it's not really a matter of highlighting the content of the label with <mark> - the XSS risks exist even when no highlighting happens - if the list of label options contains a malicious <script> (e.g. coming from a third party ajax service, or from an http query parameters), that script will be loaded as soon as the list of labels is displayed

Good point, that should be prevented (though I'm still not sure if it should be the default at this point). However, if there's HTML in the initial options, included in the page's markup, that should be preserved, no?

I noticed that the current behavior is definitely buggy: If you include <b> in data-list, it will be rendered (correct). If you include <b>, it will still be rendered! That's definitely a bug which needs to be fixed in both cases (regardless of someone opts in to allowing HTML or not).

I guess we could make the new behavior opt-in - but given the XSS risks and the existing ambiguity, it seems to me making it the default would be better. That being said, I could imagine making it possible to opt-out of the new behavior (e.g. through a new option) - this would imply adapting the way the labels gets displayed in the <input> to get only the textContent equivalent.

I wonder if it would be an easier transition to make it opt-in in this version, and announce that it's going to be default behavior in the next one, and people should use the option now if they need it.

dontcallmedom commented 5 years ago

I have looked a bit more into this, and I think there are the following bugs in the way HTML values or labels are handled:

if you import your list of values/labels from in-markup page, only the text is imported, not the markup; but as you pointed out, that text itself is then interpreted as HTML in the list of suggestions (i.e. <b> gets rendered as a <b> tag)
if you use markup in the list of suggestions, the characters of the markup are included in the matching (e.g. <strong> will be matched when searching str); when matched, the markup gets messed up. Conversely, if the list of suggestions includes in-word markup, the word won't be matched (e.g. <strong>J</strong>avascript, then typing java won't list Javascript as a suggestion)
as I described above, markup used in the list of suggestions doesn't get displayed properly when rendered in the input box
as described above, the combination of the ambiguous API and parsing bugs create XSS risks (e.g. when the suggestions come from unfiltered external input)

Here is how I think HTML in the list of suggestions could be properly supported with reduced XSS risks:

suggestions brought in by text should always be assumed to be text only (e.g. when declared in data-list)
suggestions that need markup rendering should be provided as DOM fragments - which could be done either when they're imported via the CSS selector syntax, or could be done when passed as arrays in JS (e.g. list: [ node1, node2]) - the assumption being that it would then be the responsibility of the developer to produce these fragments from strings if that's the behavior they want (e.g. with DOMParser or innerHTML) - that way, developers that are aware of the risk of innerHTML will have greater awareness of possible XSS risks if the strings are unsafe

This would imply adopting the code to handle differently text and DOM fragments, and converting DOM fragments to text via textContent when setting input.value and when matching user input. I think (but am not sure) the highlight of matching characters in DOM fragments could still be achieved with the Range API.

What do you think? (I'm leaving aside the backwards-compatibility question for now - I feel it will be easier to consider it once the end goal is clearer)

LeaVerou / awesomplete

Escape HTML characters in list of suggestions #17189