OWASP / java-html-sanitizer

Takes third-party HTML and produces HTML that is safe to embed in your web application. Fast and easy to configure.

'-' and '_' may be treated in plain letters. #198

Closed yangbongsoo closed 4 years ago

yangbongsoo commented 4 years ago

#193: because of the _ character, the string "order" is found in ENTITY_TRIE and replaced.

I think - and _ could be treated as plain letters. Please tell me if you think I'm wrong or if a test is missing.
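
To make the proposal concrete, here is a minimal sketch of the boundary check it implies. It assumes a hypothetical helper class, `EntityNameBoundary`, which is not the sanitizer's actual code, and an illustrative input string that does not come from the linked report. The idea is that a semicolon-less match is only decoded when the next character cannot continue a name, with - and _ counted as name characters.

```java
// Hypothetical helper for illustration; not part of java-html-sanitizer.
final class EntityNameBoundary {
  /** True if c would continue an entity-like name under the proposed rule. */
  static boolean continuesName(char c) {
    return (c >= 'a' && c <= 'z')
        || (c >= 'A' && c <= 'Z')
        || (c >= '0' && c <= '9')
        || c == '-'
        || c == '_';
  }

  /**
   * Decides whether a semicolon-less match ending at index end (exclusive)
   * may be decoded: only when no name character follows it.
   */
  static boolean mayDecodeWithoutSemicolon(String html, int end) {
    return end >= html.length() || !continuesName(html.charAt(end));
  }

  public static void main(String[] args) {
    // Illustrative input (an assumption): a query string containing "&order_id".
    String url = "?a=1&order_id=2";
    int end = url.indexOf("&order") + "&order".length();
    // '_' follows the match, so the match is left alone.
    System.out.println(mayDecodeWithoutSemicolon(url, end)); // prints false
  }
}
```

With this rule, `&order` followed by a space or the end of input would still be a candidate for decoding; only the name-continuation case is excluded.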

mikesamuel commented 4 years ago

Hmm. Since this was initially written, https://html.spec.whatwg.org/#named-character-references clarified which named character references are allowed without a trailing semicolon.

I believe, according to html.spec.whatwg, that &curren is allowed without a semicolon because there are two entries in that table, one of which does not have a trailing ;:

Name Character Glyph
curren; U+000A4 ¤
curren U+000A4 ¤

Maybe we should just derive a list of HTML entities that are allowed without semicolons instead of looking for extra letters.
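
A rough sketch of that derivation, assuming the table's names are available as plain strings in the form shown above (with or without the trailing ;). The class name `SemicolonOptionalEntities` and the three-row excerpt are hypothetical; the `order;` row is added for contrast on the assumption that the table lists it only with a semicolon.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper for illustration; not part of java-html-sanitizer.
final class SemicolonOptionalEntities {
  /** Returns the names that the table lists without a trailing ';'. */
  static Set<String> derive(Collection<String> tableNames) {
    Set<String> optional = new TreeSet<>();
    for (String name : tableNames) {
      if (!name.endsWith(";")) {
        optional.add(name);
      }
    }
    return optional;
  }

  public static void main(String[] args) {
    // Small excerpt; a real list would be built from the full table.
    List<String> excerpt = Arrays.asList("curren;", "curren", "order;");
    System.out.println(derive(excerpt)); // prints [curren]
  }
}
```

A decoder could then accept a semicolon-less reference only when its name is in the derived set, rather than inspecting the characters that follow it.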

mikesamuel commented 4 years ago

Does https://github.com/OWASP/java-html-sanitizer/pull/201 do what you need?

yangbongsoo commented 4 years ago

Yes. I'll close this PR.