Open dca00 opened 2 years ago
Hi @dca00, I think you are looking for flattening. HtmlRuleSanitizer is a whitelisting sanitizer, so anything it does not know is removed by default. By specifying that the <html> and <body> tags are allowed but should be flattened, it should be possible to achieve what you want.
var san = HtmlSanitizer.SimpleHtml5Sanitizer();
// Drop these tags when they end up empty.
foreach (var t in "p br i b tt strong".Split(' '))
{
    san.Tag(t).RemoveEmpty();
}
// Strip the document wrappers but keep their contents.
san.Tag("body").Flatten();
san.Tag("html").Flatten();
var s = san.Sanitize("<html><script src=\"abc\"><body><p>ABC<b>abc</b><p>XYZ<b>xyz</p><u><li>abc<li>xyz</li></body></html>");
Optionally see https://github.com/Vereyon/HtmlRuleSanitizer/blob/master/Web.HtmlSanitizer.Tests/RuleTests.cs#L64 for matching test cases.
Do you not think that this defeats the purpose of loading a third-party library, if I still have to hardcode the elements to be handled? Just thinking out loud: what else would I have to flatten? Here we go: divs, tables, theads, trs, tds, iframes, spans, and a myriad of those fancy new elements that the advertising industry comes up with virtually daily. Am I missing anything?
Do note that the HtmlSanitizer.SimpleHtml5DocumentSanitizer() helper method was created exactly for this purpose; it allows full documents, including the <html> and <body> tags. Also, the <div> tag is whitelisted in both HtmlSanitizer.SimpleHtml5DocumentSanitizer() and HtmlSanitizer.SimpleHtml5Sanitizer().
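For illustration, here is a minimal sketch of sanitizing a full document with that helper. It assumes the `Vereyon.Web` namespace from the NuGet package and uses a shortened, well-formed variant of the input from the question. The exact output depends on the library version and its default rules, so none is shown.

```csharp
using System;
using Vereyon.Web; // assumption: namespace exposed by the HtmlRuleSanitizer package

class Program
{
    static void Main()
    {
        // Document-level preset; per the comment above it allows <html> and <body>.
        var sanitizer = HtmlSanitizer.SimpleHtml5DocumentSanitizer();

        // Shortened sample input, including a script tag that is not whitelisted.
        var dirty = "<html><body><p>ABC<b>abc</b></p><script src=\"abc\"></script></body></html>";

        string clean = sanitizer.Sanitize(dirty);
        Console.WriteLine(clean);
    }
}
```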
You are correct that various other HTML5 elements might be worth considering for inclusion, like <section>, <article>, etc.
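Until such elements are part of the presets, they can be added by hand. The sketch below flattens a couple of extra HTML5 elements using the same Tag(...).Flatten() calls shown earlier in this thread. The tag list is only an example, and the `Vereyon.Web` namespace is an assumption.

```csharp
using System;
using Vereyon.Web; // assumption: namespace exposed by the HtmlRuleSanitizer package

class Program
{
    static void Main()
    {
        var sanitizer = HtmlSanitizer.SimpleHtml5Sanitizer();

        // Example list only; extend it with whatever container elements your input uses.
        foreach (var tag in new[] { "section", "article" })
        {
            // Flatten: drop the tag itself but keep its contents.
            sanitizer.Tag(tag).Flatten();
        }

        var clean = sanitizer.Sanitize("<section><p>ABC</p><article><p>XYZ</p></article></section>");
        Console.WriteLine(clean);
    }
}
```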
The Sanitize call above returns an empty string. Does your class sanitize HTML fragments rather than HTML documents? That is not very useful when the HTML comes from external sources beyond our control, because it would then require preliminary stripping of the <html>, <head>, <body>, etc. containers.