Vereyon / HtmlRuleSanitizer

A rule based HTML sanitizer built on top of the HTML Agility pack
MIT License

Fails a basic test case #28

Open dca00 opened 2 years ago

dca00 commented 2 years ago
var san = HtmlSanitizer.SimpleHtml5Sanitizer();
foreach (var t in "p br i b tt strong".Split(" "))
{
    san.Tag(t).RemoveEmpty();
}
var s = san.Sanitize("<html><script src=\"abc\"><body><p>ABC<b>abc</b><p>XYZ<b>xyz</p><u><li>abc<li>xyz</li></body></html>");

This returns an empty string. Does your class sanitize only HTML fragments rather than full HTML documents? That is not very useful when the HTML comes from external sources beyond our control, because it would require stripping the <html>, <head>, <body> and similar containers beforehand.

cakkermans commented 2 years ago

Hi @dca00, I think you are looking for flattening. HtmlRuleSanitizer is a whitelisting sanitizer, so anything it does not know is removed by default. By specifying that the html and body tags are allowed but should be flattened, you should be able to achieve what you want.

var san = HtmlSanitizer.SimpleHtml5Sanitizer();
foreach (var t in "p br i b tt strong".Split(" "))
{
    san.Tag(t).RemoveEmpty();
}
san.Tag("body").Flatten();
san.Tag("html").Flatten();
var s = san.Sanitize("<html><script src=\"abc\"><body><p>ABC<b>abc</b><p>XYZ<b>xyz</p><u><li>abc<li>xyz</li></body></html>");

Optionally see https://github.com/Vereyon/HtmlRuleSanitizer/blob/master/Web.HtmlSanitizer.Tests/RuleTests.cs#L64 for matching test cases.

dca00 commented 2 years ago

Don't you think this defeats the purpose of loading a third-party library, if I still have to hardcode the elements to be handled? Just thinking out loud: what else would I have to flatten? Here we go: divs, tables, theads, trs, tds, iframes, spans, and a myriad of those fancy new elements that the advertising industry comes up with virtually daily. Am I missing anything?

cakkermans commented 1 year ago

Do note that the HtmlSanitizer.SimpleHtml5DocumentSanitizer() helper method was created exactly for this purpose; it allows full documents, including the html and body tags. The div tag is also whitelisted in both HtmlSanitizer.SimpleHtml5DocumentSanitizer() and HtmlSanitizer.SimpleHtml5Sanitizer().
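For reference, the original snippet switched to that helper might look roughly like the sketch below. It assumes the HtmlRuleSanitizer package is referenced and that its types live in the Vereyon.Web namespace (an assumption, not stated in this thread); the wrapping Program class is only there to make the sketch self-contained.

```csharp
using System;
using Vereyon.Web; // assumed namespace of HtmlSanitizer

class Program
{
    static void Main()
    {
        // Document-level helper named above: it is said to allow the html and
        // body tags, so no explicit Flatten() rules are added for them here.
        var san = HtmlSanitizer.SimpleHtml5DocumentSanitizer();

        // Same per-tag rule as in the original report: drop these tags when empty.
        foreach (var t in "p br i b tt strong".Split(' '))
        {
            san.Tag(t).RemoveEmpty();
        }

        // Same input as in the original report.
        var s = san.Sanitize("<html><script src=\"abc\"><body><p>ABC<b>abc</b><p>XYZ<b>xyz</p><u><li>abc<li>xyz</li></body></html>");
        Console.WriteLine(s);
    }
}
```

Tags outside the whitelist are still removed by default, so anything beyond what the helper allows would need its own rule.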

You are correct that various other HTML5 elements might be worth considering for inclusion, like section, article, etc.