flavorjones / loofah

Ruby library for HTML/XML transformation and sanitization
MIT License
934 stars 137 forks source link

[draft] default to html5 parsing #239

Closed flavorjones closed 1 year ago

flavorjones commented 2 years ago

The libgumbo parser used by Nokogiri::HTML5 is superior to the libxml2 parser used by Nokogiri::HTML4 (the default).

This is a draft pull request to see how hard it would be to default to use that parser for Loofah's sanitization, and evaluate what changes might be breaking to the many Rails apps that use it.

Note that Loofah can only support HTML5 in Nokogiri >= 1.14.0 because it requires the subclassing fix at https://github.com/sparklemotion/nokogiri/commit/ebde7dae770aaefa667ee02ac57a2c0bfed1164c

See a related but orthogonal issue to default Nokogiri to HTML5: https://github.com/sparklemotion/nokogiri/issues/2331

flavorjones commented 2 years ago

See downstream https://github.com/rails/rails-html-sanitizer/pull/133 for an indication of impact of this change

flavorjones commented 2 years ago

Running some large Rails apps' CI with the HTML5 parser, found another notable behavioral difference that didn't show up in this test suite: binary attributes which don't have values in the libxml2 HTML4 parser (<option disabled>) but do in the libgumbo HTML5 parser (<option disabled="">).

flavorjones commented 2 years ago

I'd love to get some benchmarks just to understand the impact of making this change. It's unlikely to change my mind but would be good information to have handy.

flavorjones commented 2 years ago

I've run three large rails apps at Shopify through CI with this branch of Loofah and (except for three tests that were relying on the above binary attribute behavior) everything was green. This is encouraging.

DanielHeath commented 1 year ago

Would it make sense to get this merged with the default set to the existing HTML4, so apps can opt in to the HTML5 behavior and start testing? That would presumably make this a far safer change to release.

flavorjones commented 1 year ago

@DanielHeath thanks for asking. There's a bit of a yak shave of dependencies here.

Loofah can only support HTML5 in Nokogiri >= 1.14.0 because it requires the subclassing fix at sparklemotion/nokogiri@ebde7da

Hoping to get that nokogiri release out in an RC this weekend if I can finish Ruby 3.2 support in rake-compiler-doc.

flavorjones commented 1 year ago

I picked this back up again this week, now that Nokogiri 1.14.0.rc1 has been prereleased. This work is mostly done, though there are still details to work through here and in rails-html-sanitizer.

flavorjones commented 1 year ago

Closing this in favor of #261 which has a much better API for introducing HTML5 support.