flavorjones / loofah

Ruby library for HTML/XML transformation and sanitization
MIT License
934 stars 137 forks source link

Pass block to document fragment instantiation #282

Open troym9731 opened 7 months ago

troym9731 commented 7 months ago

Hi! I recently ran into a bizarre issue where some HTML I'm sanitizing was so deeply nested, that Nokogiri was silently failing with this error (#<Nokogiri::XML::SyntaxError: 1153:513: FATAL: Excessive depth in document: 256 use XML_PARSE_HUGE option>) and simply returning empty tags.

I'm looking into ways to potentially mitigate receiving HTML this deeply nested, but it also seems like it would be great if I could pass this option to Nokogiri to parse the HTML without losing a bunch of data.

I'm unsure if this would be the preferred way to introduce passing options from Loofah directly to Nokogiri, or if adding this sort of capability is something Loofah even wants to do, but I figured this would be a smaller change than adding an additional argument.

Anyway, thanks for all the work y'all do on this project. The company I work for has been using this for many years, and it continues to serve us well!

flavorjones commented 7 months ago

@troym9731 Thanks for starting this discussion! I want to think about this for a little bit.

Here's the key point I want to consider: Loofah is, almost by definition, used to sanitize untrusted content. The libxml2 parser has hard-coded limits in document length and depth (limits that are turned off by the HUGE parse option) specifically to mitigate denial-of-service threats from attackers who might feed your application overly-long or -deep documents in an attempt to cause OOM or excessive CPU usage.

And so I'm not sure whether offering the ability to set parse options is just a footgun that I'd want to avoid.

Can you help me understand a bit more about your use case? Is this truly an untrusted document that might turn into an attack vector if you set the HUGE option?

troym9731 commented 7 months ago

Thanks for the response! That was incredibly quick haha.

Can you help me understand a bit more about your use case? Is this truly an untrusted document that might turn into an attack vector if you set the HUGE option?

I would say technically yes, it's an untrusted document in the sense a malicious actor could send something so large it could cause issues. In practice, it's not something we're concerned about. The document almost always comes from a WYSIWYG editor in our platform, although someone could hit the API directly. Our whole platform requires authentication, so there isn't a public-facing API that could get abused by this.

We're unsure why this HTML payload was so incredibly nested, but we think it could be one of two things: The individual copied their HTML (the HTML is for written content such as blog posts or ebooks) from an external source, and pasted it into our editor and sent it over (this is quite common as people tend to prefer Google docs), or something about our editor on mobile causes it to create unnecessary wrapping tags depending on the mobile OS.

We're still looking into the root cause here, but if some writers on our platform are copying/pasting from some sort of external writing tool that has awfully-nested HTML like this, then in some form or another we'll have to figure out how to support it without writers losing their work.

And so I'm not sure whether offering the ability to set parse options is just a footgun that I'd want to avoid.

Completely understandable. We could put it behind a named argument that illustrates its potential security hole? Or potentially even only allowing for a subset of parse options provided by Nokogiri. But either of those might be a breaking change, and I'm unsure that's worth it considering Loofah has been around for awhile and no one has mentioned this before haha.