Escape HTML in doc generation

nobodywasishere commented 1 year ago

Bug Report

See https://github.com/kostya/lexbor/issues/25. The document generator didn't escape this div which led to a broken output. However, this leads to a bigger issue of being able to do cross-site scripting and other malicious attacks for anyone hosting code documentation - something I need to prevent when working on crystaldoc.info.

Minimum example:

module Example
  # Says "hello"
  # </div>
  # <script>alert("Hello world!");</script>
  def example_method
    puts "hello world"
  end
end

$ crystal -v

Crystal 1.9.2 (2023-07-21)

LLVM: 15.0.7
Default target: x86_64-pc-linux-gnu

nobodywasishere commented 1 year ago

It looks like there's a Markd option that could fix this if enabled:

https://github.com/crystal-lang/crystal/blob/d0ad0275eb9db4f9a40dffc8d9825eac46eeb36b/src/compiler/crystal/tools/doc/generator.cr#L323C35-L323C35

https://github.com/crystal-lang/crystal/blob/d0ad0275eb9db4f9a40dffc8d9825eac46eeb36b/lib/markd/src/markd/options.cr#L51C15-L51C15

I'll have to test and see what this outputs though to see if it's safe from XSS.

straight-shoota commented 1 year ago

I don't think the safe option will produce good results because IIUC it strips all HTML. And sometimes it's really useful to have more than the standard markdown feature set available. Of course, being safe should be a priority over extra functionality.

A more functional solution would be sanitizing the rendering product. https://github.com/straight-shoota/sanitize would be a Crystal library for this. I'm using it for presenting the readmes on https://shardbox.org/

I'm not sure if this library should be integrated in the doc generator, though. Usually you should be a ble to trust the code. But I think it should be possible to sanitize the resulting HTML files after the doc generator has run.

nobodywasishere commented 1 year ago

After discussions on the discord server, I think the best option is as follows:

Escape all html in doc comments (safe by default). This prevents accidental html tags from breaking the output, like what happened with lexbor.
Have a special delimiter to indicate tags or sections that shouldn't be escaped (maybe something like \<table> \</table> or unescaped { })
Have a basic html sanitizer that allows to pass-through only specific tags/attributes (like table, ul, etc)

Just having 1 may be good enough for the doc generator itself for normal use cases, as I don't think most people use html in their docs. 2 would be alright as long as it's documented somewhere that it is vulnerable to people injecting their own html/xss. 3 would be above and beyond, and would be the most secure, though it may require more work to add an html sanitizer and decide what should be allowed to pass through. I can always run my own sanitizer on the output as well.

nobodywasishere commented 1 year ago

After doing some investigation and testing, the only additional code necessary for (1) is adding this to the MarkdDocRenderer, which just override the HTMLRenderer methods but adding escapes:


  def text(node : Markd::Node, entering : Bool)
    output(escape(node.text))
  end

  def html_block(node : Markd::Node, entering : Bool)
    newline
    content = @options.safe? ? "<!-- raw HTML omitted -->" : escape(node.text)
    literal(content)
    newline
  end

  def html_inline(node : Markd::Node, entering : Bool)
    content = @options.safe? ? "<!-- raw HTML omitted -->" : escape(node.text)
    literal(content)
  end

Example:

straight-shoota commented 1 year ago

I don't understand step 2 of that plan. What difference does a special delimiter make? It doesn't really improve anything, just makes things more complex. I think raw HTML in markdown is already pretty explicit.

nobodywasishere commented 1 year ago

That's fair, that may not be the best solution then. It was just a way of both making the default case safe but allowing for unescaped html if necessary. I think it could be a good idea to put this behind a flag, like "--dont-escape-html" that would allow raw html passthrough if a project needs it for their own documentation, but escape all raw html otherwise.

While html in markdown can be explicit, it doesn't mean it can't go unnoticed (like with lexbor's documentation). One risk case is if a shard hosts its own documentation, and they don't look at a PR too closely, it could leave them vulnerable to a XSS attack. While the maintainers should do their best to vet PRs before merging, sometimes things slip through. I'm not sure if the benefits of unrestricted html in doc comments outweighs the risks.

straight-shoota commented 1 year ago

I diagree with that risk assessment.

The docs generator executes macros and can run arbitrary commands on the host. So there is a high requirement for trust when building API docs on your local computer. Unsanitized HTML is a far lesser threat than that.

Of course, building and viewing have different scopes, but the threat origin is the same: untrusted sources.

When building documentation for untrusted sources you would need to sandbox the generator or have some other means to prevent harmful effects. For most users who just want to build documentation for their own code or its trusted dependencies, this step is prohibitively complicated.

In the same way, HTML sanitization is only really necessary when building untrusted sources.

So I think it would be a good analogy to apply sanitizing as you do sandboxing. If the code is trusted, neither is necessary. If it's untrusted, both are.

straight-shoota commented 1 year ago

That being said:

I'm not opposed to sanitize the doc output out of the box. https://github.com/straight-shoota/sanitize shouldn't be hard to integrate into the doc generator. In contrast to sandboxing the entire process, it's basically invisible. There's no extra weight except for that one dependency and ~4 lines of code.

nobodywasishere commented 1 year ago

Sanitizing may work on the output, we'd just have to be extremely careful in terms of what we allow through. It wouldn't stop someone from breaking documentation generation though (intentionally or not). Would sanitize be added as another vendorized shard?

Personally I still think the best option (for my use cases and making things easier for me) would be having a flag that enables escaping HTML in comments (--escape-html). This escaping is reliant on being a part of the Markdown -> HTML conversion itself, and would be really complex / not as secure to implement as a pre or post processing step. I would either have to manually parse the crystal source code and escape characters in comments (but not ones in code blocks and other situations), use a custom version of the compiler with this escaping patch, or pull the documentation generator into its own tool. I'm trying to make a tool that's secure and trustworthy, and I need to patch as many of these holes as possible to make sure everything works correctly. I understand that my use-case is not typical though and I'll work within whatever you and this community think is the best for the compiler.

nobodywasishere commented 1 year ago

Including sanitize works really well, just needed to replace escape with a new sanitize method. I'm unfamiliar with the sanitization methods for your sanitize library so I stuck with common, which seems to work for basic script tags. This should work for what I need.

  def text(node : Markd::Node, entering : Bool)
    output(sanitize(node))
  end

  def html_block(node : Markd::Node, entering : Bool)
    newline
    content = @options.safe? ? "<!-- raw HTML omitted -->" : sanitize(node)
    literal(content)
    newline
  end

  def html_inline(node : Markd::Node, entering : Bool)
    content = @options.safe? ? "<!-- raw HTML omitted -->" : sanitize(node)
    literal(content)
  end

  def sanitize(node : Markd::Node) : String
    sanitizer = Sanitize::Policy::HTMLSanitizer.common
    sanitizer.process(node.text)
  end

straight-shoota commented 1 year ago

Looks great. Yes, HTMLSanitizer.common should be the right configuration for this job 👍

Would sanitize be added as another vendorized shard?

Yes, we need all the compiler's dependencies vendored in.

If you would like to prepare a PR, I'd really appreciate it.

nobodywasishere commented 1 year ago

I would be happy to. Thank you!

straight-shoota commented 5 months ago

Unfortunately, https://github.com/crystal-lang/crystal/pull/13786 added a dependency on libxml2 to the compiler. This library is currently unavailable in the build environment, so nightlies are breaking: https://app.circleci.com/pipelines/github/crystal-lang/crystal/14924/workflows/dc50e27f-1e26-4e77-adaa-a1fa925925a9 We need to revert it (#14595) to fix the builds.

We can introduce it again, but with precautions. Some ideas:

Update the build environments to include libxml2
Make the dependency optional (flag -Dwithout_libxml2) to work in minimal environments where the library is unavailable.
This flag could be added by default in the makefile, i.e. you'll need to opt in to using sanitization.
If the flag is used, we might want to show some kind of warning when generating docs.

nobodywasishere commented 5 months ago

Don't we already use/link against libxml2 for XML?

https://github.com/crystal-lang/crystal/blob/453a1591f7175aba81871b80411324929bd96cd5/NOTICE.md?plain=1#L34 https://github.com/crystal-lang/crystal/blob/453a1591f7175aba81871b80411324929bd96cd5/src/xml/libxml2.cr#L7

How do we already handle it currently with the XML::Reader and such?

Blacksmoke16 commented 5 months ago

Pretty sure that's for the stdlib if you require "xml". The compiler itself doesn't use libxml2 at all.

straight-shoota commented 5 months ago

Yes, this information is for the standard library.

https://github.com/crystal-lang/crystal/blob/453a1591f7175aba81871b80411324929bd96cd5/NOTICE.md?plain=1#L25

The compiler only uses parts of the standard library, and up until now does not use XML. So libxml2 is currently not neccary for building the compiler. But the santization code uses XML which introduces a new dependency for the compiler.

nobodywasishere commented 5 months ago

Personally, I don't have a problem with just requiring access to libxml2, but can understand that not being the most ideal solution.

Make the dependency optional (flag -Dwithout_libxml2) to work in minimal environments where the library is unavailable.

This flag could be added by default in the makefile, i.e. you'll need to opt in to using sanitization.

If the flag is used, we might want to show some kind of warning when generating docs.

This is probably the most sensible solution outside of requiring libxml2.

straight-shoota commented 4 months ago

I'm actually not sure anaymore that opting out by default is a good idea. Sorry for the inconvenience 🙈

We should be safe by default and if you want less safety, you should have to opt-out of it. Not the other way around.

We could consider doing this temporarily to softly introduce the new dependency. However, I doubt this is worth it. It's probably better to fail the build by default. This gives maintainers the option to either install libxml2 or opt out, which should not be much trouble.

This should all work fine. The only reason we had to revert #13786 is because our build environment needs to be prepared. But that should be easy enough.

straight-shoota commented 4 months ago

Another insight on this: By adding more libraries and features we're expanding the complexity of the compiler quite a bit. Escaping HTML is not a core feature of a program that's expected to compile source code.

We might want to consider splitting the doc generator out of the main compiler program. This had been mentioned previously in https://github.com/crystal-lang/crystal/issues/6947#issuecomment-561368164: The compiler is only needed for the part of the job that's directly related to parsing the source code. It can dump that information into a structured format, and a separate tool would build the HTML. I belief this would simplify a couple of things and would make it easier to add new features that require additional libraries (like this one).

nobodywasishere commented 4 months ago

I thought about that several times when developing crystaldoc.info, didn't only due to the complexity of the task and keeping up with updates (which we wouldn't run into if this were official).

~~Would people still be able to generate docs using crystal doc and have the complexity handled for them?~~

crysbot commented 4 months ago

This issue has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/the-weight-of-compiler-tools/6888/1

nobodywasishere commented 4 months ago

I think moving the doc generator into its own tool is the right way forward, but is going to take some time to flesh out and implement (several months to a year). In the meantime, I think it'd be better to merge the sanitization branch now, so it's out of the way as this refactoring is happening. What do you think (pending switching the default to libxml2 being on)?

crystal-lang / crystal

Escape HTML in doc generation #13753

Bug Report