Open nobodywasishere opened 1 year ago
It looks like there's a Markd option that could fix this if enabled:
I'll have to test and see what this outputs though to see if it's safe from XSS.
I don't think the safe
option will produce good results because IIUC it strips all HTML. And sometimes it's really useful to have more than the standard markdown feature set available.
Of course, being safe should be a priority over extra functionality.
A more functional solution would be sanitizing the rendering product. https://github.com/straight-shoota/sanitize would be a Crystal library for this. I'm using it for presenting the readmes on https://shardbox.org/
I'm not sure if this library should be integrated in the doc generator, though. Usually you should be a ble to trust the code. But I think it should be possible to sanitize the resulting HTML files after the doc generator has run.
After discussions on the discord server, I think the best option is as follows:
\<table> \</table>
or unescaped { }
)Just having 1 may be good enough for the doc generator itself for normal use cases, as I don't think most people use html in their docs. 2 would be alright as long as it's documented somewhere that it is vulnerable to people injecting their own html/xss. 3 would be above and beyond, and would be the most secure, though it may require more work to add an html sanitizer and decide what should be allowed to pass through. I can always run my own sanitizer on the output as well.
After doing some investigation and testing, the only additional code necessary for (1) is adding this to the MarkdDocRenderer, which just override the HTMLRenderer methods but adding escapes:
def text(node : Markd::Node, entering : Bool)
output(escape(node.text))
end
def html_block(node : Markd::Node, entering : Bool)
newline
content = @options.safe? ? "<!-- raw HTML omitted -->" : escape(node.text)
literal(content)
newline
end
def html_inline(node : Markd::Node, entering : Bool)
content = @options.safe? ? "<!-- raw HTML omitted -->" : escape(node.text)
literal(content)
end
Example:
I don't understand step 2 of that plan. What difference does a special delimiter make? It doesn't really improve anything, just makes things more complex. I think raw HTML in markdown is already pretty explicit.
That's fair, that may not be the best solution then. It was just a way of both making the default case safe but allowing for unescaped html if necessary. I think it could be a good idea to put this behind a flag, like "--dont-escape-html" that would allow raw html passthrough if a project needs it for their own documentation, but escape all raw html otherwise.
While html in markdown can be explicit, it doesn't mean it can't go unnoticed (like with lexbor's documentation). One risk case is if a shard hosts its own documentation, and they don't look at a PR too closely, it could leave them vulnerable to a XSS attack. While the maintainers should do their best to vet PRs before merging, sometimes things slip through. I'm not sure if the benefits of unrestricted html in doc comments outweighs the risks.
I diagree with that risk assessment.
The docs generator executes macros and can run arbitrary commands on the host. So there is a high requirement for trust when building API docs on your local computer. Unsanitized HTML is a far lesser threat than that.
Of course, building and viewing have different scopes, but the threat origin is the same: untrusted sources.
When building documentation for untrusted sources you would need to sandbox the generator or have some other means to prevent harmful effects. For most users who just want to build documentation for their own code or its trusted dependencies, this step is prohibitively complicated.
In the same way, HTML sanitization is only really necessary when building untrusted sources.
So I think it would be a good analogy to apply sanitizing as you do sandboxing. If the code is trusted, neither is necessary. If it's untrusted, both are.
That being said:
I'm not opposed to sanitize the doc output out of the box. https://github.com/straight-shoota/sanitize shouldn't be hard to integrate into the doc generator. In contrast to sandboxing the entire process, it's basically invisible. There's no extra weight except for that one dependency and ~4 lines of code.
Sanitizing may work on the output, we'd just have to be extremely careful in terms of what we allow through. It wouldn't stop someone from breaking documentation generation though (intentionally or not). Would sanitize be added as another vendorized shard?
Personally I still think the best option (for my use cases and making things easier for me) would be having a flag that enables escaping HTML in comments (--escape-html
). This escaping is reliant on being a part of the Markdown -> HTML conversion itself, and would be really complex / not as secure to implement as a pre or post processing step. I would either have to manually parse the crystal source code and escape characters in comments (but not ones in code blocks and other situations), use a custom version of the compiler with this escaping patch, or pull the documentation generator into its own tool. I'm trying to make a tool that's secure and trustworthy, and I need to patch as many of these holes as possible to make sure everything works correctly. I understand that my use-case is not typical though and I'll work within whatever you and this community think is the best for the compiler.
Including sanitize works really well, just needed to replace escape
with a new sanitize
method. I'm unfamiliar with the sanitization methods for your sanitize library so I stuck with common, which seems to work for basic script tags. This should work for what I need.
def text(node : Markd::Node, entering : Bool)
output(sanitize(node))
end
def html_block(node : Markd::Node, entering : Bool)
newline
content = @options.safe? ? "<!-- raw HTML omitted -->" : sanitize(node)
literal(content)
newline
end
def html_inline(node : Markd::Node, entering : Bool)
content = @options.safe? ? "<!-- raw HTML omitted -->" : sanitize(node)
literal(content)
end
def sanitize(node : Markd::Node) : String
sanitizer = Sanitize::Policy::HTMLSanitizer.common
sanitizer.process(node.text)
end
Looks great. Yes, HTMLSanitizer.common
should be the right configuration for this job 👍
Would sanitize be added as another vendorized shard?
Yes, we need all the compiler's dependencies vendored in.
If you would like to prepare a PR, I'd really appreciate it.
I would be happy to. Thank you!
Unfortunately, https://github.com/crystal-lang/crystal/pull/13786 added a dependency on libxml2
to the compiler.
This library is currently unavailable in the build environment, so nightlies are breaking: https://app.circleci.com/pipelines/github/crystal-lang/crystal/14924/workflows/dc50e27f-1e26-4e77-adaa-a1fa925925a9
We need to revert it (#14595) to fix the builds.
We can introduce it again, but with precautions. Some ideas:
libxml2
-Dwithout_libxml2
) to work in minimal environments where the library is unavailable.Don't we already use/link against libxml2 for XML?
https://github.com/crystal-lang/crystal/blob/453a1591f7175aba81871b80411324929bd96cd5/NOTICE.md?plain=1#L34 https://github.com/crystal-lang/crystal/blob/453a1591f7175aba81871b80411324929bd96cd5/src/xml/libxml2.cr#L7
How do we already handle it currently with the XML::Reader and such?
Pretty sure that's for the stdlib if you require "xml"
. The compiler itself doesn't use libxml2
at all.
Yes, this information is for the standard library.
The compiler only uses parts of the standard library, and up until now does not use XML
. So libxml2
is currently not neccary for building the compiler. But the santization code uses XML
which introduces a new dependency for the compiler.
Personally, I don't have a problem with just requiring access to libxml2, but can understand that not being the most ideal solution.
- Make the dependency optional (flag -Dwithout_libxml2) to work in minimal environments where the library is unavailable.
- This flag could be added by default in the makefile, i.e. you'll need to opt in to using sanitization.
- If the flag is used, we might want to show some kind of warning when generating docs.
This is probably the most sensible solution outside of requiring libxml2.
I'm actually not sure anaymore that opting out by default is a good idea. Sorry for the inconvenience 🙈
We should be safe by default and if you want less safety, you should have to opt-out of it. Not the other way around.
We could consider doing this temporarily to softly introduce the new dependency. However, I doubt this is worth it. It's probably better to fail the build by default. This gives maintainers the option to either install libxml2 or opt out, which should not be much trouble.
This should all work fine. The only reason we had to revert #13786 is because our build environment needs to be prepared. But that should be easy enough.
Another insight on this: By adding more libraries and features we're expanding the complexity of the compiler quite a bit. Escaping HTML is not a core feature of a program that's expected to compile source code.
We might want to consider splitting the doc generator out of the main compiler program. This had been mentioned previously in https://github.com/crystal-lang/crystal/issues/6947#issuecomment-561368164: The compiler is only needed for the part of the job that's directly related to parsing the source code. It can dump that information into a structured format, and a separate tool would build the HTML. I belief this would simplify a couple of things and would make it easier to add new features that require additional libraries (like this one).
I thought about that several times when developing crystaldoc.info, didn't only due to the complexity of the task and keeping up with updates (which we wouldn't run into if this were official).
Would people still be able to generate docs using crystal doc
and have the complexity handled for them?
This issue has been mentioned on Crystal Forum. There might be relevant details there:
https://forum.crystal-lang.org/t/the-weight-of-compiler-tools/6888/1
I think moving the doc generator into its own tool is the right way forward, but is going to take some time to flesh out and implement (several months to a year). In the meantime, I think it'd be better to merge the sanitization branch now, so it's out of the way as this refactoring is happening. What do you think (pending switching the default to libxml2 being on)?
Bug Report
See https://github.com/kostya/lexbor/issues/25. The document generator didn't escape this div which led to a broken output. However, this leads to a bigger issue of being able to do cross-site scripting and other malicious attacks for anyone hosting code documentation - something I need to prevent when working on crystaldoc.info.
Minimum example: