frescobaldi / python-ly

A Python package and commandline tool to manipulate LilyPond files
https://pypi.org/project/python-ly

Is there a security risk with ly.colorize.HtmlWriter? #43

Closed uliska closed 9 years ago

uliska commented 9 years ago

I am using the HtmlWriter to colorize code in a GitBook. For that I'm passing an arbitrary string into HtmlWriter.html() that stems from arbitrary authors. They do have to have push access to the repository, but in the long run this should be considered "unsanitized" input. (I might, for example, open this up in the future to allow people with self-created accounts to edit it, and forget about this issue.)

So: is there a possible injection risk with that function or will it always "only" return useless results with non-LilyPond input?

PeterBjuhr commented 9 years ago

I don't think it can be assumed that the result is harmless. The potential risks depend on the input and how it is used.

uliska commented 9 years ago

That's what I was afraid of.

The input is taken from Markdown documents, concretely from the content of a template block. So the user who saves that file can write anything inside. And the result will be directly written to the HTML page of the GitBook.

So while I assume that the people I will give push access to the book repository are reliable, this is an assumption that shouldn't seriously be relied upon.

So what do you think: Should python-ly take some precautions here or would you say it's the responsibility of a user (concretely: my script) to ensure nothing harmful will be passed to python-ly?

PeterBjuhr commented 9 years ago

I don't think it's the responsibility of python-ly, because the possible threats depend on how it is used. It cannot be assumed that the result will be used in a web browser.

Is the Markdown document parsed as markdown and does the parser have any security checks?

On the Python side maybe you could find something here: http://lxml.de/lxmlhtml.html#cleaning-up-html

uliska commented 9 years ago

GitBook is a Node.js application and supports a templating mechanism called "blocks". When supported by a plugin the author can write for example

{% lilypond %}
\version "2.19.21"
{ 
  d e
}
{% endlilypond %}

The plugin author can define any number of such blocks. In the block handler he can access the content of the block (in the above example the complete LilyPond code) through block.body and do whatever seems appropriate. The handler is expected to return HTML, which is then inserted into the resulting page as-is.

So by default what goes into the handler (or in the concrete case: into python-ly) is a raw string, and the resulting output of python-ly is passed directly to the browser.
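The shape of such a handler can be sketched in Python like this. Note that `handle_block` and `colorize_lilypond` are hypothetical names for illustration; the stand-in only entity-encodes the input so the sketch stays self-contained, whereas the real script would call into python-ly:

```python
import html

def colorize_lilypond(source: str) -> str:
    # Hypothetical stand-in for the real python-ly colorizer.
    # Here we only entity-encode the raw text and wrap it, so the
    # sketch is self-contained and safe by construction.
    return '<pre class="lilypond">{}</pre>'.format(html.escape(source))

def handle_block(body: str) -> str:
    # GitBook hands the raw block content (block.body) to the handler;
    # whatever string the handler returns is inserted into the
    # resulting page as-is, without further sanitization.
    return colorize_lilypond(body)

print(handle_block('{ c d e }'))
```

The key point of the sketch is the last comment: because the return value is trusted verbatim by GitBook, the handler is the last place where escaping can happen.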

So it seems clear that I have to take care of possible injections. The question is: should I "sanitize" the output of python-ly or should I rather ensure beforehand what gets in? Of course dealing with HTML is much easier as there are solutions available such as the lxml link you posted. OTOH it would be interesting to see if injected code could actually do harm inside python-ly already.
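For reference (this is stdlib Python, not part of python-ly): the inert "display representation" that either approach has to produce can be seen with `html.escape`, which entity-encodes `&`, `<`, `>` and, by default, quotes:

```python
import html

# A sample untrusted snippet that an author could embed in a block.
payload = '<a href="#" onclick="alert(this)">Click Me</a>'

# html.escape() turns active markup into inert entity-encoded text.
escaped = html.escape(payload)
print(escaped)
# &lt;a href=&quot;#&quot; onclick=&quot;alert(this)&quot;&gt;Click Me&lt;/a&gt;
```

Any colorizer output that reproduces untrusted text only in this entity-encoded form cannot inject tags or scripts into the page.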

PeterBjuhr commented 9 years ago

For possible Python injections in python-ly itself the situation may be different regarding responsibility. But I'm not sure how that could happen if the string is never executed or evaluated.

uliska commented 9 years ago

That was another question I had. But I also think the input won't get executed. It is only parsed into tokens of the recognized languages.

I just did a test and passed <a href="#" onclick="alert(this)">Click Me</a> to the script. The result was quite promising:

<pre class="lilypond">
<span class="html-tag keyword">&lt;a</span> 
<span class="html-attribute variable">href</span>=<span class="html-string string">"#"</span>
<span class="html-attribute variable">onclick</span>=<span class="html-string string">"alert(this)"</span>
<span class="html-tag keyword">&gt;</span>Click Me<span class="html-tag keyword">&lt;/a&gt;</span>
</pre>

This makes me quite confident that python-ly will automatically "escape" any HTML or inline JavaScript to a "display representation" of it that shouldn't do any harm.

I will do some experiments with the lxml approach nevertheless.

PeterBjuhr commented 9 years ago

I tried this:

\relative { c1^"<script>alert(\"You have been hacked\");</script>" }

Also with promising result.

uliska commented 9 years ago

OK, thanks. I will leave it as it is for now. This issue will have to be reconsidered if I make that script more commonly available (which I think is a good idea but not necessarily intended right now). As it stands the thing will only be "used" by contributors who have received their push access to the repository personally from me, so it's not really publicly exposed.

wbsoft commented 9 years ago

If you pass text to ly.colorize, it is parsed by guessing the text contents (currently LilyPond, Scheme, HTML, LaTeX, DocBook and texinfo are recognized). If you want to force a particular parsing mode, you should call ly.document.Document.setmode() to set the mode on the document. (If you are using a Document inside Frescobaldi, Frescobaldi determines the mode in documentinfo.DocumentInfo.mode().)

You have to take care yourself of which text you use as input. Security concerns differ in all kinds of contexts; it depends on how you use the resulting text.

The HTML output of ly.colorize will never contain harmful HTML or scripts. Everything is escaped correctly, and only <span>, <pre> etc. tags are added.
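That guarantee can also be spot-checked from the outside with a small stdlib audit (a sketch, not part of python-ly): parse the generated HTML and record any start tag outside the expected whitelist. Escaped entities like `&lt;script&gt;` are seen by the parser as plain text, not as live tags.

```python
from html.parser import HTMLParser

class TagAuditor(HTMLParser):
    """Collects any start tag outside an allowed whitelist."""
    ALLOWED = {"pre", "span"}

    def __init__(self):
        super().__init__()
        self.unexpected = []

    def handle_starttag(self, tag, attrs):
        if tag not in self.ALLOWED:
            self.unexpected.append(tag)

# Sample resembling the colorized output shown earlier; the escaped
# &lt;script&gt; is inert text, so the auditor finds nothing.
sample = ('<pre class="lilypond"><span class="html-string string">'
          '&lt;script&gt;alert(1)&lt;/script&gt;</span></pre>')

auditor = TagAuditor()
auditor.feed(sample)
print(auditor.unexpected)  # expected: []
```

If the list came back non-empty, that would indicate an unescaped tag slipped through and the output would need external cleaning before being sent to a browser.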

uliska commented 9 years ago

Thanks. As you can see from my last comment, I settled on accepting that security level as sufficient for my current use case.