Risks if generating PDF from user supplied HTML?

MartyMcMartface commented 7 years ago

Hi Daniel,

I want to use openhtmltopdf to generate PDFs that are partially based on user supplied HTML on a corporate website. I'm wondering if you could comment about any security issues that might raise, as it has the security team raising their eyebrows.

For example, there might be a bug in the image rendering code that allows code to be executed on the server if a specially crafted image is supplied to it. Personally I think that scenario is very unlikely, but I'd like to hear your take on it. The image rendering code is part of the PDFBox library, isn't it? If that is pure Java, then it's not really vulnerable to a stack buffer overflow attack, is it? To me the probability of such an attack is exceedingly small, but I have to convince the security team and I don't have any real knowledge of how that part of the library is implemented.

Are you aware of any other risks and how they can be mitigated when rendering user supplied HTML?

Thanks in advance, Martin

danfickle commented 7 years ago

Hi @MartyMcMartface, I think it is unlikely that there is a risk of arbitrary code execution. However, there are other risks:

CPU exhaustion from things such as endless loops.
Memory exhaustion (see #51 for example).
Information disclosure or denial of service from outgoing links.

Personally, if you have to use user supplied HTML, I would use something like a locked-down Docker container per instance of use. Given that arbitrary code execution is unlikely, it should be extremely difficult to escape such a container.
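As a rough illustration of bounding the CPU-exhaustion risk inside the JVM, here is a minimal sketch that runs a render job on a worker thread and gives up after a deadline. The class name TimeBoxedRender and the method renderWithTimeout are hypothetical, and the job is just a Callable standing in for whatever code produces the PDF bytes. Note the limitation: cancelling only interrupts the worker, so a tight loop that never checks for interruption keeps burning CPU. This is why a separate process or container with hard limits is still the stronger option.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of openhtmltopdf.
final class TimeBoxedRender {

    /**
     * Runs the job on a daemon worker thread and waits at most timeoutMillis.
     * Returns the job's bytes, or null if the deadline passed or the caller
     * was interrupted while waiting.
     */
    static byte[] renderWithTimeout(Callable<byte[]> job, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "pdf-render");
            t.setDaemon(true); // a stuck render must not keep the JVM alive
            return t;
        });
        Future<byte[]> result = worker.submit(job);
        try {
            return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt only; a tight loop may ignore it
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            result.cancel(true);
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("render failed", e.getCause());
        } finally {
            worker.shutdownNow();
        }
    }
}
```

A caller would pass the actual rendering code as the job and treat a null result as "rejected: took too long".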

MartyMcMartface commented 7 years ago

Thanks Daniel.

The fact that there is some sort of memory leak bug in either PDFBox or ImageIO is evidence enough to be wary of it. If it only happens with embedded URI encoded images I can test the input for that, but if not it could cause problems even without malicious user intent. It also confirms that ImageIO uses native code to read a simple JPEG file which is a bit surprising.

Could you please elaborate on what you mean by "outgoing links"? Are you referring to images in the supplied HTML that point to third party servers? If so I was planning to prevent that issue by getting the user's browser to fetch all the required images and post them to the server. If that's not what you meant I'd be interested to know what other risk you are referring to.

Also how could the user cause an endless loop?

If I decide not to include the PDF generation functionality in the main application, do you know of any provider of an HTML to PDF web service that uses openhtmltopdf as a back end? That might be more cost effective than creating one. I'm aware of docraptor.com (using a different back end) but it's a little pricey for my project.

Thanks again for taking the time to respond. Martin

danfickle commented 7 years ago

Yes. Outgoing links could also be fetched from your own server, for example with the file:// protocol. I think you can work around that with a URI resolver. In regard to endless loops, I'm not aware of any bugs that cause them, but that doesn't mean there aren't any!
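To make the URI resolver idea concrete, here is a minimal allow-list sketch in plain Java. It is an illustration, not openhtmltopdf code: the class UriAllowList, the method resolveOrNull and the host static.example.com are all hypothetical, and the two-argument shape (base URI plus the URI found in the document) is an assumption about what a resolver callback receives. Returning null is used here to mean "do not load this resource"; check how the library's resolver hook treats that before relying on it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of openhtmltopdf.
final class UriAllowList {

    // Assumption: only assets from this (made-up) host may be fetched.
    private static final Set<String> ALLOWED_HOSTS = Set.of("static.example.com");

    /** Returns the resolved absolute URI if it may be fetched, otherwise null. */
    static String resolveOrNull(String baseUri, String uri) {
        if (uri == null) {
            return null;
        }
        try {
            URI resolved = (baseUri == null
                    ? new URI(uri)
                    : new URI(baseUri).resolve(uri)).normalize();
            String scheme = resolved.getScheme();
            String host = resolved.getHost();
            // file:, data:, jar: and other host-less URIs are rejected here.
            if (scheme == null || host == null) {
                return null;
            }
            if (!"https".equals(scheme.toLowerCase(Locale.ROOT))) {
                return null;
            }
            if (!ALLOWED_HOSTS.contains(host.toLowerCase(Locale.ROOT))) {
                return null;
            }
            return resolved.toString();
        } catch (URISyntaxException | IllegalArgumentException e) {
            return null; // unparseable input is treated as not allowed
        }
    }
}
```

With this sketch, a relative image path under an allowed https base resolves normally, while file:///etc/passwd, data: URIs, plain http URLs and hosts outside the list all come back as null. An allow-list is chosen over a block-list because it is easier to reason about: anything not explicitly permitted is refused.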

I don't know of anyone providing open-source PDF renderers as a service (they mostly use Prince, I believe). However, another option is to use a serverless architecture. I think AWS Lambda allows you to set a memory limit and a time limit per run.

The Lambda environment may be reused between runs, so it is not a protection against arbitrary code execution bugs, but it should prevent the other problems. Even better, it is free for small to medium workloads. The only issue may be returning the PDF. As far as I know, Lambda doesn't allow returning binary data, so you would have to either base64 encode the result or upload it to S3 and get it from there.
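The base64 route mentioned above needs nothing beyond the JDK. A minimal sketch, where the class name PdfPayload is hypothetical and the byte array stands in for the generated PDF:

```java
import java.util.Base64;

// Hypothetical helper for moving PDF bytes through a text-only response.
final class PdfPayload {

    /** Encodes raw PDF bytes as a base64 string. */
    static String encode(byte[] pdfBytes) {
        return Base64.getEncoder().encodeToString(pdfBytes);
    }

    /** Decodes the base64 string back into the original bytes on the receiving side. */
    static byte[] decode(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}
```

Base64 grows the payload by roughly a third, so for large PDFs the S3 upload may be the better fit.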

Best of luck with your project, Daniel.

danfickle / openhtmltopdf

Risks if generating PDF from user supplied HTML? #50