brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

restrict operations according to shell-escape, openin_any, openout_any #2218

Open xworld21 opened 1 year ago

xworld21 commented 1 year ago

Running LaTeXML on untrusted inputs is dangerous insofar as it will run arbitrary Perl code (by loading .ltxml files) and read from or write to arbitrary locations (in different phases: \input in TeX, document() in XSLT, etc.).

TeX has a simple security model: -shell-escape (and the environment variable shell_escape) controls arbitrary code execution; openin_any and openout_any[^1] control whether file access is restricted to the current/output directories or allowed across the whole filesystem.

Maybe LaTeXML could follow the same model?
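To make the kpathsea model concrete, here is a minimal Python sketch of how the three policy values could be checked for a single file name. It is a simplification based on my reading of the kpathsea documentation, not the real implementation: `is_allowed` and `output_dir` are hypothetical names (the latter standing in for the TEXMFOUTPUT directory), and the real check has additional special cases.

```python
from pathlib import PurePosixPath
from typing import Optional


def is_allowed(path: str, mode: str, output_dir: Optional[str] = None) -> bool:
    """Approximate an open(in|out)_any check for one file name.

    mode is 'a' (any), 'r' (restricted) or 'p' (paranoid).
    output_dir is a hypothetical stand-in for TEXMFOUTPUT.
    """
    if mode not in ("a", "r", "p"):
        raise ValueError(f"unknown policy: {mode!r}")
    if mode == "a":
        return True  # any file may be opened

    p = PurePosixPath(path)
    # 'r' and 'p': refuse dot files and dot directories (.bashrc, .ssh/...)
    if any(part.startswith(".") and part != ".." for part in p.parts):
        return False
    if mode == "r":
        return True

    # 'p' only: no parent-directory components ...
    if ".." in p.parts:
        return False
    # ... and absolute paths must sit under the output directory
    if p.is_absolute():
        return output_dir is not None and p.is_relative_to(PurePosixPath(output_dir))
    return True
```

With this sketch, `is_allowed("../secret.tex", "r")` is accepted but the same path is rejected under `"p"`, which is the distinction that matters for writes.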

[^1]: Documentation at https://tug.org/texinfohtml/kpathsea.html#Calling-sequence. Values a (all), p (paranoid), r (restricted), plus some backward compatibility aliases.

dginev commented 1 year ago

Some relevant prior discussion is in #606, which led to the secureio plugin.

Generally, it's quite hard to improve the safety profile of latexml in a way that can be claimed to be "complete", especially in the command-line use cases.

It is a little more manageable to containerize the conversion in e.g. a Docker image (related #1178) and pose restrictions on the source contents being passed in. The two approaches are not mutually exclusive, though.

xworld21 commented 1 year ago

I think this kind of change is more feasible in light of #2185 - I like to think I have found all the I/O happening in latexml, and adding some hooks to stop with errors when reading or writing outside the boundary set by open(in|out)_any is doable, in principle. But this could just be an intermediate step: first an implementation of -recorder, and once it seems complete, you can bolt on I/O filtering. (Full filtering requires #2053 to also catch I/O from LibXSLT.)
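A rough Python sketch of the "record first, filter later" idea, just to illustrate the shape (latexml itself is Perl, and none of these names exist there): `IOGuard`, `allowed_roots` and `enforce` are all hypothetical. Every open goes through one choke point that logs an INPUT/OUTPUT entry, in the spirit of TeX's -recorder; flipping `enforce` turns the same hook into a filter.

```python
import builtins
from pathlib import Path


class IOGuard:
    """Hypothetical single choke point for all file opens."""

    def __init__(self, allowed_roots, enforce=False):
        # Resolve roots once so symlinks and '..' cannot dodge the check
        self.allowed_roots = [Path(r).resolve() for r in allowed_roots]
        self.enforce = enforce
        self.log = []  # (kind, path) pairs, similar to lines of a .fls file

    def _inside(self, path):
        return any(path == root or root in path.parents
                   for root in self.allowed_roots)

    def open(self, name, mode="r", **kwargs):
        path = Path(name).resolve()
        read_only = mode.startswith("r") and "+" not in mode
        kind = "INPUT" if read_only else "OUTPUT"
        self.log.append((kind, str(path)))  # step 1: always record
        if self.enforce and not self._inside(path):  # step 2: optionally filter
            raise PermissionError(f"{kind} outside allowed directories: {path}")
        return builtins.open(path, mode, **kwargs)
```

Running with `enforce=False` for a while and inspecting `log` is a cheap way to check that the hook really sees all I/O before anything is blocked.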

For this to make any sense, one also needs some form of -shell-escape to forbid custom .ltxml bindings, i.e. bindings should be loaded from the default locations, but not from . unless specifically requested with --path=. or --shell-escape.
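In Python-flavoured pseudocode made runnable, the lookup rule I have in mind is roughly the following. This is only a sketch: `binding_search_path`, `find_binding`, `default_dirs` and the `cwd` parameter are hypothetical names, and --shell-escape is a proposed option, not one latexml has today.

```python
import os


def binding_search_path(default_dirs, user_paths=(), shell_escape=False, cwd="."):
    """Directories to search for .ltxml bindings under the proposed policy."""
    # Explicit --path entries are trusted, since the user asked for them
    dirs = [os.path.abspath(os.path.join(cwd, p)) for p in user_paths]
    # Default (installed) locations are always searched
    dirs += list(default_dirs)
    # The current directory is only added when --shell-escape is given
    here = os.path.abspath(cwd)
    if shell_escape and here not in dirs:
        dirs.insert(0, here)
    return dirs


def find_binding(name, dirs):
    """Return the first NAME.ltxml found in dirs, or None."""
    for d in dirs:
        candidate = os.path.join(d, name + ".ltxml")
        if os.path.isfile(candidate):
            return candidate
    return None
```

The point is that a stray foo.sty.ltxml sitting next to an untrusted document is simply never found unless the user opts in.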

If the above is workable, latexml could reach the same safety profile as a normal LaTeX run, which is a familiar thing.

Of course I am making big assumptions about the other tools (dvipng, dvisvgm, Ghostscript, and [shudders] ImageMagick) having a similar safety approach, i.e. not reading from/writing to arbitrary locations when fed dodgy inputs.