cisco / ChezScheme

Chez Scheme
Apache License 2.0

Feature Request: Add a function to execute a file without going through the Shell. #604

Open luke21923 opened 2 years ago

luke21923 commented 2 years ago

Hello,

I have seen some software projects using Python as a scripting language for the build process (instead of the Bourne Shell). I like the idea, because it makes it easier to build on Windows.

I would like to use Chez Scheme this way, because it is already a dependency for my project. With Chez Scheme, I can call gcc with the system function, but it is not ideal, because it goes through the Shell, and substitutions might occur, depending on the default Shell interpreter (bash or ksh or csh or cmd.exe/PowerShell or ...).

If you think it is a good idea to have a function that executes a file without involving the Shell, and you consider adding it to Chez Scheme, you might want to take a look at how it is done in other Schemes first:

GNU Guile (there is no way to save the output to a log file):

(system* "ls" "-a" "-l")

Racket (there is no way to save the output to a log file):

(system* "/bin/ls" "-a" "-l")

MIT Scheme:

(load-option 'synchronous-subprocess)
(run-synchronous-subprocess "ls" '("-a" "-l") 'output port)

And Python 3 does it this way:

import subprocess
subprocess.run(["ls", "-a", "-l"], stdout=file_handle)

On Unix, this feature probably requires the use of fork() and execv() instead of system().

Thanks for this great Scheme implementation!

akeep commented 2 years ago

Actually, Chez Scheme already supports this via process or open-process-ports. Internally, this uses fork on Unix-like operating systems and CreateProcessW on Windows (see the code for s_process in the c subdirectory for reference).

The only big difference is that these functions provide I/O for stdin, stdout, and (in the case of open-process-ports) stderr, as well as the pid.

See https://cisco.github.io/ChezScheme/csug9.5/foreign.html#./foreign:s5 for more information.
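As a rough sketch of what open-process-ports offers: it returns four values (an output port connected to the subprocess's stdin, input ports for its stdout and stderr, and the pid), so the two output streams can be read separately. The helper name run-capture is made up for this illustration, a Unix-like system is assumed, and, as noted further down in this thread, the command string is still handed to a shell.

```scheme
;; Hypothetical helper (not part of Chez Scheme): run a command line and
;; return the text of its stdout and stderr as two values.
(define (run-capture command)
  (let-values ([(to-stdin from-stdout from-stderr pid)
                (open-process-ports command
                                    (buffer-mode block)
                                    (native-transcoder))])
    (close-port to-stdin)               ; the child sees EOF on its stdin
    ;; Reading one stream to the end before the other is fine for small
    ;; outputs; a child that fills the stderr pipe first could stall.
    (let* ([out (get-string-all from-stdout)]
           [err (get-string-all from-stderr)])
      (close-port from-stdout)
      (close-port from-stderr)
      ;; get-string-all yields the eof object when a stream is empty.
      (values (if (eof-object? out) "" out)
              (if (eof-object? err) "" err)))))

(let-values ([(out err) (run-capture "echo to-out; echo to-err 1>&2")])
  (printf "stdout: ~s~%stderr: ~s~%" out err))
```

The sketch does not wait for the subprocess or collect its exit value; it only shows the port plumbing.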

You could write something like system* as:

(define (system* cmd . args)
  (apply (lambda (from-stdout to-stdin pid)
           (close-output-port to-stdin)
           (display (get-string-all from-stdout))
           pid)
    (process (format "~s~{ ~s~}" cmd args))))

This would display the output to stdout and return the process id (similar to my reading of the GNU Guile documentation you linked):

> (system* "ls" "-a" "-l")
total 177608
drwxr-xr-x@ 25 akeep  admin       800 May  3  2021 .
drwxr-xr-x@ 15 akeep  admin       480 May  3  2021 ..
-rwxr-xr-x@  1 akeep  admin       683 May  3  2021 drracket
-rwxr-xr-x@  1 akeep  admin       763 May  3  2021 gracket
-rwxr-xr-x@  1 akeep  admin       776 May  3  2021 gracket-text
-rwxr-xr-x@  1 akeep  admin       740 May  3  2021 mred
-rwxr-xr-x@  1 akeep  admin       776 May  3  2021 mred-text
-rwxr-xr-x@  1 akeep  admin       707 May  3  2021 mzc
-rwxr-xr-x@  1 akeep  admin       715 May  3  2021 mzpp
-rwxr-xr-x@  1 akeep  admin  45432096 May  3  2021 mzscheme
-rwxr-xr-x@  1 akeep  admin       717 May  3  2021 mztext
-rwxr-xr-x@  1 akeep  admin       720 May  3  2021 pdf-slatex
-rwxr-xr-x@  1 akeep  admin       685 May  3  2021 plt-games
-rwxr-xr-x@  1 akeep  admin       699 May  3  2021 plt-help
-rwxr-xr-x@  1 akeep  admin       702 May  3  2021 plt-r5rs
-rwxr-xr-x@  1 akeep  admin       702 May  3  2021 plt-r6rs
-rwxr-xr-x@  1 akeep  admin       709 May  3  2021 plt-web-server
-rwxr-xr-x@  1 akeep  admin  45415504 May  3  2021 racket
-rwxr-xr-x@  1 akeep  admin       707 May  3  2021 racket-documentation
-rwxr-xr-x@  1 akeep  admin       684 May  3  2021 raco
-rwxr-xr-x@  1 akeep  admin       706 May  3  2021 scribble
-rwxr-xr-x@  1 akeep  admin       704 May  3  2021 setup-plt
-rwxr-xr-x@  1 akeep  admin       716 May  3  2021 slatex
-rwxr-xr-x@  1 akeep  admin       685 May  3  2021 slideshow
-rwxr-xr-x@  1 akeep  admin       697 May  3  2021 swindle
51325

(Yes, I called system* in the Racket v8.1 bin directory---I happened to test out system* there to make sure I understood what it did.)

Also, note the use of format in the call to process to put the command and arguments together.

jltaylor-us commented 2 years ago

process still invokes a shell (/bin/sh, specifically), and the example quotes but does not escape string arguments.

Writing a function to properly escape the arguments is not exactly complicated, but it is extra work that wouldn't have to happen if a more direct interface to fork/exec were provided.
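For reference, a minimal sketch of such an escaping function for a POSIX sh, assuming the usual single-quote convention: wrap each argument in single quotes and rewrite every embedded single quote as '\''. The names shell-quote and shell-command-line are invented here, and the approach does not apply to cmd.exe or PowerShell.

```scheme
;; Hypothetical helper: quote one argument for a POSIX shell.
(define (shell-quote str)
  (let ([out (open-output-string)])
    (write-char #\' out)
    (string-for-each
      (lambda (c)
        (if (char=? c #\')
            (display "'\\''" out)   ; close quote, escaped quote, reopen
            (write-char c out)))
      str)
    (write-char #\' out)
    (get-output-string out)))

;; Hypothetical helper: join a command and its arguments into one
;; command line in which every word is quoted.
(define (shell-command-line cmd . args)
  (fold-left (lambda (acc arg) (string-append acc " " (shell-quote arg)))
             (shell-quote cmd)
             args))

(display (shell-command-line "echo" "$$" "it's"))
(newline)
;; prints: 'echo' '$$' 'it'\''s'
```

The resulting string could be passed to system or process, but a shell is still started, which is exactly the extra work and extra moving part described above.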

akeep commented 2 years ago

Ugh. You are correct, my apologies, I missed the /bin/sh in the execl call. The quoting was initially accidental, but I realized I had done that, and it actually allows for arguments that have spaces in them, which I liked, so I left it that way :)

luke21923 commented 2 years ago

Thank you for trying. On Debian, when I try akeep's (system*) function, it works as expected for this call (there is no shell substitution):

> (system* "echo" "*")
*
2026

But unfortunately there is a shell substitution for this call:

> (system* "echo" "$$")
2029
2029
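The difference between the two calls follows from the command string that the format call builds. The ~s directive writes each string with double quotes, and inside double quotes a POSIX shell suppresses filename expansion (so * stays literal) but still performs parameter expansion (so $$ becomes the shell's pid). The snippet below only prints the strings; command-string is a name made up for this illustration.

```scheme
;; Hypothetical helper: the same format call as in system*, without
;; running anything.
(define (command-string cmd . args)
  (format "~s~{ ~s~}" cmd args))

(display (command-string "echo" "*"))
(newline)
;; prints: "echo" "*"
(display (command-string "echo" "$$"))
(newline)
;; prints: "echo" "$$"
```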

I think I could get by with the current (system) function, if I only invoke gcc on simple alphanumeric filenames. But it would not be a very robust design.

The idea was to use Chez Scheme as a cross-platform Shell. And the basic task of a Shell is to execute programs in subprocesses (create subprocess, pass arguments, set up stdin/stdout/stderr, wait for subprocess termination, collect exit value).

akeep wrote: "This would display the output to stdout and return the process id (similar to my reading of the GNU Guile documentation you linked)"

Guile's function displays the output to stdout and returns the process's exit value (typically zero when everything went well). Returning the subprocess id would make sense if we were executing it asynchronously. But we execute it synchronously (we wait for its termination), so its exit value is the more relevant information to return.
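For comparison, the existing system procedure in Chez Scheme is also synchronous and, as far as I can tell from the user's guide, returns the exit code of the subprocess rather than a pid. A small check, assuming a Unix-like system where the command is run by sh:

```scheme
;; The command is handed to the shell, which exits with status 3.
(define status (system "exit 3"))
(printf "exit value: ~s~%" status)
```

So the return-value convention is already there; what is missing is a variant that takes the program and its arguments separately.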

It would be more useful to store the output in a log file, though. The MIT Scheme implementation is better at that, because it provides optional arguments allowing us to redirect the standard output of the subprocess to a specific port.

A drawback of using MIT's interface is the length of the name (run-synchronous-subprocess is 26 characters long). However, this long name has the advantage of being crystal clear.

Concerning the handling of the $PATH environment variable, I think Racket has a neat solution: just ignore it, and provide another function to locate an executable with the help of the $PATH environment variable. Doing it this way allows us to bypass $PATH completely if we want to.

jltaylor-us commented 2 years ago

I wonder if scsh still works.

melted commented 2 years ago

It would be nice if there was a way to execute processes without starting a shell. You don't want to send any untrusted data via the shell. Sending it as args to one program just requires that that program can't be subverted with garbage input. Say you have a web service that converts jpg to png by running magick <filename>.jpg <filename>.png. If it shells out, a user could supply a file named "; rm -rf / .jpg" and bad things would happen. If it doesn't shell out, imagemagick will just convert the file (provided imagemagick can't be subverted via odd file names of course).
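To make the failure mode concrete, here is what a naively concatenated command line looks like for that file name. The helper naive-convert-command is invented for this illustration, and the snippet deliberately only builds and prints the string.

```scheme
;; Hypothetical helper: the unsafe way to build the command line.
(define (naive-convert-command in out)
  (string-append "magick " in " " out))

(define hostile-name "; rm -rf / .jpg")

;; Only build and print the string here; do NOT pass it to system.
(display (naive-convert-command hostile-name "out.png"))
(newline)
;; prints: magick ; rm -rf / .jpg out.png
```

A shell reads that line as two commands separated by the semicolon: magick with no arguments, followed by rm -rf / .jpg out.png. With an argument-vector interface the whole file name would reach magick as a single argument.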

luke21923 commented 2 years ago

There are indeed a bunch of security issues with the Bourne Shell.

I read the paper describing the design principles behind scsh. That project is interesting, but it turns out that scsh targets specifically Unix, so it is a non-starter for me (Windows support is not optional).

scsh has the ability to connect the standard output of one subprocess to the standard input of another subprocess (this is an unnamed pipe). I don't know if it is possible or desirable to achieve this with Scheme ports. But that is an issue regarding the port module; I guess it is independent of a potential inclusion of an MIT-style run-synchronous-subprocess function into Chez Scheme.

LiberalArtist commented 2 years ago

Racket (there is no way to save the output to a log file):

(system* "/bin/ls" "-a" "-l")

FYI, Racket arranges for stdout and stderr to be attached to (current-output-port) and (current-error-port), respectively, e.g.:

#lang racket
(require rackunit)
(check-equal?
 (with-output-to-string
   (λ ()
     (system* "/usr/bin/echo" "$$")))
 "$$\n")

So you can use with-output-to-file or whatever to write to a file. Lower-level functions like process*/ports provide even more options.

I believe Guile supports something similar, though I'm less familiar with the details.

You might also be interested in William Hatch's Rash: The Reckless Racket Shell. (There's also a GPCE paper.)

AlQuemiste commented 2 years ago

A source of inspiration would be Perl's system (or exec) command: "If there are no shell metacharacters in the argument [of system], it is split into words and passed directly to execvp, which is more efficient." See perldoc -f system or https://perldoc.perl.org/functions/system.

jltaylor-us commented 2 years ago

Magically changing the behavior based on scanning the string and assuming we know what /bin/sh would do with it seems like an even worse idea.

melted commented 2 years ago

Efficiency is not why I would like a shell-free alternative; it is that starting a shell makes the whole thing very hard to secure. Falling back to execv only when a shell is not needed doesn't help that use case.

lambdadog commented 2 years ago

I did a proof of concept for this with plain scheme to see how it would work, and you can absolutely just do something like this (ignore the newline I forgot to add in the display, oops):

poc

execvp.sls

;; -*- mode: scheme; coding: utf-8 -*-
;; Copyright (c) 2022 
;; SPDX-License-Identifier: MIT
#!r6rs

(library (execvp)
  (export call
          call-output-to-file)
  (import (chezscheme))

  (define (dup fd)
    ((foreign-procedure "dup" (int) int) fd))
  (define (dup2 fd1 fd2)
    ((foreign-procedure "dup2" (int int) int) fd1 fd2))

  (define (execvp prog args)
    ((foreign-procedure "execvp" (string void*) int) prog args))

  ;; Copy a Scheme string into freshly allocated, NUL-terminated foreign
  ;; memory. The memory is never freed (acceptable for a PoC).
  (define (string->cstring str)
    (let* ([bv (string->bytevector str (native-transcoder))]
           [len (bytevector-length bv)]
           [buf (foreign-alloc (* (+ 1 len) (foreign-sizeof 'unsigned-8)))])
      (let loop ([idx 0])
        (when (< idx len)
          (foreign-set! 'unsigned-8 buf
                        (* idx (foreign-sizeof 'unsigned-8))
                        (bytevector-u8-ref bv idx))
          (loop (+ idx 1))))
      (foreign-set! 'unsigned-8 buf
                    (* len (foreign-sizeof 'unsigned-8))
                    0)
      buf))

  ;; Build a NULL-terminated argv array from a list of Scheme strings.
  (define (string*->arg* str*)
    (let* ([len (length str*)]
           [buf (foreign-alloc (* (+ 1 len) (foreign-sizeof 'void*)))])
      (let loop ([idx 0]
                 [str* str*])
        (unless (null? str*)
          (foreign-set! 'void* buf
                        (* idx (foreign-sizeof 'void*))
                        (string->cstring (car str*)))
          (loop (+ idx 1) (cdr str*))))
      (foreign-set! 'void* buf
                    (* len (foreign-sizeof 'void*))
                    0)
      buf))

  (define (call prog str*)
    (if (= 0 ((foreign-procedure "fork" () int)))
        (begin
          (execvp prog (string*->arg* (cons prog str*)))
          ;; execvp returns only on failure; leave the child instead of
          ;; continuing as a second copy of the Scheme program.
          (exit 127))
        (begin
          ((foreign-procedure "wait" (void*) int) 0)
          (void))))

  (define (call-output-to-file file prog str*)
    (if (= 0 ((foreign-procedure "fork" () int)))
        (call-with-output-file file
          (lambda (port)
            ;; Point the child's stdout and stderr at the file.
            (dup2 (port-file-descriptor port) 1)
            (dup2 (port-file-descriptor port) 2)
            (execvp prog (string*->arg* (cons prog str*)))
            ;; Reached only if execvp failed.
            (exit 127)))
        (begin
          ((foreign-procedure "wait" (void*) int) 0)
          (void))))

  (load-shared-object #f))

I wouldn't be surprised if there's a better way to write the foreign bit, but it's just a PoC.
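The PoC discards the status that wait reports. One possible way to collect the exit value is sketched below, assuming a Unix-like system where pid_t is an int and the wait status uses the bit layout found on Linux (C code would use the WIFEXITED, WEXITSTATUS and WTERMSIG macros instead). Both procedure names are made up for this sketch, and the negative-signal convention is borrowed from what system does.

```scheme
(load-shared-object #f)   ; make libc symbols such as waitpid visible

;; Hypothetical helper: decode a raw wait status. Assumes the low seven
;; bits hold the terminating signal (0 for a normal exit) and bits 8-15
;; hold the exit code.
(define (wait-status->exit-code status)
  (let ([sig (bitwise-and status #x7f)])
    (if (= sig 0)
        (bitwise-and (bitwise-arithmetic-shift-right status 8) #xff)
        (- sig))))          ; killed by a signal: negated signal number

;; Hypothetical helper: block until the child PID terminates and return
;; its decoded status, or #f if waitpid itself failed.
(define (wait-for-child pid)
  (let* ([buf (foreign-alloc (foreign-sizeof 'int))]
         [r ((foreign-procedure "waitpid" (int void* int) int) pid buf 0)]
         [status (foreign-ref 'int buf 0)])
    (foreign-free buf)
    (and (>= r 0) (wait-status->exit-code status))))
```

In call, the parent branch would then keep the pid returned by fork and return (wait-for-child pid) instead of (void).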

lambdadog commented 2 years ago

On Windows you would want to use _spawnvp over fork+execvp, I imagine, although I prototyped this on Linux and don't have a Windows machine in easy reach -- sorry! Looks like to replicate the behavior I have here you would just do a synchronous _spawnvp -- you would need to use _dup and _dup2 to cache and restore stdout and stderr, as opposed to just using dup2 to set it, though. Something like: (untested)

(define (dup fd)
  ((foreign-procedure "_dup" (int) int) fd))
(define (dup2 fd1 fd2)
  ((foreign-procedure "_dup2" (int int) int) fd1 fd2))

(define (spawnvp mode prog args)
  ((foreign-procedure "_spawnvp" (int string void*) int) mode prog args))

(define (call prog str*)
  ;; Mode 0 is _P_WAIT: _spawnvp returns once the child has exited.
  (spawnvp 0 prog (string*->arg* (cons prog str*)))
  (void))

(define (call-output-to-file file prog str*)
  (call-with-output-file file
    (lambda (port)
      ;; Save the current stdout/stderr so they can be restored afterwards.
      (let ([stdout (dup 1)]
            [stderr (dup 2)])
        (dup2 (port-file-descriptor port) 1)
        (dup2 (port-file-descriptor port) 2)
        (spawnvp 0 prog (string*->arg* (cons prog str*)))
        (dup2 stdout 1)
        (dup2 stderr 2)
        (void)))))

fork+execvp should work just fine on OS X as well, although I haven't tested it.

If you wanted to process the output of the command I'm not sure exactly what I'd recommend -- dup and dup2 only work with file descriptors and only file ports have those -- well, and stdin, stdout, stderr. You could perhaps create a pipe (linux, osx, windows) from the process's stdout to your stdin (making sure you've consumed all stdin first), then read it all on return (or if implementing async, which you should probably do, admittedly, read it while the process is running).

That said you can't really distinguish between stdout and stderr if doing that. memfd_create would work but it doesn't exist on Windows or MacOS, sadly, and anything more complex I would honestly want to work with from C and expose to Scheme, not work with entirely from Scheme.

I suspect outputting to a file and just reading it in is what you'd want to do anyway -- I expect most build systems do something similar so that the build logs are available.