dbuenzli / bos

Basic OS interaction for OCaml
http://erratique.ch/software/bos
ISC License
65 stars 16 forks source link

OS.Cmd \n -> \r\n conversions on Windows #65

Open hcarty opened 8 years ago

hcarty commented 8 years ago

Under Windows, using the following programs the \n in data is converted to \r\n when the two programs are compiled and run as main.exe test.exe:

screen shot 2016-10-28 at 3 43 04 pm

Under something more Unix-y there is no issue, I'm guessing because there is no difference there between "text" and "binary" IO.

main.ml

open Rresult
open Bos

let () =
  let exe = Fpath.of_string Sys.argv.(1) |> R.error_msg_to_invalid_arg in
  let cmd = Cmd.(v (p exe)) in
  let data = "Hello world\n\x00\x00booooo" in
  let cmd_stdin = OS.Cmd.in_string data in
  let run = OS.Cmd.run_io ~err:OS.Cmd.err_stderr cmd cmd_stdin in
  match OS.Cmd.out_string run with
  | Ok (output, (_, `Exited 0)) ->
    if output = data then
      print_endline "Good"
    else (
      print_endline "Not good";
      Printf.printf "%S\n" data;
      Printf.printf "%S\n" output
    )
  | _ -> print_endline "Bad"

test.ml

open Rresult
open Bos

let () =
  OS.File.read OS.File.dash
  |> R.error_msg_to_invalid_arg
  |> print_string
dbuenzli commented 8 years ago

Isn't the problem here that you are using Pervasives.print_string in test.ml ? I guess this is what does the translation.

I just checked and the OS.Cmd combinators only use Unix.openfile (which AFAIK is oblivious of the distinction between text and binary) and OS.File only uses open_in_bin.

dbuenzli commented 8 years ago

In fact you are using OS.File.read with OS.File.dash which in this case will not use open_in_bin but simply use Pervasives.stdin. So I guess this is the actual source of the problem.

I could use Pervasives.set_binary_mode_in on Pervasives.stdin before doing the read but there's no way to query the status to set it back to its previous state. I guess these are the only things that can be done:

  1. Simply document this fact and let users call set_binary_mode_{in,out} themselves.
  2. Call set_binary_mode_{in,out} on stdin and stdout in OS.File's IO functions and document the side effect performed on the channel.

What would Windows users expect ?

I kind of think that these automatic translations are more harmful than beneficial so I'm tempted by 2. With 1. it basically means that non Windows user have to think about this and they will not.

/cc @fdopen @dra27

dra27 commented 8 years ago

The source of this problem is print_string - all the calls in main.ml use binary Unix.file_descr so test.ml should only receive \n on stdin, therefore the fact that stdin is in text mode won't matter. However, since stdout is in text mode in test.ml, print_string will translate \n\r\n which main.ml reads back without translation. However, if main.ml actually had "Hello world\r\n\x00\x00booooo" then this example would work, but for scary and unstable reasons! (and weird examples like "Hello world\r\n\n\x00\x00booooo" would not work)

I think the expectation would be that stdhandles are treated in the same way as files, certainly. I don't have a reasoned opinion on which way it should be, though, I'm afraid! There is already a disparity between how Unix and Pervasives do this, for example:

Unix.(write stdout "\n" 0 1) |> ignore;
Pervasives.(output stdout "\n" 0 1) |> ignore

writes "\n\r\n" to stdout on Windows.

I expect that the rationale for this may be that the standard library behaves like the underlying C runtime (where \n really should mean "do whatever is necessary to make a line-end" and in Microsoft C by default does translations) vs the Unix library which aims to behave the same way on all platforms as far as is reasonably possible.

You can detect whether a channel is in binary mode via an exposed runtime function, but you have to use a slightly unscrupulous, if entirely safe, C stub to get at it:

#include <caml/memory.h>

CAMLextern int caml_channel_binary_mode (void *);

CAMLprim value caml_ml_channel_binary_mode(value vchannel)
{
  CAMLparam1(vchannel);
  CAMLreturn(Val_bool(caml_channel_binary_mode(*((void **) (Data_custom_val(vchannel))))));
}

which can then be used with

external get_binary_mode_in : in_channel -> bool = "caml_ml_binary_mode"
external get_binary_mode_out : out_channel -> bool = "caml_ml_binary_mode"

It's a surprising omission from Pervasives...