New collection, status propagation and background system

Firobe commented 3 years ago

This is meant to solve https://github.com/charlesetc/feather/issues/6 and more

So, there is a lot going on here, and there are several questions I have to clarify how we want to finalize everything. But first let me describe what this PR contains a bit.

As expected, there is now a small GADT system to select what we want to capture when running a command. It is simpler than the one described in the issue since I thought it was a better compromise to not over-engineer the code and keep a nice interface.

There are now two running commands (or three if we count run_bg): run and collect.

run runs the command as usual, does not capture anything and forgets its exit status. It returns unit.
collect takes a function which is a composition of stdout, stderr, stdout_lines and stderr_lines (the composition operator is <+> since >> is already taken. It captures only the selected ouputs and returns a record {stdout; stderr; status} where status is always present (an integer), and sdout/stderr fields have the abstract type not_collected if not selected, and type string or string list if selected, depending on if we want lines or not.

An example :

ls "." |> run;

let {stdout; _} = ls "." |> collect stderr in ...
(* This does not compile since stdout has type not_collected!*)
(* print_endline stdout *)

let {stdout; _} = ls "." |> collect stdout_lines in ...
(* stdout = ["a"; "b"; ...] *)
(* stderr is not usable *)

let {stderr; status; _} = ls "thisdoesnotexist" |> collect stderr in
(* stderr = "ls: cannot access 'thisdoesnotexist': No such file or directory" *)
(* status = 2 *)

let {stderr; stdout; status} = ls "." |. ls "thisdoesnotexist" |> collect (stdout <+> stderr) in
(* stdout, stderr and status are all defined with the previous values *)

(* We can also capture nothing and just get the status *)
let {status; _} = ls "." |> collect only_status in
(* status = 0 *)

To accomplish that, the exit status is now propagated along the evaluation of the AST: last_exit has been removed externally and internally. Hence the status is truly lost if we use run for example.

That led me down the rabbit hole of background execution (which was one of the motivation of the original issue). I realized with the help of @tmarti2 that it was fundamentally broken since the addition of the new operators &&., ||. and co. The original behavior for the background flag is to spawn each individual process of the pipeline in its own thread (and then we lost its exit status). However since &&. and ||. use that exit status to determine if we should run the right-hand side, running those operators in background was completely broken. Worse, since piping two commands A | B runs A in the background, the same problems arise if A contains &&. or ||.

To fix all that, this PR implements the following approach:

process (and filter_map) never spawns its own thread
if we use run_bg, the entire eval of the AST is done in its own thread
when piping A | B, the entire evaluation of A is done in its own thread (rather than only sub processes)

This way, even if a command is run in background (or is the left-hand side of a pipe), the exit status is correctly managed. As a consequence, background is removed from the context (since it's only used at two locations).

I added some tests for all that as well as a bit a code in example.ml

Some thoughts:

what do you think about the user interface for collect? Clearly it's not the only way to do it.
for now I left the old functions collect_stdout and collect_lines, defined as specialized versions of collect. Do we want to keep them ? (and if so, with the same name ?)
the composition operator is <+>. Is that OK?
currently if we want to only collect the status, we have to pass the identity function to collect (and I made a named identity function called only_status). Do we want to keep it like that? Use an optional parameter (but then we have to name it everywhere)?
the stdout and stderr collide with Base.stdout. Keep it like that or rename them? (and with what names?)

charlesetc commented 3 years ago

Just wanted to quickly say thanks! I've looked at this a bit but haven't really got a chance to dive in properly. Hopefully sometime this weekend.

charlesetc commented 3 years ago

So I've been thinking about this for a bit and I'm leaning towards a somewhat paired-down user-interface. Here's what I'm thinking:

I question whether it's actually that important to be able to collect just stderr and status without stdout. Or just stdout and stderr without status. The main things that I want to make easy are:

Running a process and grabbing stdout
Running a process and grabbing stderr
Running a process and grabbing both, interspersed as they arrive - which isn't supported in master

And then I think it's less important that it's easy to get the status, just important that it's possible.

So here's the UI I propose, I'm curious what you think:

type everything = { stdout : string ; stderr : string ; status : int }

type 'a what_to_collect
val stdout : string what_to_collect
val stderr : string what_to_collect
val status : string what_to_collect
val stdout_and_stderr : string what_to_collect (* string of the two interspersed *)
val everything : everything what_to_collect

val collect
    : ?cwd:string
    -> ?env:(string * string) list
    -> 'a what_to_collect
    -> cmd
    -> string

Also I think we should remove the stdout_lines and stderr_lines functions and instead expose a "lines" function that just splits on lines. In the end, the call sites would look like:

  let files = Feather.process "ls" |> collect stdout |> lines in

I like this since it doesn't complicate the normal use-case and so we're not pressured to special case it, by keeping collect_stdout around for instance.

If you end up wanting the status too, I don't think it's a problem to collect everything and just pull out what you want:

  let {stdout ; status ; _ } = Feather.process "ls" |> collect everything in

but it's nice this isn't necessary in the common case.

This would take a GADT under the hood, but above I haven't exposed the variants directly. One nice thing about exposing the variants would be that people could call something 'stdout' without having to worry about shadowing feather's helpers. But I think there's a real value in not using GADT syntax in the interface so people new to OCaml don't get tripped up. I also vaguely prefer collect stdout to collect Stdout. An alternative interface could be:

module Collect : sig
  val stdout : ?cwd:string -> ?env:(string * string) list -> cmd -> string
  val stderr : ?cwd:string -> ?env:(string * string) list -> cmd -> string
  val status : ?cwd:string -> ?env:(string * string) list -> cmd -> int
  val stdout_and_stderr : ?cwd:string -> ?env:(string * string) list -> cmd -> string

  type everything = { stdout : string ; stderr : string ; status : int }
  val everything : ?cwd:string -> ?env:(string * string) list -> cmd -> everything
end

with a call site:

  let files = Feather.process "ls" |> Collect.stdout |> lines in

This also avoids the collision with Base.stdout, though I don't mind that terribly. What do y'all think?

The bugfix for background execution seems awesome! Thanks for all the work in this PR — it'll definitely be a step forward for feather!

Firobe commented 3 years ago

Hey, thanks for looking into that!

3. Running a process and grabbing both, interspersed as they arrive - which isn't supported in master
Is it really not supported currently ? It seemed to me redirecting one to the other and then grabbing the other did exactly that. I wonder if adding a second way to do it is a good idea ? Maybe I'm missing something, but to me the natural, Unix way to intersperse both is via redirection.

I question whether it's actually that important to be able to collect just stderr and status without stdout. Or just stdout and stderr without status. The main things that I want to make easy are:

And then I think it's less important that it's easy to get the status, just important that it's possible.

If you end up wanting the status too, I don't think it's a problem to collect everything and just pull out what you want:
  let {stdout ; status ; _ } = Feather.process "ls" |> collect everything in
but it's nice this isn't necessary in the common case.

This interface is pretty nice and simple ! The only thing that bothers me is that I'm not sure I agree that it's not a problem to collect everything if you just want the status, or just the status and either stdout of stderr.

Sometimes you actively don't want to collect stdout of stderr because you want it to be printed on the terminal or wherever it's going (imagine for example a use case where the output of the script is piped into something else, but we want to capture the status (+stderror optionally) the manage errors).

So I still think having at least the possibility to choose precisely what you want is important (it does not have to be particularly simple tho). Ultimately the decision is of course up to you.

Also I think we should remove the stdout_lines and stderr_lines functions and instead expose a "lines" function that just splits on lines. In the end, the call sites would look like:

Agreed !

This would take a GADT under the hood, but above I haven't exposed the variants directly. One nice thing about exposing the variants would be that people could call something 'stdout' without having to worry about shadowing feather's helpers. But I think there's a real value in not using GADT syntax in the interface so people new to OCaml don't get tripped up.

I also completely agree. It's important to keep the interface as simple as possible.

I also vaguely prefer collect stdout to collect Stdout. An alternative interface could be: [...] This also avoids the collision with Base.stdout, though I don't mind that terribly.

Since I don't particularly mind the collision too, I'd personally lean towards the first interface rather than this Collect module.

The bugfix for background execution seems awesome! Thanks for all the work in this PR — it'll definitely be a step forward for feather!

Thank you :) Maybe I should separate the status reporting rework (which fixes background bugs) into a separate PR ?

Firobe commented 3 years ago

This branch has been rebased on master to include #15

charlesetc commented 3 years ago

Is it really not supported currently ? It seemed to me redirecting one to the other and then grabbing the other did exactly that. I wonder if adding a second way to do it is a good idea ? Maybe I'm missing something, but to me the natural, Unix way to intersperse both is via redirection.

Good point! Yeah we definitely don't need the stdout_and_stderr function as I described it.

Sometimes you actively don't want to collect stdout of stderr because you want it to be printed on the terminal or wherever it's going (imagine for example a use case where the output of the script is piped into something else, but we want to capture the status (+stderror optionally) the manage errors).

Another solid point! How would you feel about exposing:

val stdout_and_stderr : (string * string) what_to_collect
val stdout_and_status : (string * int) what_to_collect
val stderr_and_status : (string * int) what_to_collect

I think this would solve our use-case and we don't have to introduce another concept in addition to what_to_collect.

I also completely agree. It's important to keep the interface as simple as possible.

Sounds good!

Since I don't particularly mind the collision too, I'd personally lean towards the first interface rather than this Collect module.

Sweet, let's go with that then!

Thank you :) Maybe I should separate the status reporting rework (which fixes background bugs) into a separate PR ?

You're welcome to split it apart but I am also happy to review it in a chunk: whatever is best for you!

tmarti2 commented 3 years ago

Sorry for spam, I was stupid to put PR's id in my commit name Orz

Firobe commented 3 years ago

Hey! I rebased this branch on master and modified the interface and implementation with the what_to_collect type we discussed, which indeed simplifies things a lot. I also squashed my commits to avoid polluting the history.

Normally it should be ready to merge! The call sites look exactly like you described them.

The only difference for now is that when capturing everything, we return a string * string * int tuple instead of a record. I feel it is more consistent with the rest this way, but we could do as you described if you prefer to.

Firobe commented 3 years ago

Actually the README has to be changed, I forgot :p I'll do it if I have your approval for the interface itself.

charlesetc commented 3 years ago

Okay this PR is looking great! Thanks for the help :)

Re everything: I am a little sad that it's not obvious from the type which string is stdout and which string is stderr. I could imagine someone mixing those up by accident, or feeling like they had to check the documentation to make sure. I don't particularly mind the everything struct, but if you want to avoid it there's the following pattern:

val everything :                                                                                                      
  ([ `Stdout of string ] * [ `Stderr of string ] * int) what_to_collect

with call site:

let `Stdout stdout, `Stderr stderr, status =
  echo "hi" |> collect everything
in

I think I kind of prefer the record? What do you think?

If you're happy with my recent edits, I think this can be merged after we edit the readme and decide on what to do for everything!

charlesetc commented 3 years ago

Amazing, thank you!

Firobe commented 3 years ago

Too fast for me to comment ;)

Thanks for the edits, it looks good! Originally I came up with the failwith solution to avoid the small code duplication we now have in collect, but in the end the current implementation is shorter is safer.

Regarding everything, I must also agree with you in the end: it's safer and more elegant with a record than with variants.

Good work !

charlesetc commented 3 years ago

Sorry I was excited haha — You too!

charlesetc / feather

New collection, status propagation and background system #10