elves / elvish

Powerful scripting language & versatile interactive shell
https://elv.sh/
BSD 2-Clause "Simplified" License

Make a helper function to reliably handle mixed byte and value input #1577

Open · hanche opened this issue 2 years ago

hanche commented 2 years ago

Consider a simple pipeline: $sender | $receiver.

Assume that $sender outputs a mix of bytes and values, and $receiver wants to handle both, but in different ways. This seems to be rather difficult – and issue #1576 illustrates what can go wrong.

So I suggest a way around the problem. There may be many other ways, but it seems to me that the most useful and flexible solution is a function that is transparent to byte input, and invokes a callback on each value.

Let's call this hypothetical function each-value for the sake of discussing it. It could be used as follows:

$sender | each-value $value-callback | $receiver

Then: each-value would read its input, pass any byte input unchanged to its output, and each time it sees a value in the input stream, it calls $value-callback with that value as its only argument. The callback could do something simple, such as collecting all the values in an array, or it could cooperate with $receiver in some more sophisticated way. The documentation should make it clear that the asynchronous nature of this process means that the relative order in which byte output and values arrive is not defined, so caveat user.
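
For what it's worth, something close to this can already be approximated with the existing run-parallel builtin, whose callables share the surrounding pipeline's ports, so one callable can drain the value stream while the other copies the bytes. A rough sketch, not a real each-value (cat here is the external command):

$sender | run-parallel {
    # drain the value stream, invoking the callback once per value;
    # whatever the callback outputs flows on to $receiver
    each $value-callback
} {
    # copy the byte stream through untouched
    cat
} | $receiver

The ordering caveat above applies here too: the two callables run concurrently, so the interleaving of bytes and callback invocations is not defined.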

As for feasibility, I point to the top level of Elvish itself, which does indeed send byte output straight to the terminal while handling value output separately, for easy reading by the user.

Of course, this is probably not the only solution to the general problem. Perhaps it is not the best solution, either.

hanche commented 2 years ago

There are other ways of achieving a similar goal. Here are two I just thought of:

The first is a function run-with-values-and-bytes taking two arguments, both of them functions. The two functions run in parallel: all values arriving on stdin are passed to the stdin of one function, and all bytes go to the other. So long as both functions actually read their inputs, this should be guaranteed not to hang. Or, more ambitiously, run-with-values-and-bytes could buffer incoming values and bytes as needed, only failing if the process runs out of memory. In practice, the two readers might wish to interact. If no extra buffering is taking place, the onus is on the user to avoid deadlocks; but in that case the responsibility is explicit and fairly obvious, leading to fewer nasty surprises.
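
Assuming, as above, that the existing run-parallel builtin gives both callables the pipeline's ports, a minimal sketch of this first variant (without any extra buffering; the function name is as proposed, and cat is the external command) might look like:

fn run-with-values-and-bytes {|value-fn byte-fn|
    run-parallel {
        # forward only the value stream into $value-fn's stdin
        all | $value-fn
    } {
        # forward only the byte stream into $byte-fn's stdin
        cat | $byte-fn
    }
}

# usage sketch: collect the values while counting the bytes;
# the two echoes run concurrently, so their order is not defined
{ put a; print xyz; put b } | run-with-values-and-bytes {
    echo values: (all)
} {
    echo (count (slurp)) bytes
}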

An alternative is to make a pair of functions that shunt either values or bytes off to a side pipe, typically created by file:pipe. So, e.g., send-values-to $pipe[w] would write its value input to $pipe and pass bytes through to stdout; send-bytes-to would do the opposite.
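
A sketch of how send-values-to might look under the same run-parallel assumption, serializing the values as JSON so they survive the byte-oriented side pipe (send-values-to is the proposed name, not an existing builtin):

use file

fn send-values-to {|w|
    run-parallel {
        # divert the value stream into the side pipe
        all | to-json > $w
    } {
        # pass the byte stream through to stdout
        cat
    }
}

# usage sketch; as noted above, the onus is on the user to drain
# $p[r] before the OS pipe buffer fills up
var p = (file:pipe)
var bytes = ({ put a; print xyz; put b } | send-values-to $p[w] | slurp)
file:close $p[w]
var vals = [(from-json < $p[r])]
file:close $p[r]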

krader1961 commented 2 years ago

@hanche: I'm perplexed by the assertion that this proposal is a way to solve the problem documented in issue #1576. The addition of an each-value (and presumably each-bytes) function might be useful, but it is not necessary to solve the deadlock documented in that issue. You're simply moving the need to consume the value stream from the RHS of put x | slurp to an intermediate step, as in put x | each-value {|v| nop } | slurp, as opposed to just doing put x | { slurp; set _ = (all) }. This proposal still requires the user to explicitly consume the value stream to avoid a deadlock.

hanche commented 2 years ago

@krader1961

For sure, the easiest way to resolve #1576 is likely to replace the fixed-size value buffer with one that can grow without bound until the process runs out of memory. But is that wise? I don't know.

But consider this:

range $large-number | each {|_| print x; put y } | consume

There is currently no way (correct me if I'm wrong) for the consume part of the pipeline to get at the byte part and the value part of its input separately. With a potentially unbounded value buffer, the { slurp; all } idiom would work, but as we know, it currently fails if $large-number is greater than 32. And even if it could be made to work, it can still be problematic that we have to wait for the writer to finish before we get the values. Think billions of bytes and millions of values, if you wish.

I think of an Elvish pipe as two separate pipes: one for bytes, one for values. They are pretty much independent. It should come as no surprise that connecting two processes by two pipes is a recipe for deadlocks unless the programmer is careful to avoid them. I am looking for tools to help the careful programmer deal with this problem. It will still not be easy (there are no magic bullets!), but at least it should be made possible.

Granted, for every case I have come across personally, the { slurp; all } method will be sufficient if it can be made to work reliably. Dealing with long-running pipelines full of intermixed bytes and values is just not something I have ever needed, nor am I likely to need it, and I could find other ways if it ever became an issue; so for me personally, this could just as well be dropped.

The bottom line, though, is that there is a footgun here. It cannot be eliminated completely without abandoning the current stream concept (an absolute no-no).

But come to think of it, there is yet another possibility: extend the notion of redirection to allow redirecting only the value stream (or only the byte stream).
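
Purely for illustration (none of this syntax exists in Elvish; it is made up for the sake of the example), one could imagine something along the lines of:

# hypothetical value-only redirection; made-up syntax
$sender >&values $p[w] | $receiver

so that only the values are diverted into the side pipe while the bytes continue down the pipeline.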

krader1961 commented 2 years ago

> For sure, the easiest way to resolve https://github.com/elves/elvish/issues/1576 is likely to replace the fixed-size value buffer with one that can grow without bound until the process runs out of memory. But is that wise? I don't know.

That is most definitely not a viable solution, and not just because Go channels (which underlie the Elvish value stream) do not support dynamically resizing the buffer of the channel. That could be circumvented by simply creating the channel with an enormous buffer size (e.g., 2^32). However, that doesn't solve the problem; it just punts it down the road.

> Granted, for every case I have come across personally, the { slurp; all } method will be sufficient if it can be made to work reliably. Dealing with long-running pipelines full of intermixed bytes and values is just not something I have ever needed, nor am I likely to need it, and I could find other ways if it ever became an issue; so for me personally, this could just as well be dropped.

That has been my experience as well. However, here is a simple example that shows how easy it is to induce a deadlock that is likely to surprise Elvish users:

var num_vals = 32 # use 33 to produce a "hang"
var num_byte = 1
{
    for x [(range $num_vals)] { put $x }
    for _ [(range $num_byte)] { print x }
} | { slurp; take 1 }

Simply swapping slurp; take 1 for take 1; slurp on the RHS with num_vals = 33 does not hang. So I am definitely of the opinion that some changes are needed to make such deadlocks less likely. I don't know if an each-value function is the best solution, but it is certainly one way to resolve this problem.
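
For completeness, the deadlock in the example above can also be avoided today by draining both streams in parallel, at the cost of spelling it out by hand. A sketch using the existing run-parallel builtin (not something proposed in this thread):

var num_vals = 33
{
    for x [(range $num_vals)] { put $x }
    print x
} | run-parallel {
    # consume the value stream
    put [(all)]
} {
    # consume the byte stream
    slurp
}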