to-nums and from-nums: handling binary I/O

notcancername commented 1 year ago

Rationale

Traditional shells don't do binary I/O. Elvish should fix this. This PR will add two builtins, to-nums which converts from the binary input stream to the value output stream as bytes, and from-nums, which does the reverse. Both work, but aren't production ready yet. I apologize for the (lack of) code quality, this is my first time writing Go. Please tell me if something doesn't seem right.

TODO

[x] eliminate heap allocation with make
[ ] make both accept a maximum parameter
[ ] write documentation
[x] proper error handling

[ ] fix not responding to ^C when reading from the tty

Example

> print 'amogus'"\n" | to-nums
▶ (num 97)
▶ (num 109)
▶ (num 111)
▶ (num 103)
▶ (num 117)
▶ (num 115)
▶ (num 10)
> put (num 115) (num 117) (num 115) (num 10) | from-nums
sus

krader1961 commented 1 year ago

fix not responding to ^C when reading from the tty

Known issue you definitely should not attempt to fix as part of this change.

write documentation proper error handling

These both need to be part of any change; not a todo.

make both accept a maximum parameter

I think this should be part of the initial implementation and, for consistency with other commands (e.g., https://elv.sh/ref/str.html#str:replace), should be a &max option that defaults to -1 to mean no limit. TBD is what an explicit zero limit means.

eliminate heap allocation with make

I don't grok that todo. Whether or not the make function is used is mostly irrelevant to whether an object is stack or heap allocated. See https://go.dev/doc/faq#stack_or_heap and numerous blog posts and stackoverflow questions. :-)

krader1961 commented 1 year ago

Regarding the heap allocation todo I see you just pushed commit 03fb8a5. See this from the Go FAQ:

What's the difference between new and make? In short: new allocates memory, while make initializes the slice, map, and channel types.

See the relevant section of Effective Go for more details.

In short, I don't see any reason why using either form will affect whether the structure is instantiated on the heap versus the goroutine stack. They should be equivalent so it should be strictly an issue of clarity and consistency. This also seems like a micro optimization that should probably be avoided until it is shown to be a performance issue even if the original formulation, using make(), was proven to allocate the structure on the heap and the new form puts it on the stack.

krader1961 commented 1 year ago

Your from-nums is logically identical to str:from-codepoints; albeit with the difference of a binary (num) versus string (hexadecimal) representation. I think there are important questions regarding how your new commands relate to the legacy commands in the str module.

I haven't yet looked at your implementation to see if it correctly handles values outside the range [0, 255] but that is obviously important, and requires defining what should happen when from-nums is handed values outside that range.

krader1961 commented 1 year ago

Your from-nums is logically identical to str:from-codepoints...

I have no idea why I wrote that since it is obviously not true except for values in the range [0, 127]. Nonetheless, the similarity of the existing commands in the str module and your new commands suggests there should be some discussion about the APIs for manipulating "binary" data; including UTF-8 sequences. I'd like to see an issue opened to discuss the implementation of new commands such as proposed in this pull-request rather than simply merging this enhancement. My worry is that this sort of change introduces a "kitchen sink" approach to adding new commands rather than a well thought out API.

Note that I do agree there should be better support for manipulating "binary" data. However, the definition of "binary data" as used in this change is rather narrow. What about integers rather than bytes? What about integers with different endianess than used by the platform Elvish is running on? What about other binary values such as floating point values in the allowable IEEE 754 standard when serialized?

krader1961 commented 1 year ago

I'm going to open an issue because I think this requires a lot more discussion before new commands are implemented. I'll note, however, that your to-nums is equivalent to this:

> str:to-utf8-bytes 你好 | each {|x| num $x }
▶ (num 228)
▶ (num 189)
▶ (num 160)
▶ (num 229)
▶ (num 165)
▶ (num 189)

Why the existing str:to-utf8-bytes uses a hexadecimal string representation for its output is an open question. The closest thing to an answer is contained in this issue. Also, str:from-utf8-bytes requires a valid UTF-8 sequence but I can't see a good reason why that restriction is necessary (it should at least should be optional). Its inverse, str:to-utf8-bytes, does not require a valid UTF-8 sequence:

> str:to-utf8-bytes "\xF0\xFF你"
▶ 0xf0
▶ 0xff
▶ 0xe4
▶ 0xbd
▶ 0xa0

notcancername commented 1 year ago

Thanks for the plentiful feedback.

These both need to be part of any change; not a todo.

I was more meaning for this to be a kind of roadmap to completion, not something to deal with in some indefinite future. This is a draft, after all.

I don't grok that todo. Whether or not the make function is used is mostly irrelevant to whether an object is stack or heap allocated. See https://go.dev/doc/faq#stack_or_heap and numerous blog posts and stackoverflow questions. :-)

My bad. I mistook make for some kind of malloc-equivalent.

I haven't yet looked at your implementation to see if it correctly handles values outside the range [0, 255] but that is obviously important, and requires defining what should happen when from-nums is handed values outside that range.

These are now handled by returning an error.

krader1961 commented 1 year ago

@notcancername: FWIW, do keep in mind issue #1653 that I opened a few hours ago. I think your proposed enhancement is worthwhile, and your draft implementation is a step in the right direction. I just think more discussion is required before we merge a change that implements new commands for handling "binary data". In part because of the fuzzy definition of "binary data", but mostly because we already have two commands which mostly do what your change implements. I'd prefer to see the existing two commands deprecated (and eventually removed) in favor of a more flexible mechanism similar to what you are implementing in this change. Obviously I am not speaking for the project owner, @xiaq, who may have a different opinion. I'm simply cautioning against investing more work on this change until there is a consensus on the path forward.

krader1961 commented 1 year ago

@notcancername, You wrote

I was more meaning for this to be a kind of roadmap to completion, not something to deal with in some indefinite future. This is a draft, after all.

FWIW, there really isn't anything like a "draft" pull-request. A pull-request is, by definition, a request to merge a change. A tentative change is better addressed by posting a link to a Github branch you have published in the relevant issue. You can't really say "this is a work in progress" while also asking that it be merged. Also, while you wrote "Both work, but aren't production ready yet." that is not at all obvious your pull-request is not meant to be merged in its current state.

notcancername commented 1 year ago

FWIW, there really isn't anything like a "draft" pull-request. A pull-request is, by definition, a request to merge a change.

Draft pull requests can't be merged. They are not intended to be merged. They can become proper pull requests, which are intended for merging.

A tentative change is better addressed by posting a link to a Github branch you have published in the relevant issue.

GitHub's draft pull requests address this issue better.

Also, while you wrote "Both work, but aren't production ready yet." that is not at all [making] obvious your pull-request is not meant to be merged in its current state.

The "draft" functionality of GitHub already takes care of this.

krader1961 commented 1 year ago

FWIW, @notcancername, I have never used, or seen used, the Github "draft" pull-request feature (and I've been contributing to projects on GitHub for many years). I had to take another look at this PR, in response to your comment, to notice the "Draft" icon. And that icon was immediately followed by this text: "notcancername wants to merge 3 commits into elves:master from notcancername:bytes-nums". Which implies you want it merged as is. I am definitely ambivalent about Github's "draft PR" feature given this interaction with it. I accept that you are making a good faith attempt at saying "please look at this change and provide feedback but do not merge it". However, I still feel that it is simpler, and clearer, to just state in the PR comment that the proposed change should not be merged and was only created to open a dialog for how something like this might be implemented.

Furthermore, I strongly feel that changing the functionality of Elvish in non-trivial ways needs more discussion than a couple of comments on the instant messaging channels. Which is why I opened issue #1653.

xiaq commented 1 year ago

These functions seem useful. The various functions in str: operate on individual string values, but these are streaming functions. As much as I'd like to have functions that operate both on byte streams and individual string values, I've yet to find a good design so this addition is reasonable for now.

I will take a deeper look at this later on (some recent events at work means I don't have a lot of energy to spend on Elvish right now), but I've said in https://github.com/elves/elvish/blob/master/CONTRIBUTING.md that it's OK to open a PR and use it as a venue for discussion, as long as it is acknowledged that there's a higher risk of such PRs getting rejected.

I don't intend to regulate where discussions exactly happen as long as it's reasonable easy for me to find, and PRs are as easy to find as issues.

I also don't keep a very high bar for merging PRs these days; I would usually merge as long as I agree with the overall design and the places that I would like to revise are relatively straightforward.

krader1961 commented 1 year ago

These functions seem useful.

I agree that these functions are useful. Which is why I opened issue #1653. My disagreement is that this proposal does not even acknowledge the existence of the existing str: module functions that are roughly equivalent; let alone explain why both should be retained. This is a good example of a useful proposed enhancement that needs a lot of discussion before any change is merged.

krader1961 commented 1 year ago

various functions in str: operate on individual string values, but these are streaming functions. As much as I'd like to have functions that operate both on byte streams and individual string values, I've yet to find a good design so this addition is reasonable for now.

I don't really understand why https://elv.sh/ref/str.html#str:from-utf8-bytes doesn't employ the inputs? idiom. This change can replace str:from-utf8-bytes if it includes something like a &utf8=$true option to raise an exception if the input is not valid UTF-8.

xiaq commented 1 year ago

I don't really understand why https://elv.sh/ref/str.html#str:from-utf8-bytes doesn't employ the inputs? idiom. This change can replace str:from-utf8-bytes if it includes something like a &utf8=$true option to raise an exception if the input is not valid UTF-8.

The inputs? idiom applies to functions with an optional final list argument. There's a valid analog here for an optional final string argument, but it's something new, not the same as the existing inputs? idiom.

Establishing such a pattern is an interesting proposal in itself, but has some pretty far-reaching consequences.

First, functions that currently only take bytes inputs should also take a final string argument, and not read bytes inputs when the argument is given. This affects the following groups of builtin functions:

from-json, from-lines, from-terminated
read-line, read-upto
slurp

For the from-* functions, supporting for example from-json '["foo", "bar"]' is an unambiguous improvements. For the remaining ones, the benefits are dubious but it seems harmless if a little bit confusing.

Second - and this is the more interesting aspect - functions that currently only take a final string argument as its "main input" (f ... $str) should also take bytes inputs when that argument is not given. Aside from str:to-utf8-bytes, this also affects str:to-codepoints and all the str: functions whose "subject" comes last [1], which includes str:replace, str:split, str:title, str:to-lower, str:to-title, str:to-upper and str:trim-space.

There's some merit to this: this allows one to use these functions to process byte streams (str:to-upper instead of tr a-z A-Z, for example). But this change is not enough in its own:

All of these functions, except str:to-utf8-bytes, str:to-codepoints and str:split, currently output a single string value, rather than writing bytes. This results in a weird asymmetry: in some-external-command | str:to-upper, str:to-upper takes bytes inputs and turns that into a string value.

This probably means that these functions should also output bytes when the inputs are in bytes. This is fine for the affected functions per se, but it creates inconsistency. One has to remember that, say, str:split always outputs values regardless of how the input is given, but str:to-upper can output values or bytes depending on how the input is given.
Making the str:to-upper in some-external-command | str:to-upper read and write bytes has another consequences though: there will now be an expectation that str:to-upper processes its input in a streaming way, writing outputs as soon as possible there're enough inputs.

This means that these functions can no longer use the implementations provided by Go's strings package, because the latter operates on "whole" strings. This is doable for str:to-upper, but it can get quite tricky for, say str:replace.

I'm not sure how I feel about the overall value of adopting this pattern yet. There are pros and cons and I'll need to consider more to make up my mind.

[1] This includes all the unary functions and some higher-arity functions. Unfortunately, when a function in str: takes 2 or more arguments, the "main input" usually comes first rather than last. One can bend the pattern further to say that the first string argument can be substituted by bytes inputs, but there's no need for that sort of complication for now.

Some additional thoughts, more relevant for this PR itself.

The idiom of letting an optional final string argument replace bytes inputs (and vice versa) is not relevant to from-nums. There is an analog with the existing str:from-utf8-bytes, but the difference between from-nums and str:from-utf8-bytes is how the output is written, not how the input is read. So from-nums is still needed.

I can see a potential proposal of treating bytes outputs and a single string value output in some cases, but again, that's another issue and it's much more complicated than the input question.
The names {from to}-nums are probably better changed to {from to}-bytes, because "num" doesn't really say what these numbers are.

However, both naming choices result in an awkward inconsistency with the existing from-* and to-* functions: the existing from-* functions all convert bytes to values, and to-* in the other direction. But from-bytes convert values to bytes, and to-bytes converts bytes to values.

This may seem to be a relatively minor point, but naming is important. If there's no way to square the circle, it's probably best to avoid the from- and to- altogether in these new commands.

I don't have a conclusion yet. I'll keep the PR open for now but I'll need to rethink and revisit in a while.

krader1961 commented 1 year ago

The inputs? idiom applies to functions with an optional final list argument.

I understand that. What I don't understand is why str:from-utf8-bytes doesn't use that pattern. It seems to me that it should either take an explicit list or read from its value stream. The str:to-utf8-bytes case is different. It is not at all obvious the two commands are inverses of each other vis-a-vis their API. I need to spend more time thinking about the rest of your reply but at this juncture I am still of the opinion that str:from-utf8-bytes, or a replacement that optionally enforces UTF-8 semantics, should use the inputs? idiom.

xiaq commented 1 year ago

The inputs? idiom applies to functions with an optional final list argument.

I understand that. What I don't understand is why str:from-utf8-bytes doesn't use that pattern. It seems to me that it should either take an explicit list or read from its value stream. ...

Oh sure, I misread. str:from-utf8-bytes and str:from-codepoints should use that pattern. Seems to be just something that slipped under the radar.

krader1961 commented 1 year ago

The names {from to}-nums are probably better changed to {from to}-bytes, because "num" doesn't really say what these numbers are.

Agreed. In fact, that was one of my earlier points since "num" in this context could mean 8, 16, 32, 64 bit (signed/unsigned, little/big endian) ints, floats, etc. Which is why this sort of change needs a lot of discussion and why I opened issue #1653. It is also why I am opposed to merging this change in its current form since it seems highly likely that its commands will be deprecated sooner rather than later.

Sorry, @notcancername. While I appreciate you sparking a discussion about how Elvish could be changed to make handling "binary" data easier I feel you were premature in asking a change to the Elvish builtins be merged.

elves / elvish