I/O module: adjustments to read methods

chapel-lang / chapel

a Productive Parallel Programming Language

https://chapel-lang.org

Other

1.79k stars 421 forks source link

I/O module: adjustments to read methods #19498

Closed mppf closed 1 year ago

mppf commented 2 years ago

Today we have these:

proc channel.read(ref args...) : bool throws
proc channel.read(type t) : t throws
proc channel.read(type t...) : tuple throws

In [#18496] discussed if we want both type and return-through-argument versions
- Seems that we want both forms because return-through-argument can return bool & is easier for loops
‘read(string)’ today just reads one word but that’s a common stumbling block [#7952, #15844]
‘read(myClass)’ fails if ‘myClass’ is ‘nil’ but arguably should allocate a class [#7950]

Proposal:

proc reader.read(out args..., decoder=reader.decoder()) : bool throws
proc reader.read(type t, decoder=reader.decoder()) : t throws
proc reader.read(type t..., decoder=reader.decoder()) : tuple throws

Keep both type and non-type forms
Use ‘out’ instead of ‘ref’ (requires also moving away from ‘ioLiteral’ and ‘ioNewline’) & allocate class instances
- Note: this will require addressing https://github.com/chapel-lang/chapel/issues/17198 -- otherwise implementation won't be able to behave differently for read(myint) vs read(mystring)
- including using out for top-level read and readln functions
With the default Decoder:
- ‘read(string)’ or ‘read(myString)’ should read until EOF and same for bytes
- (works with the default Encoder where e.g. ‘write(“Hello World”)’ will not output quotes)
- a read call containing a string as anything other than the last argument will throw
The ‘out’ version will return ‘false’ and default initialize the arguments if it reached EOF
It will produce a compilation error if any argument type is not default initializeable
It will return ‘true’ for success or throw if there was an error, including a partial read before EOF
The ‘type t’ version will throw if it reaches EOF

bradcray commented 2 years ago

‘read(string)’ or ‘read(myString)’ should read until EOF and same for bytes

I obviously missed the discussion, but this seems like a bad and error-prone idea to me. HPC tends to deal with lots of massive files, and it seems crazy to create such a simple way to accidentally read a massive amount of data into a string, likely hitting an OOM along the way. While I'd be inclined to support ways to read an entire file into a string, bytes, or array, I'd propose doing it with an explicit readFile() routine (symmetric to a readline() or readLine()) rather than via something that I might learn about in a programming 101 class and try to write in Chapel:

writeln("What is your first name?")
var name: string;
read(name);  // error: rather than reading my name and going on, it blocks until an EOF is entered
writeln("What is your age?");
var age: int;
read(age);

Is there precedent for this interpretation in other languages? Is it one that people would miss if we required them to use readFile() or the like instead?

I suspect the motivation for this proposal is "We can't guess where the written string originally ended, so what else can we do but read everything?" But that seems strictly worse than our current behavior given that either can be surprising if you don't know what's happening, where I think I'd prefer to be surprised by having my read consume less than I expected them (potentially way) more. It seems like the safer choice between the extremes to me.

Moreover, the proposal seems to make cases in which strings don't contain whitespace (which I think are not completely uncommon) more complex. I.e., if I have a program that doesn't use any strings with spaces:

writeln("Hi");
writeln("Michael");
writeln(2022);

I can read it back using reads in a straightforward manner:

var greeting, name: string, year: int;
read(greeting);
read(name);
read(year);
// or read(greeting, name, year);
// or read(string, string, int);

whereas under the proposed change, I have to use readString() or readline() to handle such a case (I think?). Which is obviously valid, but seems like overkill to require for cases like the above to me.

All this makes me wonder what approach other popular languages take for deciding when to stop reading a string.

Without doing the research, and if people objected to the current behavior (which I'm fine with), I'd be inclined to propose that all read() overloads take a stringDelimiter=... argument that indicates what marker(s) is/are used to stop reading strings (where I'd imagine the default to be something like whitespace or " \t\n" in order to preserve backwards compatibility).

mppf commented 2 years ago

HPC tends to deal with lots of massive files, and it seems crazy to create such a simple way to accidentally read a massive amount of data into a string, likely hitting an OOM along the way. While I'd be inclined to support ways to read an entire file into a string, bytes, or array, I'd propose doing it with an explicit readFile() routine (symmetric to a readline() or readLine())

This is a good point. However I don't think this issue is unique to Chapel or HPC.

Is there precedent for this interpretation in other languages? Is it one that people would miss if we required them to use readFile() or the like instead?

Python does it this way -- see https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects

To read a file’s contents, call f.read(size), which reads some quantity of data and returns it as a string (in text mode) or bytes object (in binary mode). size is an optional numeric argument. When size is omitted or negative, the entire contents of the file will be read and returned; it’s your problem if the file is twice as large as your machine’s memory.

I suspect the motivation for this proposal is "We can't guess where the written string originally ended, so what else can we do but read everything?" But that seems strictly worse than our current behavior given that either can be surprising if you don't know what's happening, where I think I'd prefer to be surprised by having my read consume less than I expected them (potentially way) more. It seems like the safer choice between the extremes to me.

I disagree about it being worse. I think that if you test your program at all you are likely to find a problem where it reads the whole file where you expected it to read one word. But, in contrast, a bug where it reads one word but you expected something else (a whole line? a file? everything up to a colon?) will only be discovered if the input happens to include a space.

All this makes me wonder what approach other popular languages take for deciding when to stop reading a string.

Python read without length reads the entire file https://docs.python.org/3/tutorial/inputoutput.html#methods-of-file-objects
Swift has some things that always read a whole file; NSInputStream you have to give a length to read https://developer.apple.com/documentation/foundation/nsinputstream/1411544-read
Rust seems to have only reading into an array and read_to_string which reads until EOF (and doesn't have any other way to limit the length) https://doc.rust-lang.org/std/io/trait.Read.html#tymethod.read
at a quick look I am only finding read into array functionality in Go. They have ReadAll, ReadAtLeast, but it seems you have to call other stuff to convert into a string. https://pkg.go.dev/io

The only precedent I know of for Chapel's "read a string" behavior is the way scanf works in C, but I don't view "read until whitespace" as nearly as objectionable in the context of a format string. Anyway, I think of scanf("%s", mystring) not as "read a string from the file" but rather some other formatted I/O thing. For one thing it is called "scan".

Without doing the research, and if people objected to the current behavior (which I'm fine with), I'd be inclined to propose that all read() overloads take a stringDelimiter=... argument that indicates what marker(s) is/are used to stop reading strings (where I'd imagine the default to be something like whitespace or " \t\n" in order to preserve backwards compatibility).

IMO here we run into challenges with the Encoder/Decoder proposal. In particular, I think we want read(myString) or read(string) to work through the Decoder, so if it's configured for JSON, it should read a quoted JSON string. At the same time, we have been saying that some I/O methods will ignore the Encoder/Decoder; so we have a way out of that. However, I don't think it makes sense for an I/O method to ignore the Encoder/Decoder only when it has certain optional arguments. (And I don't think we should make optional arguments for read that only work with the default Decoder, either).

So I would rather have read(string) always work with the Encoder/Decoder and do whatever the default there is; and then if we want to have a function to read a string while stopping on a particular character (or at a particular length) then we have new functions to do that. That is what #19496 is proposing - so put another way, I think read(string) should always do some default thing (based on the Decoder) but that readUtf8 / readString / whatever we end up with in #19496 could have an optional argument to indicate a delimiter.

bradcray commented 2 years ago

However I don't think this issue is unique to Chapel or HPC.

Perhaps, though my experience is that running out of memory seems to result in worse problems on HPC systems than conventional ones (killing programs without saying why, bringing entire machines down). Would you disagree, or have I just been horribly unlucky?

And if reading a single file into the memories of multiple compute nodes, it seems as though running on an HPC system (or, more generally, a distributed memory system) would potentially use bigger files than shared memory systems (though I guess desktop systems have the benefit of virtual memory).

IMO here we run into challenges with the Encoder/Decoder proposal.

Good point, and now I see that the places where things were rubbing me wrong were more in what the default decoder would do than anything specific to the routines themselves (which only makes it better in that someone could always use a different decoder if they felt unhappy with our choice; but only epsilon easier for us to resolve since we still have to pick what those defaults are).

Python does it this way

That makes me want to weep. And I take it people use this form all the time?

Something not immediately clear to me: Does the decoder get to say what a 0-argument read does? Does it get to control the return type, or must it be string? And in cases where it does return string, distinct decoders could presumably return different strings for a specific file?

FWIW, I don't find it as objectionable to have reader.read() read the entire file and return it as a string as I do having reader.read(string) (a) read the whole file and (b) not match the behavior of readf("%t", myString); (where the second of these is perhaps almost more objectionable to me). I don't know if that compromise option is technically possible or palatable to you, but if the default decoder could have that behavior, I'd be pretty happy with it I think.

Sanity check which I should probably know / am probably forgetting: Does readf("%t", myString); go through the decoder as well?

mppf commented 2 years ago

Adding links to a few tidbits from discussion on another issue:

Pascal read(myString) will read a line and leave the newline in the input (see https://github.com/chapel-lang/chapel/issues/19495#issuecomment-1090244673 for an example)
15844 is an issue where the current read-until-whitespace behavior was confusing to somebody

mppf commented 2 years ago

However I don't think this issue is unique to Chapel or HPC.

Perhaps, though my experience is that running out of memory seems to result in worse problems on HPC systems than conventional ones (killing programs without saying why, bringing entire machines down). Would you disagree, or have I just been horribly unlucky?

Only that it is perhaps more likely to be a shared resource. I've had plenty of problems on my own machine from running out of memory.

And if reading a single file into the memories of multiple compute nodes, it seems as though running on an HPC system (or, more generally, a distributed memory system) would potentially use bigger files than shared memory systems (though I guess desktop systems have the benefit of virtual memory).

In practice using virtual memory on a desktop system can make it unusable.

Python does it this way

That makes me want to weep. And I take it people use this form all the time?

I am not an authority on that part but it does come up in S.O. answers (e.g. https://stackoverflow.com/questions/53204752/how-do-i-read-a-text-file-as-a-string ) so it is at the very least regularly recommended.

Something not immediately clear to me: Does the decoder get to say what a 0-argument read does?

I think we could arrange for the decoder to take some action on a 0-argument read.

Does it get to control the return type, or must it be string?

I don't think the Decoder can influence the return type. If you have read(string) then I think it should always return string, regardless of Decoder.

BTW this discussion points out to me that we can't change read from ref to out without addressing #17198 - because for read we want for the call site, not the called function, to determine the types of the returned out values.

And in cases where it does return string, distinct decoders could presumably return different strings for a specific file?

Yes, absolutely. I have in mind a default decoder and a JSON decoder, as examples. With input like "some words in quotes" on a read(string) a JSON decoder on would return the string containing some words in quotes but the default decoder would return something else (maybe the whole line; maybe the whole file; maybe the first word - as we are discussing here).

FWIW, I don't find it as objectionable to have reader.read() read the entire file and return it as a string as I do having reader.read(string) (a) read the whole file and (b) not match the behavior of readf("%t", myString); (where the second of these is perhaps almost more objectionable to me). I don't know if that compromise option is technically possible or palatable to you, but if the default decoder could have that behavior, I'd be pretty happy with it I think.

Sanity check which I should probably know / am probably forgetting: Does readf("%t", myString); go through the decoder as well?

Yes and I am expecting that readf(%t", myString) will have exactly the same behavior as read(myString). This might not have been obvious. So, with the above proposal, readf(%t", myString) would also read a whole file, with the default decoder. Of course the formatted I/O system has other ways of saying more about reading a string (I can't remember off hand which of those we plan to deprecate - but I would expect that it is the job of %s format strings to allow one to specify these things).

In terms of what reader.read(string) does, for something like JSON it seems pretty clear to me what it should do (and it doesn't sound like that is a point of contention here). So we are just talking about what the default Decoder should do. Taking a broad viewpoint, I can think of these options:

read until any whitespace—which in practice means reading a word (what we have today, but caused confusion in #15844)
read until a newline but leave the newline in the input (what Pascal does)
read until EOF (what Python does and what is proposed above)
read one character / code point
throw an error "I don't know how much to read"

(Are you aware of other options?)

We also have the choice about whether we want reader.readln(string) to represent some kind of combination of operations or does something totally different, with the default decoder. In particular #19499 proposes that with the default decoder reads one line and does not return the newline; but in #19495 we are discussing perhaps removing readln altogether.

bradcray commented 2 years ago

Yes and I am expecting that readf(%t", myString) will have exactly the same behavior as read(myString). This might not have been obvious. So, with the above proposal, readf(%t", myString) would also read a whole file, with the default decoder.

Probably needless to say, but I'm not a fan of this, probably reflecting my C heritage.

read until EOF (what Python does and what is proposed above)

Just to make sure I'm not missing anything. Does Python have a form similar to read(string) or read(myStringVar), or is it just read() that happens to return a string that contains the whole file? (my guess since it doesn't have types/typed variables?)

If my guess is correct, I'd still very much prefer to only have read() read the whole file and for read(string) or read(myStringVar) to break on whitespace (where I'm imagining that there is not a read(type t = string) overload, but a 0-argument version and a "must pass a type" version). I'd argue that that gives Python programmers the convenience they're accustomed to while also not being surprising to C/C++ programmers (who, I'd argue, are nearly as important a constituency for us). And then support a readFile(string), readFile(bytes), readFile(myStringVar), readFile(myBytesVar), readFile(myArray) for users who want to read the whole file into a specific type or variable. I think it also gets out of the dead-end of "Once you've read a string, you can't read anything else", which doesn't really appeal to me (again, as a C programmer, and someone who is accustomed to reading bits of a file at a time without having to specify string lengths).

One other thing I was wondering about last night: Should a whole-file myFile.read() return a bytes rather than a string? Seems like it from a generality and performance perspective (not constrained to UTF-8).

mppf commented 2 years ago

read until EOF (what Python does and what is proposed above)

Just to make sure I'm not missing anything. Does Python have a form similar to read(string) or read(myStringVar), or is it just read() that happens to return a string that contains the whole file? (my guess since it doesn't have types/typed variables?)

As far as I know, read() happens to return a string/bytes and you can ask it to read a particular length.

One other thing I was wondering about last night: Should a whole-file myFile.read() return a bytes rather than a string? Seems like it from a generality and performance perspective (not constrained to UTF-8).

I would expect that there is a way to ask for bytes or to ask for a string. If the file ends up not being valid UTF-8 when you ask for string, it should throw an error.

mppf commented 2 years ago

Yes and I am expecting that readf("%t", myString) will have exactly the same behavior as read(myString). This might not have been obvious. So, with the above proposal, readf(%t", myString) would also read a whole file, with the default decoder.

Probably needless to say, but I'm not a fan of this, probably reflecting my C heritage.

I do not understand the issue here. You can have readf("%s", myString) along with the various modifiers to keep the C programmer in you (or anyone else) happy and that can be as similar to scanf as you want. readf("%t", something) is intended to be a way to integrate the regular read behavior within a format string and is not something that exists in C. By definition, it should do the same thing as read(string).

But maybe that is not the part you are complaining about—maybe you agree %t and read should be consistent but we are just arguing about what readf("%t", myString) and read(string) should do? I do not understand why the format string variant of it would cause you to lean one way or the other. My reaction is "If you want something like scanf, you should be using "%s".

If my guess is correct, I'd still very much prefer to only have read() read the whole file and for read(string) or read(myStringVar) to break on whitespace (where I'm imagining that there is not a read(type t = string) overload, but a 0-argument version and a "must pass a type" version).

I'm still really bothered by the break-on-whitespace idea for read(string). A big part of my problem with it is the implied behavior of readln(string) not reading a line. I know you have been putting of discussing #19499 until we make some progress on these other issues, but I'm unable to come to terms with read(string) breaking on any whitespace without at least having a clear direction for what will happen to readln(string). FWIW, if we actually did what Pascal does, we would not have a problem:

read(string) would read a line (and leave the newline in the input)
readln(string) would read a line and consume the newline (and not included it in the string)

I also think that having read(string) simply throw with the default decoder is not a bad choice. It would make it more obvious that, with the default decoder, you can't reliably read in the result of a write of a record like record R { var x: string; }, because the string is written without quotes, so there is no way to tell where it ends. (Note that this record could not be read back in if we choose to make read(string) read until EOF or read until newline, either). If we made read(string) throw with the default decoder, we would be in effect saying that if you want to do string input and aren't working with something like JSON, you should use other mechanisms like readline and readf("%s"). I think that would be a good thing, because those other mechanisms have closer relationships to things people would be familiar with in other languages (Python and C, specifically). Perhaps we could even arrange for it to be a compilation error, rather than a runtime error, in common cases, like when working with stdin.

Moving from an example with a record to an example with an array, if you have an array of strings, and you try to read it in, do we really want to assume that all of the strings should be one word (or for that matter, one character or one line)? Making that assumption, and leaving it as a landmine for the programmer to find when they have non-trivial input, seems worse to me than simply not supporting that operation.

I'd argue that that gives Python programmers the convenience they're accustomed to while also not being surprising to C/C++ programmers (who, I'd argue, are nearly as important a constituency for us).

I remain unconvinced that C programmers will look at read(string) and say "Oh, that obviously reads one word". IMO it has no analogue in C and it is equally reasonable for it to read one Unicode code point. Or to read until a punctuation mark.

With C++ the story is different. If you do e.g. std::cin >> mystring then indeed it will read one word (stopping on whitespace). However, this is not something I would have remembered without looking it up since I generally use C I/O facilities for reading even in C++ code. I do not know if it would trip people up or not. At a surface level, the syntax is very different between read(mystring) and stream >> mystring).

FWIW, as far as I can tell, if you ask "How do I read one word from a file in Python?" the answer is "Read a line and then use split to get the words". Meaning that Python has no I/O operation that does that specific thing. If Python didn't bother to include it, why should it be our default?

And then support a readFile(string), readFile(bytes), readFile(myStringVar), readFile(myBytesVar), readFile(myArray) for users who want to read the whole file into a specific type or variable.

I have no problem with readFile() or read() reading the whole file. But that does not help us in any way in deciding what read(string) does. The reason the OP proposal here is that it reads the whole file is that:

it is a well defined operation even though we can't know where a string ends
misunderstandings related to it are likely to be caught early in testing

Arguably having read(string) throw with the default decoder is an even stronger solution to the second problem. What do you think about it?

bradcray commented 2 years ago

Here, I'm continuing to respond to your previous comment from yesterday with some more thinking this morning (and without reading your latest yet—oh and now there are two latest). I'd previously asked:

Does the decoder get to say what a 0-argument read does?

And you suggested that "yes, it probably could" which I took at face-value yesterday but am mulling over a bit more this morning.

Does [the decoder] get to control the return type [of a 0-argument read], or must it be string?

And you effectively said all decoders will have to share the same return type, which makes sense to me. Let's say the return type is string for now (even though I proposed that maybe bytes would be better). Then I asked:

And in cases where [the 0-argument routine] does return string, distinct decoders could presumably return different strings for a specific file?

And you said:

Yes, absolutely. I have in mind a default decoder and a JSON decoder, as examples. With input like "some words in quotes" on a read(string) a JSON decoder on would return the string containing some words in quotes but the default decoder would return something else (maybe the whole line; maybe the whole file; maybe the first word - as we are discussing here).

But this focuses on the "read a string" case (where I understand and agree that different decoders might handle it differently) and not the "0-argument" / "default decoder will read the full file" case which is what I was more curious about. I suspect the reason for that may be that you're thinking of read() as being an invocation of read(type t = string) that relies on the default type, but since I'm thinking of the read(type t) version as potentially not having a default type and the read() routine as being its own thing, I feel differently about this example. Specifically, let's say that the full file had the contents:

"start"
123 456
"end"

Considering this file with a JSON decoder, it makes sense to me that read(string) would read in "start" and save it as a string storing start. And similarly, it makes sense that a C decoder (say) would read in "start" and save it as a string storing "start". And that a Python decoder might have read(string) read in the whole file as a single string.

I also imagine that a readFile(string) call for the Python or C decoder would read the whole file and return it as a string. And probably the same would be true for a JSON decoder, since the contents of the file are not a single quoted string, and we've asked it to read the whole file? [*] If so, this suggests to me either that the readFile(string) routine may not be as subject to the decoder as typed reads are, or perhaps just that many text-based decoders may implement it identically even though their typed reads might differ. I suppose I can imagine that there's some other decoder that would process the file's contents in some different way while reading the whole thing that generated a different string by computing it as it went...? But I don't have an example in mind.

[* = alternatively, I could imagine the JSON decoder saying "The whole file is not a single quoted string, so I'm going to throw an error", but that doesn't seem as helpful of an implementation of readFile(string), which is why I'm not expecting it].

Anyway, that leads me to think that it seems as though it would be similarly confusing for a 0-argument read (which again, I'm thinking of distinctly from read(string) while also trying to make it read the whole file for familiarity to Python programmers) to have dramatically different behaviors across decoders for a given file. Specifically, for code comprehension, it seems as though it should generally consume the whole file for portability across decoders, even though the string returned might be different in some cases (as suggested in the last speculative sentences of the previous paragraph). As an example, I'd definitely want the Python-style decoder's 0-argument read() of the file above to return the whole file as a string. And I would say that the C-style decoder's read() of it should also return the whole file as a string for consistency. But if the JSON decoder's read() of the file were to only return the first line as a string containing "start", that seems as though it would be too different a behavior from the other decoders to make a 0-argument read() call comprehensible in the code. So I'd also expect it to read and return the full file.

That leaves me in a similar state as readFile(string): That most (if not all) decoders should probably treat it the same way. So if I'd been answering my question:

And in cases where [the 0-argument routine] does return string, distinct decoders could presumably return different strings for a specific file?

with my current mindframe, I'd say "they could, but I imagine that most common text-based decoders will not differ in their handling of it."

mppf commented 2 years ago

For the moment, let's imagine that we add readFile(string) and readFile(bytes). I don't think this method should work with the Decoder at all. It is always well defined what it does.

Now, what if we called it read() instead? Well then we would have to make the choice between having go through the Decoder (which you discussed a bit) or having some methods named read work with the Decoder and others where the Decoder has no impact.

So, it would be my preference to simply name it readFile which is arguably clearer anyway.

bradcray commented 2 years ago

You can have readf("%s", myString) along with the various modifiers to keep the C programmer in you (or anyone else) happy ... By definition, [readf("%t", myString)] should do the same thing as read(string).

I'd go one step further and say that it should naturally be the case that read(string) == readf("%t", myString) == readf("%s", myString) for a given decoder. I don't think the last should act differently from the other two.

Specifically, I think of "%s" on our readf() as meaning "read a string, and the actual that matches this % had better be a string or else the user has made a mistake. Whereas I view "%t" as being used either to say:

"I'm in a generic programming setting, and I don't know whether the next argument will be a string, bytes, integer or whatever, so just print it in its default format because I can't/won't take the trouble to specify it precisely" or
"I've learned that in Chapel I don't have to use %i vs. %s vs. %r vs %b, which require type matching in the argument list and may lead to errors, so I just use %t because I'm lazy / it's a lot more convenient." The latter definitely describes my mindset when doing I/O a lot of the time.

To make sure I haven't missed anything: Are there other existing (or planned) cases where readf("%t", someArg) would behave differently than replacing "%t" with the type-specific % formatter? If so, that worries me. If not, then why would we want to treat strings differently in this regard?

A big part of my problem with it is the implied behavior of readln(string) not reading a line.

I think that, for now, we should continue to just pretend we're getting rid of readln(). And that when you Google "How do I read a line into a string in Chapel?" in 10 years, the SO answer you'll get back is readline(string), making for a familiar routine for Python users and an intuitive one for everyone else (again, where I'm not particularly worried about enticing Python programmers to Chapel).

I remain unconvinced that C programmers will look at read(string) and say "Oh, that obviously reads one word". IMO it has no analogue in C

Well sure, but if we teach them that read(varOfSomeType) is equivalent to readf("%t", varOfSomeType) which is equivalent to readf("%-formatter-for-this-type", varOfSomeType) then the nature of "what it means to read in a single string" is preserved and they have an even more convenient way to do it.

and it is equally reasonable for it to read one Unicode code point. Or to read until a punctuation mark.

Sure, and a given decoder could certainly do that, right? But if we don't have a compelling precedent for this choice from a popular language that we're trying to entice users from, then I don't think we should make our default decoder do this.

I'm not keen on making read(string) an error for similar reasons: If readf("%t", myString) and readf("%s", myString) are supported, it seems arbitrary to say "you can't write read(string), but can write the other two." And if you can't write any of the three, then the C programmer is unhappy. Meanwhile, if Python doesn't have a similarly commonly used equivalent to C's "read one whitespace-separated word" (which seems to be the case, based on some Googling, quick polling, and your own response), then why penalize the C programmer for wanting to use something that Python users wouldn't reach for anyway? (where my sense is that they'd reach for either readline() + split or a 0-argument, whole-file read() instead?)

I also think that when not using a non-ambiguous file format like JSON, basic text I/O is, by nature, problematic in terms of whether you can always read what you've written, and don't see this problem as being particularly specific to strings. So, taking that to be the nature of the universe, I don't feel the need to protect users from such surprises by locking them out of this capability.

bradcray commented 2 years ago

For the moment, let's imagine that we add readFile(string) and readFile(bytes). I don't think this method should work with the Decoder at all. It is always well defined what it does.

OK.

Now, what if we called it read() instead? Well then we would have to make the choice between having go through the Decoder (which you discussed a bit) or having some methods named read work with the Decoder and others where the Decoder has no impact.

I'd probably still support a 0-argument read() that bypasses the decoder and simply returns the whole file as a bytes. I think we've already got routines with the read*() prefix where some go through the decoder and others don't, so I don't see this as being all that different. I think the benefits are that it would give Python programmers what they'd expect and agree that file.read() looks and sounds like it's reading the whole file. I wouldn't insist that we support a 0-argument read, but I don't see a problem with it either.

The reason I was thinking maybe one would want it to go through a decoder is (1) that I wasn't sure whether you were imagining it would or not, so didn't want to cut off that possibility, and (2) that I was imagining it might be cool to have a GIF decoder whose full-file .read() would return an array of pixels or an a gif record or a something-else decodeer whose whole-file read would do something else. But this would require different decoders to have different return types, which didn't seem like it was in the cards.

And while I can imagine a decoder whose 0-argument reader might return a different bytes value than others depending on the file's contents, I don't have a compelling example in mind.

lydia-duncan commented 2 years ago

To make sure I haven't missed anything: Are there other existing (or planned) cases where readf("%t", someArg) would behave differently than replacing "%t" with the type-specific % formatter? If so, that worries me. If not, then why would we want to treat strings differently in this regard?

Not that I'm aware of from doing a look through the format specifiers.

I should note that I find the "this should be equivalent to readf(%s, myString) compelling.

mppf commented 2 years ago

You can have readf("%s", myString) along with the various modifiers to keep the C programmer in you (or anyone else) happy ... By definition, [readf("%t", myString)] should do the same thing as read(string).

I'd go one step further and say that it should naturally be the case that read(string) == readf("%t", myString) == readf("%s", myString) for a given decoder. I don't think the last should act differently from the other two.

I definitely do not agree.

Given that we plan Decoder support which applies to some, but not all, I/O calls; would we want readf("%s") to ignore the Decoder even as readf("%t", myString) === read(string) works with the decoder?

Let's suppose for the moment that we are working with a JSON decoder. What are some of the things that %s can do? Do they make any sense for JSON?

https://chapel-lang.org/docs/modules/standard/IO/FormattedIO.html#string-and-bytes-conversions

%17s -- when reading - read up to 17 bytes or a whitespace, whichever comes first, rounding down to whole characters
%.17s -- when reading - read exactly 17 Unicode codepoints

IMO these are things that C programmers are used to with scanf. But I would expect them to be complete nonsense if we are reading a JSON-quoted string with them. For that reason, I think that %s should completely ignore the Decoder; while %t works with it.

We also have %/<regex>/ which I think is even less reasonable as something that works through the Decoder.

To make sure I haven't missed anything: Are there other existing (or planned) cases where readf("%t", someArg) would behave differently than replacing "%t" with the type-specific % formatter? If so, that worries me. If not, then why would we want to treat strings differently in this regard?

There are many things you can read (or write) with %t that don't have single-character format string equivalents; e.g. a record or tuple or array.

I do not view it as a given that readf("%i", myinteger) goes through the Decoder at all. It is my view that probably it should not. E.g. say you have a Decoder that is some binary format (e.g. Google Protocol Buffer's wire format - but for the sake of argument let's suppose it has all integers as 8-bytes of big endian binary data). Then if you do readf("%i", i), what should happen? It is my view that readf / writef are for textual I/O, so it should try to read a textual number (e.g. 123) and not something like an 8-byte integer. (That was the justification for deprecating the binary format strings). In contrast, readf("%t", i) is designed to work with the Decoder, so necessarily must read 8-bytes of big endian integer in the example - since it should behave the same as read(i).

I'm not keen on making read(string) an error for similar reasons: If readf("%t", myString) and readf("%s", myString) are supported, it seems arbitrary to say "you can't write read(string), but can write the other two." And if you can't write any of the three, then the C programmer is unhappy.

None of these would be what I would propose. If we are making read(string) throw an error with the default decoder, then you could still write readf("%s", myString). readf("%t", myString) is just another way of writing read(string) so would also throw. If you want to read a line, you would use readline.

Meanwhile, if Python doesn't have a similarly commonly used equivalent to C's "read one whitespace-separated word" (which seems to be the case, based on some Googling, quick polling, and your own response), then why penalize the C programmer for wanting to use something that Python users wouldn't reach for anyway? (where my sense is that they'd reach for either readline() + split or a 0-argument, whole-file read() instead?)

I do not understand why supporting readf("%s", myString) qualifies as "penalizing the C programmer". It is similar to scanf and what C programmers are used to.

I also think that when not using a non-ambiguous file format like JSON, basic text I/O is, by nature, problematic in terms of whether you can always read what you've written, and don't see this problem as being particularly specific to strings. So, taking that to be the nature of the universe, I don't feel the need to protect users from such surprises by locking them out of this capability.

I am not trying to lock people out of capabilities. I think there should be a way to write the various I/O patterns. I would assume we agree about that. (Perhaps we are using different meanings of "capability" here - I would say "capability" means "you can use the I/O system to do XYZ, one way or another" but perhaps your view read(string) itself as a capability?).

However the challenge with mixing something like JSON and not-working-through-the-Encoder-Decoder I/O (which I think you called "basic text I/O" but it applies also to mixing in binary I/O) is that it needs to be clear to the user which situation they are working in.

This is the reason that:

the initial Encoder/Decoder proposal suggested that we use writer.encode and reader.decode(type) to make it super obvious which I/O calls would be working with JSON
later proposals (after the discussion lead to Encoder/Decoder working with write/read) use different method names for things that work with the Encoder/Decoder than for things that do not

As we have dug up here, readf/writef is a real challenge in this regard, since if we have %t it cannot always ignore the Encoder/Decoder. But I don't think the right answer is to make it always use the Encoder/Decoder, either. I think we have to say "Be careful using readf/writef with an Encoder/Decoder; here are what the various format specifiers do".

benharsh commented 2 years ago

Regarding the symmetry of writing and then reading the same strings, it seems potentially worth pointing out an example with %t that might provide a different perspective (sorry if this implied sans-code and I missed it):

use IO;

proc main() {
  var myString = "Hello, world!";

  var mem = openmem();
  // writes string wrapped in quotes
  mem.writer().writef("%t", myString);

  var s : string;
  // reads entire quote-wrapped string
  mem.reader().readf("%t", s);
  writeln("readf'd %t: ", s);

  // reads up to whitespace
  mem.reader().readf("%s", s);
  writeln("readf'd %s: ", s);
}

On main today the readf-%t option prints Hello, world!, whereas the readf-%s option prints "Hello,. Does this perspective of symmetry change things for anyone?

Then again, even with the options being discussed it seems there will be a symmetry issue with strings that contain whitespace, so perhaps it's a moot point?

f.writeln(myString); // Hello, world!
f.writeln(myString);

var x = f.read(string); // either ``Hello,`` or ``Hello, world!\nHello, world!\n``

I was interested in trying to get a sense of what other languages do in their standard libraries for the notion of "just read" and "read a string", if one exists. I looked at rust, golang, julia, python, and java. I'll note some relevant details below, but generally they distinguish between IO and parsing/formatted-text. I mainly saw the following kinds of capabilities:

main IO functions/methods read entire file, or
read line by line, or
read chunks into byte array

Generally users in these languages will take the string they read, and do some splitting/parsing on that string. I won't claim to have a thorough or deep understanding of the IO in these languages, but these seemed to be the predominant patterns presented to users in various tutorials and at places like stackoverflow.

There are some relevant functions or ideas from these languages:

python does either file-as-string, line-as-string, or fixed-byte-array with its standard library io methods. It doesn't seem to have a built-in "give me the next thing" capability that I could readily find in the standard library.

golang's notion of formatted IO (scanf/printf) has some relevant options for strings:

%s : reads a whitespace delimited word
%v : like our "generic" %t option, but reads just a word
%q : quoted, escaped strings

java has a Scanner object that has methods like nextInt() and nextLine(), but not nextString(). It has a simple next() method that returns the next token/word, and would act like C's %s in our context. It's also worth pointing out that a Scanner seems to be a fairly common recommendation for java IO.

rust's File::read method accepts an allocated byte array buf as an argument, and returns an integer in [0,buf.len()] indicating the amount read. It also has fs::read_into_string("path/to/my/file.txt") -> String, and fs::read("path") -> Result<Vec<u8>> that both read the entire file.

julia has a read(io::IO, T) function that should "Read a single value of type T from io", and explicitly states for read(io::IO, String) that it reads in the entire file.

Hopefully this is more useful than it is a noisy grab-bag.

mppf commented 2 years ago

On main today the readf-%t option prints Hello, world!, whereas the readf-%s option prints "Hello,. Does this perspective of symmetry change things for anyone?

Huh, I did not remember that writef("%t\n", "hello world"); will include quotes in the output. That seems odd since %t is supposed to work like just write ing that.

Thanks for the notes about the behavior of different languages.

lydia-duncan commented 2 years ago

The vote in the meeting was strongly in favor of not throwing when the default decoder is used with strings.

Brad wanted further discussion to be had on the topic of if %t should be deprecated

bradcray commented 2 years ago

Brad wanted further discussion to be had on the topic of if %t should be deprecated

Adding some detail to this, I've found a lot of value in "%t" less from the "call this type's encoder/decoder" perspective and more from the "You know what this argument's type is, so please just print it out the default way rather than making me tell you its type." This is useful both for being lazy (writef("%t", myInt);) and for generic programming contexts (writef("%t", myGenericScalarArg);). So I was proposing that rather than deprecating it completely, we either replace it with a new format string that chooses between %b, %i, %r, etc. based on the formal argument type, or repurpose "%t" to have this effect. In either case, I think the main change is that things like record types wouldn't be able to be printed with "%t" anymore.

mppf commented 2 years ago

I've created #19906 about %t specifically. Let's take any further discussion on that point to that issue.

lydia-duncan commented 1 year ago

I believe everything requested by this issue has been resolved, closing

chapel-lang / chapel

I/O module: adjustments to read methods #19498

15844 is an issue where the current read-until-whitespace behavior was confusing to somebody