I/O module: replace ioLiteral and ioNewline

mppf commented 2 years ago

Today we have records ‘ioLiteral’ and ‘ioNewline’ to differentiate from regular string reads/writes:

a string write might work with a format or iostyle to include quotes
a string read has to have some way to find end of string (e.g. looking for space, EOF, or end quote)
reading a newline/literal is more of a matching operation

This situation continues with our current Encoder/Decoder plans [18504]

read, readln, write, writeln will work with the channel’s Encoder/Decoder
readbits, writebits, other binary I/O calls will ignore the channel's Encoder/Decoder

However, ‘ioLiteral’ and ‘ioNewline’ as separate types seems unnecessarily confusing

Also interferes with desire to make ‘read’ use ‘out’ intent (discussed later)

Proposal:

Replace ‘ioLiteral’ and ‘ioNewline’ with ‘matchNewline’ ‘matchLiteral’ ‘writeNewline’ ‘writeLiteral’
These four channel methods will ignore the channel’s current Encoder/Decoder ‘matchLiteral’ will accept a string and an optional ignoreWhiteSpace=true argument
Open Q: Should ‘matchNewline’ and ‘matchLiteral’ return ‘bool’ or throw on failure?

bradcray commented 2 years ago

Would you be willing to put together an example showing these proposed features in action? I'm finding it too abstract to reason about very well.

mppf commented 2 years ago

Sure. Suppose you are working with a channel configured with a JSON encoder but you want to read some stuff that is not JSON. E.g. your input file is this:

arraydata
[1,2,3]

This is not in JSON format on its own, but [1,2,3] is a JSON array.

You could write the above file like this

var output = ... // create a channel with JSON encoding
var A:[1..3] int = 1..3;
output.writeLiteral("arraydata"); // writes literally arraydata, ignoring Encoder
output.writeNewline(); // writes the newline, ignoring Encoder
output.write(A); // writes [1,2,3]

Note in the above, since the channel is configured for JSON encoding, simply writing

output.write("arraydata");

would produce "arraydata" (i.e. with the quotes) since that is how JSON encodes strings.

As a result, writeLiteral / writeNewline will be useful when implementing the JSON encoder itself.

As we have talked about in other issues, I do expect to be able to choose a different Encoder/Decoder for specific write calls. However I'd like to have these more purpose-built functions available. For one thing, that way, when writing the JSON Encoder, it doesn't have to use the default encoder internally.

Now a reading example:

var input = ... // create a channel with JSON decoding
var A:[1..3] int;
if input.matchLiteral("arraydata") &&
   input.matchNewline() {
  // this block runs if the next data in input is literally arraydata followed by newline
  input.read(A); // reads [1,2,3]
}

bradcray commented 2 years ago

Great, thanks. I'm a fan of the general approach, though I'd prefer to replace match with read as the prefix for the reader versions. I'm guessing that the decision to use match is because nothing is being returned to the caller, as with most of our other read-related routines (is that right?). I'd stick with read to maintain the read/write duality in these routines, and because I think of reading as not necessarily being related to returning values so much as consuming data from a channel, advancing the offset, etc.

mppf commented 2 years ago

Re readLiteral vs matchLiteral, actually there are two reasons we might not want readLiteral:

It doesn't return anything (as you said)
I'd imagine we would want matchLiteral to return false if there was input but it didn't match the literal. But, reading (say) an int when there is just a name would result in a FormatError being raised rather than returning false, since we have read functions returning false only for EOF.

I think I'd rather have matchLiteral and have it return false if the literal wasn't found (instead of having readLiteral and having it return false only on EOF and throw if the literal wasn't found).

e-kayrakli commented 2 years ago

It doesn't return anything (as you said)

Can it return the literal it just read? Which is redundant, but consistent.

I'd imagine we would want matchLiteral to return false if there was input but it didn't match the literal. But, reading (say) an int when there is just a name would result in a FormatError being raised rather than returning false, since we have read functions returning false only for EOF.

My 2 cents: for this method in question it sounds OK to me if it throws a FormatError if the literal doesn't match rather than returning false. But I am not familiar with the use cases of these methods to argue strongly about it.

mppf commented 2 years ago

Since it's reading a literal, we already know what it will read. So the only useful thing it can do is to tell you if it was able to read the literal or not. Hence my preference for it to return false rather than throw if there was something other than the literal at that point in the file.

mppf commented 2 years ago

In terms of an example that would use readLiteral, suppose you are writing a CSV reader that reads integers into a list. You read an integer, then you need to check if there is a comma, or if it is the end of the line. If there is a comma, you read another integer. This pattern would be really awkward with try/catch.

benharsh commented 2 years ago

This may be tangential enough for its own issue, and may not be relevant for stabilization, but:

Would it make sense to have a kind of reader.consume(type T) and/or reader.consume(const x) method that works with the decoder and drops the read value on the floor? In a world with interfaces, we could have a Consumable that requires the implementation of proc Self.consumeThis(r : reader), or perhaps some kind of overload on readThis.

The ability to write r.consume(new Foo(...)) seems like it might come in handy.

Or the ability to write r.consume(Foo) if we're reading a bunch of different types from a file, and don't care about bringing Foos into memory.

mppf commented 2 years ago

Regarding reader.consume, I'm not really understanding why it's needed, when we already have reader.read(myThing) that could do it. Is the concern just that since read accepts something by ref, we cant write something likereader.read(new Foo())? This does not bother me except for the case of literals and newlines, which I think deserve their own method. Which is the purpose of this issue - we want, rather thanreader.read(new ioLiteral(","))to havereader.readLiteral(",")or maybereader.matchLiteral(","). I think this is a good argument for why readLiteral/matchLiteral should return bool.

So, IMO anyway, it's pretty clear what the signatures of these methods should be. What I think needs more discussion is which names we would like:

Option 1: matchLiteral / matchNewline 👍

proc reader.matchLiteral(lit: string): bool throws;
proc reader.matchLiteral(lit: bytes): bool throws;
proc reader.matchNewline(): bool throws;

Option 2: readLiteral / readNewline 🎉

proc reader.readLiteral(lit: string): bool throws;
proc reader.readLiteral(lit: bytes): bool throws;
proc reader.readNewline(): bool throws;

Option 3: consumeLiteral / consumeNewline 🚀

proc reader.consumeLiteral(lit: string): bool throws;
proc reader.consumeLiteral(lit: bytes): bool throws;
proc reader.consumeNewline(): bool throws;

bradcray commented 2 years ago

Ben, I take it your concept is motivated by wanting to process/consume a value of a given type but not spend the memory on it (?). Like maybe I'd want to do read(myObj); for a class/record C that had a field of 1,000,000 ints without necessarily wanting to allocate and free the memory for those million ints? That seems intriguing, though I suspect it'd also take some doing to have each type support a "read and drop on floor without creating an instance of me" method (which maybe the compiler could create for us?)

For me, this seems like the sort of thing we might keep in mind but put off until we / users have a motivating use-case for it (since adding it later could be a non-breaking change?).

benharsh commented 2 years ago

Yeah, that's the gist of it. I agree that we should put it off.

I prefer Option 1 (match*) of the three.

mppf commented 2 years ago

I prefer Option 1 (match*) of the three.

I updated https://github.com/chapel-lang/chapel/issues/19487#issuecomment-1147487533 to give each option an emoji so it is easy for people to indicate your preference. I turned the 👍 you already put on it into a vote for Option 1.

benharsh commented 2 years ago

I'm working on fixing up my IO operator deprecation PR, and finding that the pair of readLiteral and writeLiteral feels more natural to me, so I've altered my vote from before to Option 2.

Also, I just want to confirm: do you still see readLiteral or matchLiteral accepting an ignoreWhiteSpace argument? I believe this argument is only required during reading, so we shouldn't see this in writeLiteral, right?

mppf commented 2 years ago

I'm working on fixing up my IO operator deprecation PR, and finding that the pair of readLiteral and writeLiteral feels more natural to me, so I've altered my vote from before to Option 2.

Now the voting is very close. Perhaps the next step is to ask for opinions from more people.

Also, I just want to confirm: do you still see readLiteral or matchLiteral accepting an ignoreWhiteSpace argument? I believe this argument is only required during reading, so we shouldn't see this in writeLiteral, right?

Yes, I think that's important, I had just left it out of the name discussion. It was in the original post:

‘matchLiteral’ will accept a string and an optional ignoreWhiteSpace=true argument

Today my reaction is that it should be ignoreWhitespace=true because I think "whitespace" is a single word. (E.g. Wikipedia seems to think it is one word in https://en.wikipedia.org/wiki/Whitespace_character ).

benharsh commented 2 years ago

I'm fine with ignoreWhitespace being two words rather than three, as you propose.

I also noticed that ioNewline has a field skipWhitespaceOnly that defaults to false - do we want to replicate this with an optional argument to {read,match}Newline ?

mppf commented 2 years ago

I also noticed that ioNewline has a field skipWhitespaceOnly that defaults to false - do we want to replicate this with an optional argument to {read,match}Newline ?

No, I want readNewline/matchNewline to be able to skip whitespace but not to be able to skip letters and numbers. I think we should use readUntil for such a thing -- see #19769.

benharsh commented 2 years ago

Should readNewline/matchNewline only skip until the first newline literal it encounters? Right now ioNewline appears to skip consecutive newlines until the last one:


use IO;

proc main() {
  var f = openmem();
  {
    var w = f.writer();
    for i in 1..100 do w.write("\n");
    w.write(5);
  }
  {
    var r = f.reader();
    var ionl = new ioNewline();
    r.read(ionl);
    var x : int;
    r.read(x);
    writeln(x); // read x as '5'
  }
}

Though this could certainly be a bug.

I had overlooked the bool-returning functionality here, and I'm less certain it's the right choice. While returning false if a comma is not present is useful, it requires the user to check the returned result for other things - like the literals representing the bounds. It's incredibly easy to not check that result, and in doing so you would silently ignore an error. As a user I would have to write something like:

// reading the end of a list bounded by square brackets...
if !r.readLiteral("]") then throw <something>; // probably a FormatError?

I would find this more annoying than using try-catch to speculatively test for the presence of a literal.

Could we address this by having both readLiteral and matchLiteral methods? matchLiteral could return true if the literal was found and advance the channel position - otherwise it would return false.

I think this question extends beyond just literals though, so maybe we're asking "How should users speculatively read from a channel?".

benharsh commented 2 years ago

Offline we discussed the naming of these methods, and read was the stronger preference compared to match (7-3).

bradcray commented 2 years ago

it requires the user to check the returned result ... It's incredibly easy to not check that result,

This argument doesn't carry a lot of weight for me—specifically, I think this sort of thing implies to lots of routines (doesn't it?), and would prefer that we address the concern by implementing the proposed warning for ignored return values rather than designing in a way that assumes users won't. Specifically (and I've lost track now), wouldn't the same concern potentially apply to the read(expr1, expr2, expr3) style calls?

I could also imagine cases where the lack of a match wouldn't indicate an error so much as the end of a section, such as the last thing in a CSV file (s.t. being able to look at a boolean might be convenient and my reaction wouldn't be to throw but to terminate a while loop for example...

I think this question extends beyond just literals though, so maybe we're asking "How should users speculatively read from a channel?".

I think that's a good question (and potentially relates to my varargs read example above).

benharsh commented 2 years ago

Specifically (and I've lost track now), wouldn't the same concern potentially apply to the read(expr1, expr2, expr3) style calls?

Yes, and I believe I've expressed concern about that case elsewhere as well.

If I expect to read something, and cannot, then that feels more like an error to be thrown than a true/false result. If I have to check a boolean result here then I need to forward some kind of error anyways in the instances where I expected something. Writing this is kind of annoying even if I get a warning from the compiler about an unused result.

// annoying
if !f.readLiteral("[") then
  throw new FormatError(...);

do {
  push_back(f.read(int));
} while f.readLiteral(", "); // nice

// annoying
if !.freadLiteral("]") then
  throw new FormatError(...);

If we don't return a boolean, then this pattern might look like:

f.readLiteral("[");

var done = false;
while !done {
  push_back(f.read(int));
  try { f.readLiteral(", "); }
  catch { done = false; }
}

f.readLiteral("]");

I guess the utility of using the boolean in a while-loop isn't valuable enough to me in comparison with the extra checking in non-speculative cases, and the possibility of ignoring errors.

lydia-duncan commented 2 years ago

I could also imagine cases where the lack of a match wouldn't indicate an error so much as the end of a section, such as the last thing in a CSV file

My expectation is that in that case the error would be caught - error handling doesn't mean the behavior is bad in all scenarios, just that it can be and we're providing information about what happened. By using error handling, we give the user the ability to respond to the behavior - I think people would object if this became a halt for this very reason

bradcray commented 2 years ago

I think people would object if this became a halt for this very reason

To clarify, I wasn't suggesting this routine should halt (nor even that we remove the throws from the routine, which I think we'd still need for other exceptional conditions?). I was just pushing back on Ben's example of:

// reading the end of a list bounded by square brackets...
if !r.readLiteral("]") then throw <something>; // probably a FormatError?

by imagining the ease of reading a CSV of ints by writing:

do {
  const v = r.read(int);
} while (r.readLiteral(",");

where I wouldn't be interested in throwing my own error if the read failed, because it feels unnecessary and like a more roundabout / non-productive way of writing the pattern.

Ben's example of reading a CSV within square brackets is compelling to me for the square brackets, though being forced to write the while-loop as in his "always throws" version makes me sad (again, it feels very roundabout and anti-productive).

I asked about the "both read and match" proposal above in today's meeting, and re-read the description above, but still am not confident I've got the right understanding of it. Is the concept that the read() versions would throw on mismatches and return nothing while the match() versions would return a bool (and only throw on other file systems-y errors? Such that one could write my CSV of ints case as:

do {
  const v = r.read(int);
} while (r.matchLiteral(",");

and Ben's case with squre brackets as:

f.readLiteral("[");

do {
  push_back(f.read(int));
} while f.matchLiteral(", ");

f.readLiteral("]");

or am I off on the wrong path? If I've got it right, that seems like a pretty nice "cake and eat it too" solution (though I don't have an obvious mnemonic for remembering which is which... though I'm also not sure I would let that stand in the way).

[Ben, can you remind me whether your AoC codes are available somewhere? I ask because I used the bool-returning forms heavily in those codes, and every time I had to use a throw...catch form, I bristled at the additional typing given that the nature of the codes were quick-and-dirty / scripty].

benharsh commented 2 years ago

I asked about the "both read and match" proposal above in today's meeting, and re-read the description above, but still am not confident I've got the right understanding of it. Is the concept that the read() versions would throw on mismatches and return nothing while the match() versions would return a bool (and only throw on other file systems-y errors? Such that one could write my CSV of ints case as:
do {
  const v = r.read(int);
} while (r.matchLiteral(",");
and Ben's case with squre brackets as:
f.readLiteral("[");

do {
  push_back(f.read(int));
} while f.matchLiteral(", ");

f.readLiteral("]");
or am I off on the wrong path? If I've got it right, that seems like a pretty nice "cake and eat it too" solution (though I don't have an obvious mnemonic for remembering which is which... though I'm also not sure I would let that stand in the way).

You've got it right. My suggestion was something like this:

// always throws errors upwards
proc channel.readLiteral(lit:string, ignoreWhitespace=true) : void throws

// returns false if the literal was not present
// I think this means FormatErrors and EOF errors, not just any old error
proc channel.matchLiteral(lit:string, ignoreWhitespace=true) : bool throws

And then we could use those in the manner that you demonstrate in your second example.

An alternative approach would be to try and think of ways to generally handle speculative reads, and turning errors into bools. We could approach this from the language level with some hypothetical new try-based expression. Another idea I've had is to create a kind of "matching-view" over a reader channel, which supports the same methods and signatures with the difference of returning a boolean "false" if there was an error caught.

[Ben, can you remind me whether your AoC codes are available somewhere? I ask because I used the bool-returning forms heavily in those codes, and every time I had to use a throw...catch form, I bristled at the additional typing given that the nature of the codes were quick-and-dirty / scripty].

They aren't, and I'd be interested in going back over them having spent more time thinking about IO stuff recently. I agree that the try-catch stuff isn't good for quick/scripty/fun programs, but I suppose that I haven't been placing much importance on that use-case, personally.

bradcray commented 2 years ago

And then we could use those in the manner that you demonstrate in your second example.

I'm a fan, then. The only reservations I have are (a) where do we draw the line? when do we not create two versions of a routine? what makes this case warrant it more than others? and (b) what are the implications for other status-returning I/O routines.

I suppose that I haven't been placing much importance on that [quick/scripty] use-case, personally.

I think that continues to remain an important case for us to keep in mind: That we should continue trying to support both quick, scripty-style codes (for cases that don't need more; to attract new users) and also more bullet-proof approaches (for when the sketch needs to start becoming production code).

turning errors into bools

I thought about this too, remembering your previous sketches around such features, and continue to be intrigued by them, particularly since they'd help with the "when do we create two versions?" question above. But I'm also not sure I'd want to wait for that feature or rush it in for this case's sake. And even with the feature, I think there'd be extra typing/cruft that would feel not as nice as the above.

But popping back to the top, I agree that making sure we have a unified/coherent story for other speculative reads (or other status-returning reads? are those necessarily the same thing?) feels important before taking the step of adding both overloads.

Another idea I've had is to create a kind of "matching-view" over a reader channel, which supports the same methods and signatures with the difference of returning a boolean "false" if there was an error caught.

That's intriguing as well. Do you have a sketch of what it would look like?

benharsh commented 2 years ago

That's intriguing as well. Do you have a sketch of what it would look like?

In the while-bool case I was thinking of some kind of method on a channel that created a matching-view, so from the user's perspective:

// 'r' is a 'reader'
// the name 'matcher' is just a dummy example
while r.matcher().readLiteral(",") do ...

// or ...
manage r.matcher() as m { ... }

The development cost of maintaining this thing could be significant, as forwarding doesn't allow for us to generically intercept errors or returned values from forwarded methods. I'm not sure how we could (easily) make sure that our wrapper methods stayed in sync with the actual channel methods.

mppf commented 2 years ago

But popping back to the top, I agree that making sure we have a unified/coherent story for other speculative reads (or other status-returning reads? are those necessarily the same thing?) feels important before taking the step of adding both overloads.

Well, one issue is whether the channel position changes or not. Other than saying that a user can use mark/revert if they don't want it to change - I will ignore that for now.

Other than that, it seems that if we have a read call that throws due to a FormatError, we can handle it in a try/catch. I guess I don't see why that is so terrible. Somebody writing CSV parsing code can write a little function to do it, along the lines of

proc checkForComma(ch) : bool {
  try {
    ch.readLiteral(",");
    return true;
  } catch FormatError {
    return false;
  }
}

This does not bother me. I don't think all the patterns have to be short.

(I do think it is reasonable to think about language features that can do this kind of thing; I don't remember if we have issues about them.)

Anyway, even if we have some language feature to write the more general case, I think the situation coming up when reading literals is going to be frequent enough to justify both readLiteral and matchLiteral so having both sounds great to me.

benharsh commented 2 years ago

OK, I think we're converging on supporting a "read" form and a "match" form. In that case, I'm going to propose the following unstable methods for the 1.28 release:

// All of these (read/write/match) will be marked unstable for 1.28

proc channel.writeLiteral(lit:string) : void throws
proc channel.writeLiteral(lit:bytes) : void throws
proc channel.writeNewline() : void throws

// Returns nothing, throws all errors
proc channel.readLiteral(lit:string, ignoreWhitespace=true) : void throws
proc channel.readLiteral(lit:bytes, ignoreWhitespace=true) : void throws
proc channel.readNewline() : void throws

// Returns "true" if the literal was found, in which case the channel advances
// Returns "false" on any FormatError or EOFError encountered, in which case the channel does *not* advance
proc channel.matchLiteral(lit:string, ignoreWhitespace=true) : bool throws
proc channel.matchLiteral(lit:bytes, ignoreWhitespace=true) : bool throws
proc channel.matchNewline() : bool throws

Some open questions:

What should readNewline and matchNewline do about multiple consecutive newlines?
Are we content with the name "matchLiteral"? Or is it too similar and easily confused with "readLiteral"?

I should also point out that readWriteLiteral and readWriteNewline haven't been deprecated yet. If we're not ready to pull the trigger on these, and feel that either the name or signature would change for 1.29, then we could lean on those "readWrite" methods until we're more confident.

bradcray commented 2 years ago

What should readNewline and matchNewline do about multiple consecutive newlines?

I would think it would only deal with the first and that one would have to put it into a while loop to get more than that. What would be the rationale for doing otherwise? Oh, is it the comment above about ioNewline consuming multiple? I'm not sure why it would do that offhand and would be surprised if these routines were to. Did Michael implement it, and does he know?

Are we content with the name "matchLiteral"? Or is it too similar and easily confused with "readLiteral"?

I definitely don't have a mnemonic for keeping track of which is which, but am also not sure that it's a big enough deal to worry about (in the "I can read the manual to remind myself sense").

One other option which I'm not expecting to be greeted with enthusiasm would be to just have readNewLine(), have it take an optional bool param saying whether or not to return a bool (where the default would be to not), and use where clauses and overloads to get the right signatures. But I'd rather just learn what match vs. read in these contexts than pursue this myself.

mppf commented 2 years ago

What should readNewline and matchNewline do about multiple consecutive newlines?

I would think it would only deal with the first and that one would have to put it into a while loop to get more than that. What would be the rationale for doing otherwise? Oh, is it the comment above about ioNewline consuming multiple? I'm not sure why it would do that offhand and would be surprised if these routines were to. Did Michael implement it, and does he know?

I don't know why it is that way today. The only thing I can guess is that it is related to the "ignore whitespace" rule for ioLiteral and maybe it thinks of these earlier newlines as whitespace. I don't have any particular preference for what it should do, though.

Are we content with the name "matchLiteral"? Or is it too similar and easily confused with "readLiteral"?

I definitely don't have a mnemonic for keeping track of which is which, but am also not sure that it's a big enough deal to worry about (in the "I can read the manual to remind myself sense").

I'm happy with the names "match" vs "read" here. To me, "match" is something I want to try doing in a conditional, but "read" is something I want to do unconditionally, so it matches my thinking. Of course we could have readLiteral and tryReadLiteral or something, but I like match better than these.

One other option which I'm not expecting to be greeted with enthusiasm would be to just have readNewLine(), have it take an optional bool param saying whether or not to return a bool (where the default would be to not), and use where clauses and overloads to get the right signatures. But I'd rather just learn what match vs. read in these contexts than pursue this myself.

I wouldn't be upset by this strategy in this case. However I like readLiteral and matchLiteral better.

lydia-duncan commented 2 years ago

// All of these (read/write/match) will be marked unstable for 1.28

proc channel.writeLiteral(lit:string) : void throws
proc channel.writeLiteral(lit:bytes) : void throws
proc channel.writeNewline() : void throws

For what it's worth, I don't think the write variants need to be marked unstable, I think it was pretty clear what we'd do with them

benharsh commented 2 years ago

What should readNewline and matchNewline do about multiple consecutive newlines?

I would think it would only deal with the first and that one would have to put it into a while loop to get more than that. What would be the rationale for doing otherwise? Oh, is it the comment above about ioNewline consuming multiple? I'm not sure why it would do that offhand and would be surprised if these routines were to. Did Michael implement it, and does he know?

I don't know why it is that way today. The only thing I can guess is that it is related to the "ignore whitespace" rule for ioLiteral and maybe it thinks of these earlier newlines as whitespace. I don't have any particular preference for what it should do, though.

I wanted to follow-up on this briefly. Reading a newline was not in fact skipping over multiple newlines. It was the following call to channel.read:

    var x : int;
    r.read(x);

Sorry for the false alarm!

benharsh commented 1 year ago

There were a couple of open questions I raised when I first implemented this functionality. I have a couple of proposed answers to these questions that would hopefully allow for us to move forward with what we have already implemented.

What if a literal contains leading whitespace, and ignoreWhitespace is true (for read/match methods)?

Currently the behavior of readLiteral and matchLiteral is to ignore leading whitespace in the given string literal when ignoreWhitespace=true. This means that a given string like " test" (with three leading spaces) will match the text "test". Setting ignoreWhitespace=false will allow the method to correctly consider leading whitespace in the desired string literal.

We have a couple of options here:

A) Regard the current behavior as a bug, and update the implementation to never ignore leading whitespace in the desired string literal.

B) Regard the current behavior as a "feature", where ignoreWhitespace=true allows for more flexible matching of text. Taking this approach would require users to be more careful about setting ignoreWhitespace=false when whitespace actually matters.

I am currently leaning towards option A. I don't think that either of these should stand in the way of stabilizing these methods.

Should the read/match newline methods contain an ignoreWhitespace=true argument?

The current behavior for readNewline and matchNewline is consistent with ignoreWhitespace=true. I believe we can add an optional argument like this at a later date if desired without breaking any user code, and so I don't think an answer here should delay in stabilizing these methods.

bradcray commented 1 year ago

@benharsh : With respect to the readLiteral()/matchLiteral() behavior w.r.t. whitespace: During AoC 2022 (specifically the monkey exercise), I struggled a lot with when a readf() would or would not skip over whitespace. For example, I believe if I failed to consume a newline and then did readf("age: %i", myInt);, it would complain because the newline didn't match age. In the end, I don't think I found any cases that seemed wrong or inconsistent, but it was a stumbling block for me. I bring it up because I think it would make sense for the default behavior of matchLiteral("Ben") to probably be the same as readf("Ben"); (this may be a no-brainer, but I wanted to mention it since I don't think readf() has come up on this issue).

What I'm less certain of is what matchLiteral(" Ben"); should do. I think in a readf() context, a whitespace character means "match any amount of whitespace" (?). Should matchLiteral() do the same, or take the more literal approach of "I must literally match a single space and no other whitespace". I don't have a strong opinion about what should happen there, but wanted to mention the issue.

benharsh commented 1 year ago

Sorry @bradcray, I was just reviewing this issue and realized I didn't reply to your comment.

@benharsh : With respect to the readLiteral()/matchLiteral() behavior w.r.t. whitespace: During AoC 2022 (specifically the monkey exercise), I struggled a lot with when a readf() would or would not skip over whitespace. For example, I believe if I failed to consume a newline and then did readf("age: %i", myInt);, it would complain because the newline didn't match age. In the end, I don't think I found any cases that seemed wrong or inconsistent, but it was a stumbling block for me. I bring it up because I think it would make sense for the default behavior of matchLiteral("Ben") to probably be the same as readf("Ben"); (this may be a no-brainer, but I wanted to mention it since I don't think readf() has come up on this issue).

I'm not exactly sure whether you're saying the readf behavior here is desirable or not. Either way, the current behavior for matchLiteral is to ignore whitespace by default, which would avoid the problem with newlines. Here's an example program to demonstrate:


use IO;

proc main() {
  var f = openMemFile();
  {
    f.writer().write("Hello\nWorld\n");
  }

  // readf
  {
    try {
      var r = f.reader();
      r.readf("Hello");
      r.readf("World"); // will throw a bad format error
      writeln("no readf errors");
    } catch e {
      writeln(e.message());
    }
  }
  writeln("-----");

  // matchLiteral
  {
    try {
      var r = f.reader();
      writeln(r.matchLiteral("Hello"));
      writeln(r.matchLiteral("World")); // true
    } catch e {
      writeln(e.message());
    }
  }
  writeln("-----");

  // matchLiteral: ignoring whitespace
  {
    try {
      var r = f.reader();
      writeln(r.matchLiteral("Hello", ignoreWhitespace=false));
      writeln(r.matchLiteral("World", ignoreWhitespace=false)); // false
    } catch e {
      writeln(e.message());
    }
  }
}

What I'm less certain of is what matchLiteral(" Ben"); should do. I think in a readf() context, a whitespace character means "match any amount of whitespace" (?). Should matchLiteral() do the same, or take the more literal approach of "I must literally match a single space and no other whitespace". I don't have a strong opinion about what should happen there, but wanted to mention the issue.

To me, the answer here depends on the ignoreWhitespace argument. Let's say our desired literal string is " Foo", and our channel is pointing to the characters:

Foo (that's five spaces)

If ignoreWhitespace=true, then I think it's reasonable to ignore the first four spaces, match the fifth, and then Foo successfully. If we're not ignoring whitespace, which is the opt-in behavior, then I think it's reasonable to not match the text.

But perhaps the real question is what the default should be for ignoreWhitespace?

lydia-duncan commented 1 year ago

Summary for the ad-hoc sub-team discussion next week:

1. What is the API being presented?

We're intending to replace ioLiteral and ioNewline with:

proc fileReader.readLiteral(literal:string, ignoreWhitespace=true) : void throws
proc fileReader.readLiteral(literal:bytes, ignoreWhitespace=true) : void throws
proc fileReader.readNewline() : void throws

// returns 'false' if literal could not be matched, or on EOF
proc fileReader.matchLiteral(literal:string, ignoreWhitespace=true) : bool throws
proc fileReader.matchLiteral(literal:bytes, ignoreWhitespace=true) : bool throws
proc fileReader.matchNewline() : bool throws

proc fileWriter.writeLiteral(literal:string) : void throws
proc fileWriter.writeLiteral(literal:bytes) : void throws
proc fileWriter.writeNewline() : void throws

These methods ignore the serializer/deserializer. Both string and bytes are accepted. readLiteral will throw if something unexpected is found, while matchLiteral will return a bool, which is better for use in looping contexts.

They are intended to use when the contents are expected to be consistent, rather than varying based on the value of a field or variable. For instance, if a file should always start with a specific header, that header could be read or written using these methods. This also useful when using serializers/deserializers where strings are otherwise expected to have surrounding quotation marks.

We still have a couple of questions we want to resolve: A. For readLiteral and matchLiteral, if a literal contains leading whitespace and ignoreWhitespace is true, what should we do?

The current behavior today is to ignore leading whitespace when matching, meaning that a given string like " test" (with three leading spaces) will match the text "test".

B. Should readNewline and matchNewline also get an ignoreWhitespace=true option?

The current behavior today is consistent with having that option and setting it to true.

How is it being used in Arkouda and CHAMPS?

Arkouda has 12 uses in src/compat/*/ArkoudaTimeCompat.chpl that it copied from the Time module to maintain compatibility between versions. There are no uses outside of the compatibility module.

CHAMPS has 6 uses of ioNewline in its vec3 file's writeThis routines.

2. What's the history of the feature, if it already exists?

ioLiteral and ioNewline are records that were added with the rest of the QIO support 12 years ago. They were intended especially to enable code to be written that handled both reading and writing, but we've seen little evidence that supporting both in the same method is really useful, hence splitting them into reading and writing methods. The record aspect of them in particular has seemed unnecessary and also can be problematic when resolving other problems (see related issues below)

3. What's the precedent in other languages, if they support it?

a. In Python, you can pass os.linesep though I don't know if this is done much in practice. b. C/C++ - C++'s std::endl handles the insertion of a newline character when writing. I think using literals with the << operator works in a similar way c. Rust - I'm not seeing an equivalent in Rust d. Swift - I'm not seeing an equivalent in Swift e. Julia - I'm not seeing an equivalent in Julia f. Go - I'm not seeing an equivalent in Go

4. Are there known Github issues with the feature?

I don't think this is because of ioLiteral, but it does give an error message involving it: https://github.com/chapel-lang/chapel/issues/20004
https://github.com/chapel-lang/chapel/issues/16449
proposed as part of a solution for https://github.com/chapel-lang/chapel/issues/8748
a standalone version was suggested in https://github.com/chapel-lang/chapel/issues/21221
is included in part of this proposal: https://github.com/chapel-lang/chapel/issues/14108

5. Are there features it is related to? What impact would changing or adding this feature have on those other features?

can be used in tandem with mark/commit/revert for speculative reads
mentioned as part of an equivalence relationship between read and readln here: https://github.com/chapel-lang/chapel/issues/14107
One of the proposals in https://github.com/chapel-lang/chapel/issues/19498 involved reducing uses of them
- Which is also mentioned here: https://github.com/chapel-lang/chapel/issues/18504

6. How do you propose to solve the problem?

For question A (if a literal contains leading whitespace and ignoreWhitespace is true, what should we do?): A1. Regard the current behavior as a bug, and update the implementation to never ignore leading whitespace in the desired string literal.

A2. Regard the current behavior as a "feature", where ignoreWhitespace=true allows for more flexible matching of text. Taking this approach would require users to be more careful about setting ignoreWhitespace=false when whitespace actually matters.

Ben favors (and I agree with) option A1.

For question B (Should readNewline and matchNewline also get an ignoreWhitespace=true option?): B1. Skip doing this for now, current behavior matches ignoreWhitespace=true so adding the argument wouldn't break existing code

B2. Add the option now

B3. Add the option now and set the default to false, where the methods would return true only if the newline was the first whitespace character encountered from where the fileReader currently is in the file.

Ben favors (and I agree with) option B1

lydia-duncan commented 1 year ago

Discussion summary:

We decided to follow option A1 (leading whitespace in the desired string literal should be taken into account w.r.t. whether the string matches)
- We talked about the implementation impact of this - we think the right strategy is to skip past leading whitespace in the file, then look back the number of leading whitespace characters in the literal to ensure they match. We can reconsider this if the implementation is too tricky.
- We need to write tests with complicated whitespace combinations to ensure this is handled correctly, e.g. mixing whitespace characters.
We decided to follow B1 (don't change the default behavior, but intend to add ignoreWhitespace argument in future). The API decision has been made about it, we can document it, and the implementation should hopefully be easy. Whether this addition does happen before 2.0 or not is unimportant and based on the prioritization of other efforts.

Discussion notes:

Michael: why is writeLiteral different from writeString?
- A: It allows the programmer to describe their intent, it doesn't take a length argument
- Jeremiah: with writeString, won't it support non-UTF8 in the future? Was that something that would matter for writeLiteral?
- Michael: that's more about the difference between string and bytes: fileWriter can have encoding that is not utf8, writeLiteral would also do the character set encoding. It's pretty parallel
Michael: could consider throwing if we are passed a literal with leading whitespace and ignoreWhitespace is set to true
Michael: would prefer to not involve regex to solve this problem
- Ben: what if arguments with leading whitespace throw illegalArgumentErrors and then we relax that later?
- Michael: stopping throwing is a breaking change
(on to matchNewline topic) Ben: having this argument implies ignoring whitespace that wasn't a newline
Jeremiah: could we rename to "ignoreOtherWhitespace"?
- Ben: doesn't consider the current name confusing

chapel-lang / chapel