Closed mppf closed 1 year ago
Would you be willing to put together an example showing these proposed features in action? I'm finding it too abstract to reason about very well.
Sure. Suppose you are working with a channel configured with a JSON encoder but you want to read some stuff that is not JSON. E.g. your input file is this:
arraydata
[1,2,3]
This is not in JSON format on its own, but [1,2,3]
is a JSON array.
You could write the above file like this
var output = ... // create a channel with JSON encoding
var A:[1..3] int = 1..3;
output.writeLiteral("arraydata"); // writes literally arraydata, ignoring Encoder
output.writeNewline(); // writes the newline, ignoring Encoder
output.write(A); // writes [1,2,3]
Note in the above, since the channel is configured for JSON encoding, simply writing
output.write("arraydata");
would produce "arraydata"
(i.e. with the quotes) since that is how JSON encodes strings.
As a result, writeLiteral / writeNewline will be useful when implementing the JSON encoder itself.
As we have talked about in other issues, I do expect to be able to choose a different Encoder/Decoder for specific write calls. However I'd like to have these more purpose-built functions available. For one thing, that way, when writing the JSON Encoder, it doesn't have to use the default encoder internally.
Now a reading example:
var input = ... // create a channel with JSON decoding
var A:[1..3] int;
if input.matchLiteral("arraydata") &&
input.matchNewline() {
// this block runs if the next data in input is literally arraydata followed by newline
input.read(A); // reads [1,2,3]
}
Great, thanks. I'm a fan of the general approach, though I'd prefer to replace match
with read
as the prefix for the reader versions. I'm guessing that the decision to use match
is because nothing is being returned to the caller, as with most of our other read-related routines (is that right?). I'd stick with read
to maintain the read
/write
duality in these routines, and because I think of reading as not necessarily being related to returning values so much as consuming data from a channel, advancing the offset, etc.
Re readLiteral vs matchLiteral, actually there are two reasons we might not want readLiteral:
read
functions returning false only for EOF.I think I'd rather have matchLiteral and have it return false if the literal wasn't found (instead of having readLiteral and having it return false only on EOF and throw if the literal wasn't found).
It doesn't return anything (as you said)
Can it return the literal it just read? Which is redundant, but consistent.
I'd imagine we would want matchLiteral to return false if there was input but it didn't match the literal. But, reading (say) an int when there is just a name would result in a FormatError being raised rather than returning false, since we have read functions returning false only for EOF.
My 2 cents: for this method in question it sounds OK to me if it throws a FormatError if the literal doesn't match rather than returning false. But I am not familiar with the use cases of these methods to argue strongly about it.
Since it's reading a literal, we already know what it will read. So the only useful thing it can do is to tell you if it was able to read the literal or not. Hence my preference for it to return false
rather than throw if there was something other than the literal at that point in the file.
In terms of an example that would use readLiteral
, suppose you are writing a CSV reader that reads integers into a list
. You read an integer, then you need to check if there is a comma, or if it is the end of the line. If there is a comma, you read another integer. This pattern would be really awkward with try/catch.
This may be tangential enough for its own issue, and may not be relevant for stabilization, but:
Would it make sense to have a kind of reader.consume(type T)
and/or reader.consume(const x)
method that works with the decoder and drops the read value on the floor? In a world with interfaces, we could have a Consumable
that requires the implementation of proc Self.consumeThis(r : reader)
, or perhaps some kind of overload on readThis
.
The ability to write r.consume(new Foo(...))
seems like it might come in handy.
Or the ability to write r.consume(Foo)
if we're reading a bunch of different types from a file, and don't care about bringing Foo
s into memory.
Regarding reader.consume
, I'm not really understanding why it's needed, when we already have reader.read(myThing)
that could do it. Is the concern just that since read
accepts something by ref
, we cant write something like
reader.read(new Foo())? This does not bother me except for the case of literals and newlines, which I think deserve their own method. Which is the purpose of this issue - we want, rather than
reader.read(new ioLiteral(","))to have
reader.readLiteral(",")or maybe
reader.matchLiteral(","). I think this is a good argument for why readLiteral
/matchLiteral
should return bool
.
So, IMO anyway, it's pretty clear what the signatures of these methods should be. What I think needs more discussion is which names we would like:
proc reader.matchLiteral(lit: string): bool throws;
proc reader.matchLiteral(lit: bytes): bool throws;
proc reader.matchNewline(): bool throws;
proc reader.readLiteral(lit: string): bool throws;
proc reader.readLiteral(lit: bytes): bool throws;
proc reader.readNewline(): bool throws;
proc reader.consumeLiteral(lit: string): bool throws;
proc reader.consumeLiteral(lit: bytes): bool throws;
proc reader.consumeNewline(): bool throws;
Ben, I take it your concept is motivated by wanting to process/consume a value of a given type but not spend the memory on it (?). Like maybe I'd want to do read(myObj);
for a class/record C
that had a field of 1,000,000 ints without necessarily wanting to allocate and free the memory for those million ints? That seems intriguing, though I suspect it'd also take some doing to have each type support a "read and drop on floor without creating an instance of me" method (which maybe the compiler could create for us?)
For me, this seems like the sort of thing we might keep in mind but put off until we / users have a motivating use-case for it (since adding it later could be a non-breaking change?).
Yeah, that's the gist of it. I agree that we should put it off.
I prefer Option 1 (match*) of the three.
I prefer Option 1 (match*) of the three.
I updated https://github.com/chapel-lang/chapel/issues/19487#issuecomment-1147487533 to give each option an emoji so it is easy for people to indicate your preference. I turned the 👍 you already put on it into a vote for Option 1.
I'm working on fixing up my IO operator deprecation PR, and finding that the pair of readLiteral
and writeLiteral
feels more natural to me, so I've altered my vote from before to Option 2.
Also, I just want to confirm: do you still see readLiteral
or matchLiteral
accepting an ignoreWhiteSpace
argument? I believe this argument is only required during reading, so we shouldn't see this in writeLiteral
, right?
I'm working on fixing up my IO operator deprecation PR, and finding that the pair of
readLiteral
andwriteLiteral
feels more natural to me, so I've altered my vote from before to Option 2.
Now the voting is very close. Perhaps the next step is to ask for opinions from more people.
Also, I just want to confirm: do you still see
readLiteral
ormatchLiteral
accepting anignoreWhiteSpace
argument? I believe this argument is only required during reading, so we shouldn't see this inwriteLiteral
, right?
Yes, I think that's important, I had just left it out of the name discussion. It was in the original post:
‘matchLiteral’ will accept a string and an optional ignoreWhiteSpace=true argument
Today my reaction is that it should be ignoreWhitespace=true
because I think "whitespace" is a single word. (E.g. Wikipedia seems to think it is one word in https://en.wikipedia.org/wiki/Whitespace_character ).
I'm fine with ignoreWhitespace
being two words rather than three, as you propose.
I also noticed that ioNewline has a field skipWhitespaceOnly
that defaults to false
- do we want to replicate this with an optional argument to {read,match}Newline
?
I also noticed that ioNewline has a field
skipWhitespaceOnly
that defaults tofalse
- do we want to replicate this with an optional argument to{read,match}Newline
?
No, I want readNewline
/matchNewline
to be able to skip whitespace but not to be able to skip letters and numbers. I think we should use readUntil
for such a thing -- see #19769.
Should readNewline
/matchNewline
only skip until the first newline literal it encounters? Right now ioNewline appears to skip consecutive newlines until the last one:
use IO;
proc main() {
var f = openmem();
{
var w = f.writer();
for i in 1..100 do w.write("\n");
w.write(5);
}
{
var r = f.reader();
var ionl = new ioNewline();
r.read(ionl);
var x : int;
r.read(x);
writeln(x); // read x as '5'
}
}
Though this could certainly be a bug.
I had overlooked the bool-returning functionality here, and I'm less certain it's the right choice. While returning false
if a comma is not present is useful, it requires the user to check the returned result for other things - like the literals representing the bounds. It's incredibly easy to not check that result, and in doing so you would silently ignore an error. As a user I would have to write something like:
// reading the end of a list bounded by square brackets...
if !r.readLiteral("]") then throw <something>; // probably a FormatError?
I would find this more annoying than using try-catch to speculatively test for the presence of a literal.
Could we address this by having both readLiteral
and matchLiteral
methods? matchLiteral
could return true
if the literal was found and advance the channel position - otherwise it would return false.
I think this question extends beyond just literals though, so maybe we're asking "How should users speculatively read from a channel?".
Offline we discussed the naming of these methods, and read
was the stronger preference compared to match
(7-3).
it requires the user to check the returned result ... It's incredibly easy to not check that result,
This argument doesn't carry a lot of weight for me—specifically, I think this sort of thing implies to lots of routines (doesn't it?), and would prefer that we address the concern by implementing the proposed warning for ignored return values rather than designing in a way that assumes users won't. Specifically (and I've lost track now), wouldn't the same concern potentially apply to the read(expr1, expr2, expr3)
style calls?
I could also imagine cases where the lack of a match wouldn't indicate an error so much as the end of a section, such as the last thing in a CSV file (s.t. being able to look at a boolean might be convenient and my reaction wouldn't be to throw but to terminate a while
loop for example...
I think this question extends beyond just literals though, so maybe we're asking "How should users speculatively read from a channel?".
I think that's a good question (and potentially relates to my varargs read example above).
Specifically (and I've lost track now), wouldn't the same concern potentially apply to the read(expr1, expr2, expr3) style calls?
Yes, and I believe I've expressed concern about that case elsewhere as well.
If I expect to read something, and cannot, then that feels more like an error to be thrown than a true/false result. If I have to check a boolean result here then I need to forward some kind of error anyways in the instances where I expected something. Writing this is kind of annoying even if I get a warning from the compiler about an unused result.
// annoying
if !f.readLiteral("[") then
throw new FormatError(...);
do {
push_back(f.read(int));
} while f.readLiteral(", "); // nice
// annoying
if !.freadLiteral("]") then
throw new FormatError(...);
If we don't return a boolean, then this pattern might look like:
f.readLiteral("[");
var done = false;
while !done {
push_back(f.read(int));
try { f.readLiteral(", "); }
catch { done = false; }
}
f.readLiteral("]");
I guess the utility of using the boolean in a while-loop isn't valuable enough to me in comparison with the extra checking in non-speculative cases, and the possibility of ignoring errors.
I could also imagine cases where the lack of a match wouldn't indicate an error so much as the end of a section, such as the last thing in a CSV file
My expectation is that in that case the error would be caught - error handling doesn't mean the behavior is bad in all scenarios, just that it can be and we're providing information about what happened. By using error handling, we give the user the ability to respond to the behavior - I think people would object if this became a halt for this very reason
I think people would object if this became a halt for this very reason
To clarify, I wasn't suggesting this routine should halt (nor even that we remove the throws
from the routine, which I think we'd still need for other exceptional conditions?). I was just pushing back on Ben's example of:
// reading the end of a list bounded by square brackets...
if !r.readLiteral("]") then throw <something>; // probably a FormatError?
by imagining the ease of reading a CSV of ints by writing:
do {
const v = r.read(int);
} while (r.readLiteral(",");
where I wouldn't be interested in throwing my own error if the read failed, because it feels unnecessary and like a more roundabout / non-productive way of writing the pattern.
Ben's example of reading a CSV within square brackets is compelling to me for the square brackets, though being forced to write the while-loop as in his "always throws" version makes me sad (again, it feels very roundabout and anti-productive).
I asked about the "both read and match" proposal above in today's meeting, and re-read the description above, but still am not confident I've got the right understanding of it. Is the concept that the read() versions would throw on mismatches and return nothing while the match() versions would return a bool (and only throw on other file systems-y errors? Such that one could write my CSV of ints case as:
do {
const v = r.read(int);
} while (r.matchLiteral(",");
and Ben's case with squre brackets as:
f.readLiteral("[");
do {
push_back(f.read(int));
} while f.matchLiteral(", ");
f.readLiteral("]");
or am I off on the wrong path? If I've got it right, that seems like a pretty nice "cake and eat it too" solution (though I don't have an obvious mnemonic for remembering which is which... though I'm also not sure I would let that stand in the way).
[Ben, can you remind me whether your AoC codes are available somewhere? I ask because I used the bool-returning forms heavily in those codes, and every time I had to use a throw...catch form, I bristled at the additional typing given that the nature of the codes were quick-and-dirty / scripty].
I asked about the "both read and match" proposal above in today's meeting, and re-read the description above, but still am not confident I've got the right understanding of it. Is the concept that the read() versions would throw on mismatches and return nothing while the match() versions would return a bool (and only throw on other file systems-y errors? Such that one could write my CSV of ints case as:
do { const v = r.read(int); } while (r.matchLiteral(",");
and Ben's case with squre brackets as:
f.readLiteral("["); do { push_back(f.read(int)); } while f.matchLiteral(", "); f.readLiteral("]");
or am I off on the wrong path? If I've got it right, that seems like a pretty nice "cake and eat it too" solution (though I don't have an obvious mnemonic for remembering which is which... though I'm also not sure I would let that stand in the way).
You've got it right. My suggestion was something like this:
// always throws errors upwards
proc channel.readLiteral(lit:string, ignoreWhitespace=true) : void throws
// returns false if the literal was not present
// I think this means FormatErrors and EOF errors, not just any old error
proc channel.matchLiteral(lit:string, ignoreWhitespace=true) : bool throws
And then we could use those in the manner that you demonstrate in your second example.
An alternative approach would be to try and think of ways to generally handle speculative reads, and turning errors into bools. We could approach this from the language level with some hypothetical new try-based expression. Another idea I've had is to create a kind of "matching-view" over a reader channel, which supports the same methods and signatures with the difference of returning a boolean "false" if there was an error caught.
[Ben, can you remind me whether your AoC codes are available somewhere? I ask because I used the bool-returning forms heavily in those codes, and every time I had to use a throw...catch form, I bristled at the additional typing given that the nature of the codes were quick-and-dirty / scripty].
They aren't, and I'd be interested in going back over them having spent more time thinking about IO stuff recently. I agree that the try-catch stuff isn't good for quick/scripty/fun programs, but I suppose that I haven't been placing much importance on that use-case, personally.
And then we could use those in the manner that you demonstrate in your second example.
I'm a fan, then. The only reservations I have are (a) where do we draw the line? when do we not create two versions of a routine? what makes this case warrant it more than others? and (b) what are the implications for other status-returning I/O routines.
I suppose that I haven't been placing much importance on that [quick/scripty] use-case, personally.
I think that continues to remain an important case for us to keep in mind: That we should continue trying to support both quick, scripty-style codes (for cases that don't need more; to attract new users) and also more bullet-proof approaches (for when the sketch needs to start becoming production code).
turning errors into bools
I thought about this too, remembering your previous sketches around such features, and continue to be intrigued by them, particularly since they'd help with the "when do we create two versions?" question above. But I'm also not sure I'd want to wait for that feature or rush it in for this case's sake. And even with the feature, I think there'd be extra typing/cruft that would feel not as nice as the above.
But popping back to the top, I agree that making sure we have a unified/coherent story for other speculative reads (or other status-returning reads? are those necessarily the same thing?) feels important before taking the step of adding both overloads.
Another idea I've had is to create a kind of "matching-view" over a reader channel, which supports the same methods and signatures with the difference of returning a boolean "false" if there was an error caught.
That's intriguing as well. Do you have a sketch of what it would look like?
That's intriguing as well. Do you have a sketch of what it would look like?
In the while-bool case I was thinking of some kind of method on a channel that created a matching-view, so from the user's perspective:
// 'r' is a 'reader'
// the name 'matcher' is just a dummy example
while r.matcher().readLiteral(",") do ...
// or ...
manage r.matcher() as m { ... }
The development cost of maintaining this thing could be significant, as forwarding doesn't allow for us to generically intercept errors or returned values from forwarded methods. I'm not sure how we could (easily) make sure that our wrapper methods stayed in sync with the actual channel methods.
But popping back to the top, I agree that making sure we have a unified/coherent story for other speculative reads (or other status-returning reads? are those necessarily the same thing?) feels important before taking the step of adding both overloads.
Well, one issue is whether the channel position changes or not. Other than saying that a user can use mark/revert if they don't want it to change - I will ignore that for now.
Other than that, it seems that if we have a read
call that throws due to a FormatError, we can handle it in a try/catch. I guess I don't see why that is so terrible. Somebody writing CSV parsing code can write a little function to do it, along the lines of
proc checkForComma(ch) : bool {
try {
ch.readLiteral(",");
return true;
} catch FormatError {
return false;
}
}
This does not bother me. I don't think all the patterns have to be short.
(I do think it is reasonable to think about language features that can do this kind of thing; I don't remember if we have issues about them.)
Anyway, even if we have some language feature to write the more general case, I think the situation coming up when reading literals is going to be frequent enough to justify both readLiteral and matchLiteral so having both sounds great to me.
OK, I think we're converging on supporting a "read" form and a "match" form. In that case, I'm going to propose the following unstable methods for the 1.28 release:
// All of these (read/write/match) will be marked unstable for 1.28
proc channel.writeLiteral(lit:string) : void throws
proc channel.writeLiteral(lit:bytes) : void throws
proc channel.writeNewline() : void throws
// Returns nothing, throws all errors
proc channel.readLiteral(lit:string, ignoreWhitespace=true) : void throws
proc channel.readLiteral(lit:bytes, ignoreWhitespace=true) : void throws
proc channel.readNewline() : void throws
// Returns "true" if the literal was found, in which case the channel advances
// Returns "false" on any FormatError or EOFError encountered, in which case the channel does *not* advance
proc channel.matchLiteral(lit:string, ignoreWhitespace=true) : bool throws
proc channel.matchLiteral(lit:bytes, ignoreWhitespace=true) : bool throws
proc channel.matchNewline() : bool throws
Some open questions:
I should also point out that readWriteLiteral
and readWriteNewline
haven't been deprecated yet. If we're not ready to pull the trigger on these, and feel that either the name or signature would change for 1.29, then we could lean on those "readWrite" methods until we're more confident.
What should readNewline and matchNewline do about multiple consecutive newlines?
I would think it would only deal with the first and that one would have to put it into a while loop to get more than that. What would be the rationale for doing otherwise? Oh, is it the comment above about ioNewline
consuming multiple? I'm not sure why it would do that offhand and would be surprised if these routines were to. Did Michael implement it, and does he know?
Are we content with the name "matchLiteral"? Or is it too similar and easily confused with "readLiteral"?
I definitely don't have a mnemonic for keeping track of which is which, but am also not sure that it's a big enough deal to worry about (in the "I can read the manual to remind myself sense").
One other option which I'm not expecting to be greeted with enthusiasm would be to just have readNewLine()
, have it take an optional bool param saying whether or not to return a bool (where the default would be to not), and use where clauses and overloads to get the right signatures. But I'd rather just learn what match vs. read in these contexts than pursue this myself.
What should readNewline and matchNewline do about multiple consecutive newlines?
I would think it would only deal with the first and that one would have to put it into a while loop to get more than that. What would be the rationale for doing otherwise? Oh, is it the comment above about
ioNewline
consuming multiple? I'm not sure why it would do that offhand and would be surprised if these routines were to. Did Michael implement it, and does he know?
I don't know why it is that way today. The only thing I can guess is that it is related to the "ignore whitespace" rule for ioLiteral and maybe it thinks of these earlier newlines as whitespace. I don't have any particular preference for what it should do, though.
Are we content with the name "matchLiteral"? Or is it too similar and easily confused with "readLiteral"?
I definitely don't have a mnemonic for keeping track of which is which, but am also not sure that it's a big enough deal to worry about (in the "I can read the manual to remind myself sense").
I'm happy with the names "match" vs "read" here. To me, "match" is something I want to try doing in a conditional, but "read" is something I want to do unconditionally, so it matches my thinking. Of course we could have readLiteral
and tryReadLiteral
or something, but I like match
better than these.
One other option which I'm not expecting to be greeted with enthusiasm would be to just have
readNewLine()
, have it take an optional bool param saying whether or not to return a bool (where the default would be to not), and use where clauses and overloads to get the right signatures. But I'd rather just learn what match vs. read in these contexts than pursue this myself.
I wouldn't be upset by this strategy in this case. However I like readLiteral
and matchLiteral
better.
// All of these (read/write/match) will be marked unstable for 1.28 proc channel.writeLiteral(lit:string) : void throws proc channel.writeLiteral(lit:bytes) : void throws proc channel.writeNewline() : void throws
For what it's worth, I don't think the write variants need to be marked unstable, I think it was pretty clear what we'd do with them
What should readNewline and matchNewline do about multiple consecutive newlines?
I would think it would only deal with the first and that one would have to put it into a while loop to get more than that. What would be the rationale for doing otherwise? Oh, is it the comment above about
ioNewline
consuming multiple? I'm not sure why it would do that offhand and would be surprised if these routines were to. Did Michael implement it, and does he know?I don't know why it is that way today. The only thing I can guess is that it is related to the "ignore whitespace" rule for ioLiteral and maybe it thinks of these earlier newlines as whitespace. I don't have any particular preference for what it should do, though.
I wanted to follow-up on this briefly. Reading a newline was not in fact skipping over multiple newlines. It was the following call to channel.read
:
var x : int;
r.read(x);
Sorry for the false alarm!
There were a couple of open questions I raised when I first implemented this functionality. I have a couple of proposed answers to these questions that would hopefully allow for us to move forward with what we have already implemented.
What if a literal contains leading whitespace, and ignoreWhitespace is true (for read/match methods)?
Currently the behavior of readLiteral
and matchLiteral
is to ignore leading whitespace in the given string literal when ignoreWhitespace=true
. This means that a given string like " test"
(with three leading spaces) will match the text "test"
. Setting ignoreWhitespace=false
will allow the method to correctly consider leading whitespace in the desired string literal.
We have a couple of options here:
A) Regard the current behavior as a bug, and update the implementation to never ignore leading whitespace in the desired string literal.
B) Regard the current behavior as a "feature", where ignoreWhitespace=true
allows for more flexible matching of text. Taking this approach would require users to be more careful about setting ignoreWhitespace=false
when whitespace actually matters.
I am currently leaning towards option A. I don't think that either of these should stand in the way of stabilizing these methods.
Should the read/match newline methods contain an ignoreWhitespace=true argument?
The current behavior for readNewline
and matchNewline
is consistent with ignoreWhitespace=true
. I believe we can add an optional argument like this at a later date if desired without breaking any user code, and so I don't think an answer here should delay in stabilizing these methods.
@benharsh : With respect to the readLiteral()/matchLiteral() behavior w.r.t. whitespace: During AoC 2022 (specifically the monkey exercise), I struggled a lot with when a readf()
would or would not skip over whitespace. For example, I believe if I failed to consume a newline and then did readf("age: %i", myInt);
, it would complain because the newline didn't match age
. In the end, I don't think I found any cases that seemed wrong or inconsistent, but it was a stumbling block for me. I bring it up because I think it would make sense for the default behavior of matchLiteral("Ben") to probably be the same as readf("Ben");
(this may be a no-brainer, but I wanted to mention it since I don't think readf() has come up on this issue).
What I'm less certain of is what matchLiteral(" Ben");
should do. I think in a readf() context, a whitespace character means "match any amount of whitespace" (?). Should matchLiteral() do the same, or take the more literal approach of "I must literally match a single space and no other whitespace". I don't have a strong opinion about what should happen there, but wanted to mention the issue.
Sorry @bradcray, I was just reviewing this issue and realized I didn't reply to your comment.
@benharsh : With respect to the readLiteral()/matchLiteral() behavior w.r.t. whitespace: During AoC 2022 (specifically the monkey exercise), I struggled a lot with when a readf() would or would not skip over whitespace. For example, I believe if I failed to consume a newline and then did readf("age: %i", myInt);, it would complain because the newline didn't match age. In the end, I don't think I found any cases that seemed wrong or inconsistent, but it was a stumbling block for me. I bring it up because I think it would make sense for the default behavior of matchLiteral("Ben") to probably be the same as readf("Ben"); (this may be a no-brainer, but I wanted to mention it since I don't think readf() has come up on this issue).
I'm not exactly sure whether you're saying the readf behavior here is desirable or not. Either way, the current behavior for matchLiteral
is to ignore whitespace by default, which would avoid the problem with newlines. Here's an example program to demonstrate:
use IO;
proc main() {
var f = openMemFile();
{
f.writer().write("Hello\nWorld\n");
}
// readf
{
try {
var r = f.reader();
r.readf("Hello");
r.readf("World"); // will throw a bad format error
writeln("no readf errors");
} catch e {
writeln(e.message());
}
}
writeln("-----");
// matchLiteral
{
try {
var r = f.reader();
writeln(r.matchLiteral("Hello"));
writeln(r.matchLiteral("World")); // true
} catch e {
writeln(e.message());
}
}
writeln("-----");
// matchLiteral: ignoring whitespace
{
try {
var r = f.reader();
writeln(r.matchLiteral("Hello", ignoreWhitespace=false));
writeln(r.matchLiteral("World", ignoreWhitespace=false)); // false
} catch e {
writeln(e.message());
}
}
}
What I'm less certain of is what matchLiteral(" Ben"); should do. I think in a readf() context, a whitespace character means "match any amount of whitespace" (?). Should matchLiteral() do the same, or take the more literal approach of "I must literally match a single space and no other whitespace". I don't have a strong opinion about what should happen there, but wanted to mention the issue.
To me, the answer here depends on the ignoreWhitespace
argument. Let's say our desired literal string is " Foo"
, and our channel is pointing to the characters:
Foo
(that's five spaces)
If ignoreWhitespace=true
, then I think it's reasonable to ignore the first four spaces, match the fifth, and then Foo
successfully. If we're not ignoring whitespace, which is the opt-in behavior, then I think it's reasonable to not match the text.
But perhaps the real question is what the default should be for ignoreWhitespace
?
Summary for the ad-hoc sub-team discussion next week:
We're intending to replace ioLiteral and ioNewline with:
proc fileReader.readLiteral(literal:string, ignoreWhitespace=true) : void throws
proc fileReader.readLiteral(literal:bytes, ignoreWhitespace=true) : void throws
proc fileReader.readNewline() : void throws
// returns 'false' if literal could not be matched, or on EOF
proc fileReader.matchLiteral(literal:string, ignoreWhitespace=true) : bool throws
proc fileReader.matchLiteral(literal:bytes, ignoreWhitespace=true) : bool throws
proc fileReader.matchNewline() : bool throws
proc fileWriter.writeLiteral(literal:string) : void throws
proc fileWriter.writeLiteral(literal:bytes) : void throws
proc fileWriter.writeNewline() : void throws
These methods ignore the serializer/deserializer. Both string and bytes are accepted. readLiteral
will throw if something unexpected is found, while matchLiteral
will return a bool
, which is better for use in looping contexts.
They are intended to use when the contents are expected to be consistent, rather than varying based on the value of a field or variable. For instance, if a file should always start with a specific header, that header could be read or written using these methods. This also useful when using serializers/deserializers where strings are otherwise expected to have surrounding quotation marks.
We still have a couple of questions we want to resolve:
A. For readLiteral
and matchLiteral
, if a literal contains leading whitespace and ignoreWhitespace
is true
, what should we do?
The current behavior today is to ignore leading whitespace when matching, meaning that a given string like " test"
(with three leading spaces) will match the text "test"
.
B. Should readNewline
and matchNewline
also get an ignoreWhitespace=true
option?
The current behavior today is consistent with having that option and setting it to true
.
Arkouda has 12 uses in src/compat/*/ArkoudaTimeCompat.chpl that it copied from the Time module to maintain compatibility between versions. There are no uses outside of the compatibility module.
CHAMPS has 6 uses of ioNewline in its vec3 file's writeThis routines.
ioLiteral and ioNewline are records that were added with the rest of the QIO support 12 years ago. They were intended especially to enable code to be written that handled both reading and writing, but we've seen little evidence that supporting both in the same method is really useful, hence splitting them into reading and writing methods. The record aspect of them in particular has seemed unnecessary and also can be problematic when resolving other problems (see related issues below)
a. In Python, you can pass os.linesep
though I don't know if this is done much in practice.
b. C/C++ - C++'s std::endl handles the insertion of a newline character when writing. I think using literals with the <<
operator works in a similar way
c. Rust - I'm not seeing an equivalent in Rust
d. Swift - I'm not seeing an equivalent in Swift
e. Julia - I'm not seeing an equivalent in Julia
f. Go - I'm not seeing an equivalent in Go
read
and readln
here: https://github.com/chapel-lang/chapel/issues/14107For question A (if a literal contains leading whitespace and ignoreWhitespace
is true
, what should we do?):
A1. Regard the current behavior as a bug, and update the implementation to never ignore leading whitespace in the desired string literal.
A2. Regard the current behavior as a "feature", where ignoreWhitespace=true
allows for more flexible matching of text. Taking this approach would require users to be more careful about setting ignoreWhitespace=false
when whitespace actually matters.
Ben favors (and I agree with) option A1.
For question B (Should readNewline
and matchNewline
also get an ignoreWhitespace=true
option?):
B1. Skip doing this for now, current behavior matches ignoreWhitespace=true
so adding the argument wouldn't break existing code
B2. Add the option now
B3. Add the option now and set the default to false, where the methods would return true only if the newline was the first whitespace character encountered from where the fileReader currently is in the file.
Ben favors (and I agree with) option B1
Discussion summary:
ignoreWhitespace
argument in future). The API decision has been made about it, we can document it, and the implementation should hopefully be easy. Whether this addition does happen before 2.0 or not is unimportant and based on the prioritization of other efforts.Discussion notes:
Today we have records ‘ioLiteral’ and ‘ioNewline’ to differentiate from regular string reads/writes:
This situation continues with our current Encoder/Decoder plans [18504]
However, ‘ioLiteral’ and ‘ioNewline’ as separate types seems unnecessarily confusing
Proposal: