chapel-lang / chapel

a Productive Parallel Programming Language
https://chapel-lang.org

I/O module: readUntil #19769

Closed mppf closed 1 year ago

mppf commented 2 years ago

This issue is a spin-off from issue #19496.

This issue proposes having a readUntil for reading a string/bytes until some kind of separator. Since reading a line is a very common operation, that should get its own function (e.g. readLine as discussed in #19495). However, a function that reads until an arbitrary separator is useful for other cases.

Here is a sketch of what this might look like:

// maxSize arguments indicate that the function should throw
// if it finds an input longer than that (and leave the input in place)

// keepSeparator means that a separator found in the input will be included in
// the returned string. Note that if the input reaches EOF without a separator,
// the returned string won't contain a separator, even if keepSeparator=true.

// The first set uses a separator that is a string/bytes, so it could be e.g. "end".

// For this one, t can be bytes or string
proc reader.readUntil(type t=string, separator: t, maxSize=-1, keepSeparator=true): t throws

// these two functions:
//   return `false` if EOF is reached and no data is read
//   resize the passed string/bytes (but may reuse the existing buffer)
proc reader.readUntil(ref s: string, separator: string, maxSize=-1, keepSeparator=true): bool throws
proc reader.readUntil(ref b: bytes, separator: bytes, maxSize=-1, keepSeparator=true): bool throws

// The second set reads until a regular expression
proc reader.readUntil(type t=string, separator: regex(t), maxSize=-1, keepSeparator=true): t throws

// these two functions:
//   return `false` if EOF is reached and no data is read
//   resize the passed string/bytes (but may reuse the existing buffer)
proc reader.readUntil(ref s: string, separator: regex(string), maxSize=-1, keepSeparator=true): bool throws
proc reader.readUntil(ref b: bytes, separator: regex(bytes), maxSize=-1, keepSeparator=true): bool throws

These functions have some similarity to:

Should it be called readUntil? Or does that imply that it leaves the separator in the input? I don't think this function should leave the separator in the input. I'm not sure I can think of a better name. readPast doesn't sound great to me (and "past" could be misinterpreted as "read from history").

Also, should we have set-of-strings (for "read until any of these characters" operations) or a read-until-whitespace variant?

bradcray commented 2 years ago

What do you think about having a bool/enum argument that says whether to consume the delimiter or leave it in place? That would make the rationale for these names stronger. Making it not have a default would have the benefit of forcing the user to check their assumptions, though it might be slightly annoying to anyone who thought there was an "obvious" right default.

I can imagine cases where the delimiter would want to be left alone. For example in reading CLBG fasta files, I could imagine using a readUntil(..., ">", ) in which I would not want it to consume the ">" but to leave that as the start of the next sequence. There isn't really anything that marks the end of a sequence apart from the start of the next one or the EOF itself.

Which raises one other question: Am I correct that if EOF is reached before the delimiter is found, this will consume the rest of the file (and indicate EOF)?

mppf commented 2 years ago

> What do you think about having a bool/enum argument that says whether to consume the delimiter or leave it in place? That would make the rationale for these names stronger. Making it not have a default would have the benefit of forcing the user to check their assumptions, though it might be slightly annoying to anyone who thought there was an "obvious" right default.

Seems reasonable. Do we really need all 2x2 combinations? They are:

  1. store separator in resulting string but leave it in the input
  2. store separator in resulting string and consume it from the input
  3. don't store separator in resulting string and leave it in the input
  4. don't store separator in resulting string and consume it from the input

Arguably (1) is weird/confusing (because it would lead to reading the separator twice). If we could get away with not supporting (4) then we could have just one bool, where if it's stored in the result, it's consumed from the input; and if not, it's not consumed.

> I can imagine cases where the delimiter would want to be left alone. For example in reading CLBG fasta files, I could imagine using a readUntil(..., ">", ) in which I would not want it to consume the ">" but to leave that as the start of the next sequence. There isn't really anything that marks the end of a sequence apart from the start of the next one or the EOF itself.

Yeah, that makes sense. I imagine though in such cases you wouldn't want the ">" in the resulting string (so it is case 3 above).

> Which raises one other question: Am I correct that if EOF is reached before the delimiter is found, this will consume the rest of the file (and indicate EOF)?

I tried to say this earlier:

// keepSeparator means that a separator found in the input will be included in
// the returned string. Note that if the input reaches EOF without a separator,
// the returned string won't contain a separator, even if keepSeparator=true.

It's intended to behave like readLine proposals in this way. If there is no separator and you pass keepSeparator=true, you'll get a string out that doesn't have the separator, and it would return that string / true and not throw (because some data was read). The next I/O would indicate EOF.

If you have keepSeparator=false, you won't be able to notice this situation, other than that the next read indicates EOF.
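The four store/consume combinations and the EOF rule described above can be modeled outside Chapel. The following is a minimal Python sketch, not the proposed Chapel API: `read_until`, its `consume`/`include` flags, and the use of `None` for "EOF with no data" are all assumptions made for illustration.

```python
import io

def read_until(stream, sep, consume=True, include=True):
    """Toy model of the 2x2 proposal on a seekable text stream.

    Returns None when EOF is reached and no data is read.
    """
    start = stream.tell()
    rest = stream.read()              # everything that remains in the input
    if rest == "":
        return None                   # EOF, nothing read
    idx = rest.find(sep)
    if idx == -1:
        return rest                   # no separator: consume to EOF, no separator in result
    end_of_sep = idx + len(sep)
    result = rest[:end_of_sep] if include else rest[:idx]
    # Leave the position after the separator (consume) or at its start (leave).
    stream.seek(start + (end_of_sep if consume else idx))
    return result

if __name__ == "__main__":
    for consume in (True, False):
        for include in (True, False):
            s = io.StringIO("ACGT>next")
            got = read_until(s, ">", consume=consume, include=include)
            # Second value printed is what the next read would see.
            print(consume, include, repr(got), repr(s.read()))
```

With `consume=False, include=True` (case 1 in the list above), the separator shows up both in the result and at the start of the next read, which is the "reading the separator twice" behavior called weird above.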

bradcray commented 2 years ago

All of options 2-4 seem useful to me. Not supporting 4 seems similar to not supporting a "dropNewline"-style option for readLine(). I.e., if I'm doing readUntil("\n") I may want to advance past the newline yet not have it store up in my result (and equivalently for other separators apart from newline, I think).

mppf commented 2 years ago

From an off-issue discussion, it was approximately split 50/50 between people who thought that we really do need all 2x2 combinations to be available and people who thought that we only needed 2 combinations - always consume but chomp/strip it or not.

mppf commented 1 year ago

The most recent proposal was to have two boolean arguments to specify all of the 4 possible behaviors above. The boolean arguments could be

There is a proposal to use an enum instead. A straw-person for that is enum separator { leave, consume, return } but that specific proposal can't work because return is a keyword.

We discussed these in a meeting but there was not a convergence.

bradcray commented 1 year ago

Some early-morning musings on this:

bradcray commented 1 year ago

Re-reading the OP in https://github.com/chapel-lang/chapel/issues/21392 makes me realize that another way to handle the naming of the two-routine solution would be to have readLine() handle the "consume separator" case by taking an optional separator; and then to have readUntil() require a separator and support the exclusive behavior. The main downside of this is that it arguably abuses the intuitive notion of what a "line" is; it also makes the readLine() interface a little more complicated.

That also makes me think of readWord() or readChunk() as alternative "consume separator" routine names that focus on what is being consumed like readLine() rather than what is being stopped at.

mppf commented 1 year ago

I'm often advocating for different routine names for different behaviors, but it doesn't bother me in this case that readUntil can consume the separator or not.

> That also makes me think of readWord() or readChunk() as alternative "consume separator" routine names that focus on what is being consumed like readLine() rather than what is being stopped at.

Yeah, that seems like an interesting possibility. Other ideas along those lines: readDelimited or readSeparated.

What about the enum idea? I think my main reservation with it is, if we add an enum here because of some nervousness with code like readUntil(x, true, true) that could be better written readUntil(x, consumeSeparator=true, includeSeparator=true) -- why aren't we making all of our bool formals into enums? Why doesn't readLine use an enum for the stripNewline formal? Can't we also argue that readLine(x, true) is better written readLine(x, stripNewline=true)?

I guess at the end of the day, using an enum here feels to me like it's enforcing a kind of style guidance. At the same time, I still think that the two-argument version has clearer meaning.

Lastly, include is a keyword, so it can't be an enum element here.

Anyway, heading in a different direction. Suppose we had readDelimited. I've been thinking we might wish to make the formal argument name more consistent with readLine.

What about proc readDelimited(ref s: string, separator: string, maxSize=-1, readSeparator=true, stripSeparator=false) ?

So to get the 3 behaviors we think have use-cases:

That leaves no way to get the 4th behavior that we don't think is useful, which should be OK.

About the name readSeparator: one can argue that this name is not ideal, because the separator is technically read either way; it's just a matter of where the channel position is left. (So, going based on how the implementation will work, the name would be something like rewindToSeparatorStart.) Anyway, I think that saying "The separator was read" has close enough meaning to "The channel current position is beyond the separator" for this purpose, and the name readSeparator is relatively intuitive.

lydia-duncan commented 1 year ago

It worries me a little to have an argument that gets ignored, but I'd be okay with it if we were very explicit in its documentation that that would happen.

jeremiah-corrado commented 1 year ago

I'm also somewhat opposed to ignoring the second argument if the first is false.

The don't-consume-but-do-include behavior is definitely weird and I don't think many people will use it, but I think it would be confusing to define a method where it appears to be supported without actually supporting it. Put another way, I think it's better if users can go through the process of writing: while readUntil(s, -1, false, true) do writeln(s);, getting something they maybe didn't expect, and then changing their code and seeing a change in output.

Assuming my current implementation makes sense, supporting readSeparator=false, includeSeparator=true is as simple as appending the separator string to the returned value, so I don't see a strong reason not to support it (at least from the implementation perspective).

more on implementation...

FWIW, this is what the logic to interpret the flags looks like in my implementation (at least for now). Here `readUntil(..., true, true)` and `readUntil(..., false, false)` are the most efficient modes, and the other two modes involve an extra step that manipulates the string before returning it:

```chapel
// numBytesRead contains the number of bytes between the channel's starting position
// and the start of the delimiter...
if consumeSeparator {
  // read until after the separator
  err = readStringBytesData(
    s, this._channel_internal,
    numBytesRead + numSepBytes,
    numCodepointsRead + numSepCodepoints
  );
}
if !includeSeparator {
  // remove the separator from the string
  s = s[0..
```

jeremiah-corrado commented 1 year ago

After writing tests for the draft readUntil implementation, I'm more in favor of separating the functionality into two separate methods — similar to what @bradcray was suggesting above.

My main reason for this is that I wasn't able to come up with any single test case that could realistically exercise all four options (or even the three most plausible ones). Tests either focused on a consuming read (read until after the delimiter) or a non-consuming read (read up until the delimiter). Within these separate tests, I exercised both the include/don't-include options to produce different output.

Examples of the two types of behavior I'm referring to:

  1. Read a list of items (consume must be true to repeatedly reuse the same delimiter)

    while reader.readUntil(s, "|" , -1, consume=true, include=true/false) {
        myList.append(s);
    }

    (in this pattern readUntil is more or less a generalization of readLine)

  2. read up until a particular delimiter, and then look for something else that begins with that same delimiter

    var x = reader.readUntil(string, "|", consume=false, include=true/false);
    
    var y = reader.read(typeThatStartsWithBar);

    (in this pattern, readUntil is sort of a variation of fileReader.match that also returns the contents of the channel up until the match was found (or up until EOF — unlike match)).


Here is a straw proposal for what an interface might look like with two separate methods:

fileReader.readDelimited(delimiter: string, stripDelimiter: bool = false): string;
fileReader.readDelimited(ref s: string, delimiter: string, stripDelimiter = false): bool;

fileReader.readUpTo(pattern: string): string;
fileReader.readUpTo(ref s: string, pattern: string): bool;

(this would also include the analogous bytes and regex overloads)

In this proposal, readDelimited is ostensibly a generalization of readLine without any caveats about the consume vs. don't-consume option. Like readLine, there is only one flag to select whether the delimiter should be included in the returned string, and it can use the same "strip" terminology to be more uniform with readLine.

The readUpTo method could simply omit the includeSeparator flag to avoid the strange fourth case we were discussing above (i.e., don't consume, but do return). Implementation-wise, it would still need to read the entire pattern, but the pattern would never be returned.

Additionally, I believe separating out the don't-consume behavior into its own method would prevent users from accidentally doing this:

while reader.readUntil(s, delim, consume=false) {
    ...
}

because this is more obviously a non-terminating loop (I think):

while reader.readUpTo(s, delim) {
    ...
}
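
To make the non-termination hazard concrete, here is a small Python model (not Chapel) of a non-consuming read called in a loop. The helper `read_up_to` and its return convention (`True` whenever a separator is found, even if zero characters precede it) are assumptions for illustration, and the loop is bounded on purpose.

```python
import io

def read_up_to(stream, sep):
    """Non-consuming read: return (ok, text before sep), leaving the position at sep."""
    start = stream.tell()
    rest = stream.read()
    idx = rest.find(sep)
    if idx == -1:
        return (rest != ""), rest     # no separator: consume to EOF
    stream.seek(start + idx)          # rewind to the start of the separator
    return True, rest[:idx]

def demo_no_progress(iterations=4):
    """Call read_up_to repeatedly and record (ok, text, position) each time."""
    reader = io.StringIO("a|b|c")
    seen = []
    for _ in range(iterations):       # an unbounded `while` would spin forever here
        ok, s = read_up_to(reader, "|")
        seen.append((ok, s, reader.tell()))
    return seen

if __name__ == "__main__":
    print(demo_no_progress())
```

After the first call, every later call finds the separator immediately at the current position, returns an empty string, and leaves the position unchanged.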
mppf commented 1 year ago

Thanks @jeremiah-corrado for implementing it & sharing your experience. We also have to contend with #19610 and trying to make something consistent.

I think that is possible with @jeremiah-corrado's proposal:

fileReader.readDelimited(delimiter: string, stripDelimiter: bool = false): string;
fileReader.readDelimited(ref s: string, delimiter: string, stripDelimiter = false): bool;

fileReader.readUpTo(pattern: string): string;
fileReader.readUpTo(ref s: string, pattern: string): bool;
proc fileReader.advanceDelimited(byte: int(8)) throws;
proc fileReader.advanceDelimitedLine(): void throws;

proc fileReader.advanceUpTo(byte: int(8)) throws;
proc fileReader.advanceUpToLine(): void throws;

I'm pretty happy with this direction, but I think advanceDelimitedLine is the name here that I like the least, FWIW. Arguably, we could just provide advanceUpTo and people wanting to also skip the byte can advance one more. (In any case, discussion of the details of these advance methods belongs on issue #19610; I am bringing it up here only because it is a way to evaluate our readUntil ideas. In particular, I'm trying to evaluate whether the strategy for readUntil can generalize to the advance functions.)

jeremiah-corrado commented 1 year ago

I know you noted on the other issue that you don't like the "past" wording as much @mppf, but I'd personally be okay with it here and for the analogous advance procedure. Specifically:

proc fileReader.readPast(delimiter: string, stripDelimiter: bool = false): string throws;
proc fileReader.readPast(ref s: string, delimiter: string, stripDelimiter: bool = false): bool throws;

proc fileReader.readUpTo(pattern: string): string throws;
proc fileReader.readUpTo(ref s: string, pattern: string): bool throws;
proc fileReader.advancePast(byte: int(8)) throws;
proc fileReader.advancePastNewline(): void throws;

proc fileReader.advanceUpTo(byte: int(8)) throws;
proc fileReader.advanceUpToNewline(): void throws;

I think readPast sounds okay with the first argument being named delimiter (as in "read past this delimiter"). Same thing for advancePast(byte). I also think "past" and "up to" contrast nicely to differentiate the two separate behaviors we're talking about.

If we do decide to use two separate methods, I'm also open to the naming proposal you showed above for the advance methods, or something else.

mppf commented 1 year ago

@jeremiah-corrado - regarding this:

> read up until a particular delimiter, and then look for something else that begins with that same delimiter

In your thinking through examples, did you find cases where the result of reading up to a particular delimiter was used? I.e., should the pattern you are talking about here be using an advanceUpTo function instead?

jeremiah-corrado commented 1 year ago

Those tests definitely used some pretty contrived toy examples. I guess I can imagine something slightly more realistic like:

var htmlBody: string;
if htmlReader.readUpTo(htmlBody, "<footer>") {
    // do something with `htmlBody` ...
    var numExternalLinks = countNumberOfExternalLinks(htmlBody);

    var footer = htmlReader.read(HtmlFooterObject);
    // ...
}

But arguably, that should/could just be:

var htmlBody: string;
if htmlReader.readDelimited(htmlBody, "</body>", stripDelimiter=false) {
    // do something with `htmlBody` ...
    var numExternalLinks = countNumberOfExternalLinks(htmlBody);

    var footer = htmlReader.read(HtmlFooterObject);
    // ...
}

input I'm imagining ...

```
...
...
```

mppf commented 1 year ago

We've had a lot of discussion around this function and I'd like to try to summarize the ideas and their Pros and Cons. I will do that in this comment and I aim to update it with any ideas or Pros/Cons brought up in further discussion.

Using an enum:

Pros:

Cons:

Using two flags:

Pros:

Cons:

Neutral:

Using two different routines:

(the first routine shown in these ideas consumes the delimiter, the second does not)

Pros:

Cons:

jeremiah-corrado commented 1 year ago

In a recent offline design discussion, we decided to go with the two-method proposal from above and made good progress on fleshing out the details of the proposal.

Working proposal:

We came up with a tentative set of names for readUntil's replacements as well as the analogous "advance" methods:

| behavior | read | advance |
| --- | --- | --- |
| consume newline | readLine(..., stripNewLine=false) | advanceLine(...) |
| consume separator | readPast(..., separator, stripSeparator=false) | advancePast(..., separator) |
| up to separator | readUpTo(..., separator) | advanceUpTo(..., separator) |

(readLine already exists)

readUpTo/advanceUpTo names:

The group consensus was that readUpTo and advanceUpTo are acceptable names for those methods; however we weren't sure that those are the best options. We'd like to do a bit more investigation and brainstorming before landing on them definitively.

I'll investigate what some other languages do (if they have this functionality) and report here to see if that sparks any ideas.

We also discussed:

separator argument name and type

In each of the methods with a separator argument, we'd like it to be able to take at least a string and bytes argument and potentially a regex(string)/regex(bytes). Note: whether we ultimately go with names like advancePast — as opposed to advancePastByte — depends on what we choose for the types of this argument.

I'll do a performance investigation to see if advancePast(separator="a") performs on par with advancePastByte("a".toByte()). If there is no performance drawback, then we intend to go with the method names and argument types described above.

We also discussed whether to call the argument "delimiter" or something else, but landed on "separator" because there is a precedent in other places in the library, and it sounds general enough to encapsulate string, bytes and regex(?) (whereas "delimiter" in particular doesn't sound like it would refer to a regular expression separator).

separator argument length

We discussed whether we should wait to support multi-character / multi-codepoint delimiters. There were a few reasons we may want to do so:

As such, we may want to restrict the separator to a length of 1 byte by making it a param and emitting a compile-time error if it is any longer. This restriction could be lifted as a non-breaking change in the future.

I'll do some further investigation to see if it's possible to create performant versions of these methods that allow multi-byte separators. This would likely involve calling a simpler implementation when the separator is a single byte, and a heavier implementation otherwise.

lydia-duncan commented 1 year ago

We decided not to just use the name readLine for the readUntil functionality, due to the potential for confusion if a separator other than \n was provided.

bradcray commented 1 year ago

I like the two-routine approach. As far as names go, I prefer readUntil("\n") over readUpTo("\n") (it rolls off the tongue better and I like that it's just two words; I think until suggests exclusive behavior if one were in doubt and didn't want to read the manual). Rather than readPast("\n"), I'd probably use readThrough("\n") (because it isn't actually reading past the \n at all).

jeremiah-corrado commented 1 year ago

> I think until suggests exclusive behavior

I'd be okay with readUntil as a name, but I think readUpTo is slightly more clear w.r.t where the pointer is left off.

This could just be me, but somehow this:

var spanInnerText = htmlReader.readUpTo("</span>");

feels clearer than:

var spanInnerText = htmlReader.readUntil("</span>");

> Rather than readPast("\n"), I'd probably use readThrough("\n")

We hadn't considered readThrough, but I agree that it sounds better. advanceThrough("\n") also sounds good to me.

jeremiah-corrado commented 1 year ago

Also, here is what I've found so far looking at other languages' buffered I/O methods:

| language | consuming read | non-consuming read | consuming ("\n" specific) |
| --- | --- | --- | --- |
| C++ | getline | getline + unget | getline (w/o delim arg) |
| Rust | read_until | read_until + seek | read_line |
| Python | csv reader w/ delimiter arg | ? | readline |
| Go | custom ScanWords function | ? | ScanLines |
| Java | ? | ? | [readLine](https://docs.oracle.com/javase/10/docs/api/java/io/BufferedReader.html#readLine()) |

The non-consuming-read doesn't seem like a very common behavior.
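
For languages without a built-in non-consuming read, the "consuming read, then seek back" pattern listed for C++ and Rust can be written by hand. Below is a hedged Python sketch of that pattern on a seekable binary stream; `read_through` and `read_to` are hypothetical helper names chosen here, not standard library functions.

```python
import io

def read_through(f, delim: bytes) -> bytes:
    """Consuming read: return data up to and including delim, or up to EOF."""
    if not delim:
        raise ValueError("delimiter must be non-empty")
    out = bytearray()
    while True:
        b = f.read(1)                 # byte-at-a-time for clarity, not speed
        if not b:
            break                     # EOF before the delimiter was found
        out += b
        if out.endswith(delim):
            break
    return bytes(out)

def read_to(f, delim: bytes) -> bytes:
    """Non-consuming read: consuming read followed by a seek back over delim."""
    data = read_through(f, delim)
    if data.endswith(delim):
        f.seek(-len(delim), io.SEEK_CUR)   # put the delimiter back
        return data[:-len(delim)]
    return data

if __name__ == "__main__":
    f = io.BytesIO(b">seq1\nACGT>seq2\nTTGA")
    read_through(f, b">")             # skip the first '>'
    print(read_to(f, b">"))           # record body; the next '>' stays in the input
    print(f.read(1))
```

The seek-back step only works on seekable inputs, which is one reason a dedicated non-consuming routine backed by the reader's own buffer is attractive.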

mppf commented 1 year ago

Hmm... the fact that Rust has a consuming read called read_until is evidence that it's not so obvious that readUntil would be a non-consuming read.

jeremiah-corrado commented 1 year ago

> I'll do some further investigation to see if it's possible to create performant versions of these methods that allow multi-byte separators. This would likely involve calling a simpler implementation when the separator is a single byte, and a heavier implementation otherwise.

I ran a performance comparison on one of the revcomp shootout benchmarks using a procedure like the following:

proc advancePast(separator: string) {
  if separator.numBytes == 1 {
    advancePastByte(separator.toByte());
  } else {
    slowAdvancePastImpl(separator);
  }
}

I.e., I collected average execution times for two revcomp codes on a large problem size:

(1) the current code that calls advancePastByte directly, without the conditional:

reader.advancePastByte(">".toByte());

(2) and another that uses the above procedure:

reader.advancePast(">");

The performance difference was a couple of orders of magnitude smaller than the program's total runtime. So I'd be good to go ahead with a design that allows for multi-byte/multi-codepoint separators and conditionally uses higher-performance implementations for single-byte separators when possible.

We had discussed using param separator arguments s.t. the faster implementation could be selected at compile time; however, this had little to no effect on performance in the above test, so I'm inclined to use non-param separators.
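
The runtime dispatch described above can be sketched independently of Chapel. This Python model works on an in-memory buffer and returns the new position; the function names and the "move to end of buffer when the separator is missing" rule are assumptions for illustration, not the behavior of the Chapel routines.

```python
def advance_past_byte(buf: bytes, pos: int, byte: int) -> int:
    """Fast path: look for one byte value starting at pos."""
    i = buf.find(bytes([byte]), pos)
    return len(buf) if i == -1 else i + 1

def advance_past_general(buf: bytes, pos: int, sep: bytes) -> int:
    """General path: look for a multi-byte separator starting at pos."""
    i = buf.find(sep, pos)
    return len(buf) if i == -1 else i + len(sep)

def advance_past(buf: bytes, pos: int, sep: bytes) -> int:
    """Pick the implementation at run time from the separator length."""
    if len(sep) == 1:
        return advance_past_byte(buf, pos, sep[0])
    return advance_past_general(buf, pos, sep)

if __name__ == "__main__":
    print(advance_past(b"ab>cd", 0, b">"))        # single-byte path
    print(advance_past(b"ab</x>cd", 0, b"</x>"))  # multi-byte path
```

The only per-call cost the dispatch adds is one length check, which is consistent with the measurement above that the difference was negligible next to total runtime.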

lydia-duncan commented 1 year ago

> Rather than readPast("\n"), I'd probably use readThrough("\n") (because it isn't actually reading past the \n at all).

We did discuss the use of past in the meeting, but when we actually looked at code examples, it was pretty clear what it was doing. Here are the code examples we looked at:

while readPast("-", s, stripSeparator=false) {
    myList.append(s);
}

while readDelimited("-", s, stripSeparator=false) {
    myList.append(s);
}

while readSeparated("-", s, stripSeparator=false) {
    myList.append(s);
}

while readPastSeparator("-", s, stripSeparator=false) {
    myList.append(s);
}
jeremiah-corrado commented 1 year ago

I think readThrough is also pretty clear in the context of that example:

while r.readThrough("-", s, stripSeparator=true) {
    myList.append(s);
}

I also think we need to consider the problem Brad brought up that readPast could be interpreted as readAfter. As in: "If you look past the hill, you'll see a mountain".

I would still interpret the meaning of readPast("\n") as: read and put the pointer after the next newline. However some might interpret it as: read whatever comes after the newline. I don't think readThrough has this problem.

jeremiah-corrado commented 1 year ago

Here is a more detailed summary of the interface I think we should implement based on discussion so far:

// IO module:
proc fileReader.readThrough(separator: ?t, maxSize=-1, stripSeparator=false): t throws
  where t==string || t==bytes { ... }
proc fileReader.readThrough(ref s: string, separator: string, maxSize=-1, stripSeparator=false): bool throws { ... }
proc fileReader.readThrough(ref b: bytes, separator: bytes, maxSize=-1, stripSeparator=false): bool throws { ... }

proc fileReader.readUpTo(separator: ?t, maxSize=-1): t throws
  where t==string || t==bytes { ... }
proc fileReader.readUpTo(ref s: string, separator: string, maxSize=-1): bool throws { ... }
proc fileReader.readUpTo(ref b: bytes, separator: bytes, maxSize=-1): bool throws { ... }

proc fileReader.advanceThrough(separator: string) throws { ... }
proc fileReader.advanceThrough(separator: bytes) throws { ... }

proc fileReader.advanceUpTo(separator: string) throws { ... }
proc fileReader.advanceUpTo(separator: bytes) throws { ... }

// Formatted IO module:
proc fileReader.readThrough(separator: regex(?t), maxSize=-1, stripSeparator=false): t throws
  where t==string || t==bytes { ... }
proc fileReader.readThrough(ref s: string, separator: regex(string), maxSize=-1, stripSeparator=false): bool throws { ... }
proc fileReader.readThrough(ref s: bytes, separator: regex(bytes), maxSize=-1, stripSeparator=false): bool throws { ... }

The four "advance" methods will use qio_channel_advance_past_byte under the hood when the separator is a single byte. Otherwise, they'll leverage the same helper function as readThrough and readUpTo to find the location of the separator in the channel, and then advance to that point.

I think implementing only the regex version of readThrough for now is a good start. If users need something like readUpTo(regex(?)) or advanceUpTo(regex(?)), I think those are similar enough to the existing channel.search(regex(?)):regexMatch that they don't need to be implemented right away. These could be left as a post-2.0 task or wait for a user request. (Alternatively, these methods would all rely on the same underlying _findRegexMatch() function, so it wouldn't be too much more work to implement them all).

This is just a stake in the ground to see how people feel; I am of course still open to modifying any of the names or interface details.

jeremiah-corrado commented 1 year ago

In an ad-hoc subteam discussion, we've landed on the following design for the consuming/non-consuming read and advance methods on the fileReader type:

New Interface

The proposal from the previous message has been modified slightly. The "UpTo" methods have been renamed to use "To" instead, the regex readThrough overloads have been moved to the Regex module as tertiary methods instead of living in the FormattedIO module, and the ref string/bytes formal arguments now come after the separator:

// IO module:
proc fileReader.readThrough(separator: ?t, maxSize=-1, stripSeparator=false): t throws
  where t==string || t==bytes { ... }
proc fileReader.readThrough(separator: string, ref s: string, maxSize=-1, stripSeparator=false): bool throws { ... }
proc fileReader.readThrough(separator: bytes, ref b: bytes, maxSize=-1, stripSeparator=false): bool throws { ... }

proc fileReader.readTo(separator: ?t, maxSize=-1): t throws
  where t==string || t==bytes { ... }
proc fileReader.readTo(separator: string, ref s: string, maxSize=-1): bool throws { ... }
proc fileReader.readTo(separator: bytes, ref b: bytes, maxSize=-1): bool throws { ... }

proc fileReader.advanceThrough(separator: string) throws { ... }
proc fileReader.advanceThrough(separator: bytes) throws { ... }

proc fileReader.advanceTo(separator: string) throws { ... }
proc fileReader.advanceTo(separator: bytes) throws { ... }

// Regex module:
proc fileReader.readThrough(separator: regex(?t), maxSize=-1, stripSeparator=false): t throws
  where t==string || t==bytes { ... }
proc fileReader.readThrough(separator: regex(string), ref s: string, maxSize=-1, stripSeparator=false): bool throws { ... }
proc fileReader.readThrough(separator: regex(bytes), ref s: bytes, maxSize=-1, stripSeparator=false): bool throws { ... }

Design details:

Code Examples

Here are some examples of how each new method could be used:

readThrough: Read a comma-separated list of integers into a list(int):

use IO, List;

var l = new list(int),
     s: string,
     r = openReader("commaSeparatedList.txt");

while r.readThrough(",", s, stripSeparator=true) {
  l.append(s:int);
}

readTo and advanceThrough: Read the contents of a <details> tag from an html file:

use IO;

var r = openReader("website.html");

r.advanceThrough("<details>");
var detailsInnerText = r.readTo("</details>");

advanceTo: Read a type that is delimited by "|", skipping everything before it:

use IO;

record t {
  var x: int;

  proc readThis(fr) throws {
    fr.matchLiteral("|");
    this.x = fr.read(int);
    fr.matchLiteral("|");
  }
}

var r = openReader("textIDontWantAndThenT.txt");

r.advanceTo("|");
var myT = r.read(t);

readThrough(regex): Read a list of integers separated by commas or newlines into a list(int):

use IO, Regex, List;

var l = new list(int),
     s: string,
     r = openReader("commaAndNewlineSeparatedList.txt");

const commaOrNewline = compile("[,\\n]");

while r.readThrough(commaOrNewline, s, stripSeparator=true) {
  l.append(s:int);
}
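
For readers who want to check their understanding of where each method leaves the read position, here is a compact Python model of the four behaviors, including a regex separator. It is a sketch under stated assumptions, not the Chapel implementation: the `Reader` class and its method names are hypothetical, and a missing separator simply moves the position to the end of the input.

```python
import re

class Reader:
    """Toy in-memory reader; `pos` plays the role of the channel position."""

    def __init__(self, text):
        self.text = text
        self.pos = 0

    def _find(self, sep):
        """Return (start, end) of the next separator at or after pos, or None."""
        if isinstance(sep, re.Pattern):
            m = sep.search(self.text, self.pos)
            return (m.start(), m.end()) if m else None
        i = self.text.find(sep, self.pos)
        return (i, i + len(sep)) if i != -1 else None

    def read_through(self, sep, strip_separator=False):
        """Consuming read; returns None at EOF when no data is left."""
        if self.pos >= len(self.text):
            return None
        span = self._find(sep)
        if span is None:
            out, self.pos = self.text[self.pos:], len(self.text)
            return out
        start, end = span
        out = self.text[self.pos:(start if strip_separator else end)]
        self.pos = end                # position lands after the separator
        return out

    def read_to(self, sep):
        """Non-consuming read; the separator stays in the input."""
        if self.pos >= len(self.text):
            return None
        span = self._find(sep)
        stop = span[0] if span else len(self.text)
        out, self.pos = self.text[self.pos:stop], stop
        return out

    def advance_through(self, sep):
        span = self._find(sep)
        self.pos = span[1] if span else len(self.text)

    def advance_to(self, sep):
        span = self._find(sep)
        self.pos = span[0] if span else len(self.text)

if __name__ == "__main__":
    r = Reader("1,2\n3,4")
    comma_or_newline = re.compile(r"[,\n]")
    items = []
    while (s := r.read_through(comma_or_newline, strip_separator=True)) is not None:
        items.append(int(s))
    print(items)
```

The final element is returned even though no separator follows it, matching the rule from earlier in the thread that reaching EOF after reading some data is not an error.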

More examples can be found in the tests in this PR: https://github.com/chapel-lang/chapel/pull/21703/files