Add library functions for stripping leading whitespace from strings

daviditen commented 6 years ago

Issue #7867 discussed the possibility of automatically removing leading whitespace from multi-line strings, but we think that it would be better (at least for non-interpreted strings) to provide library functions to do this.

var s1 = """This
            string spans
            multiple lines""".trim();

param s2 = """This
            string spans
            multiple lines""".trim();

Would create strings:

This
string spans
multiple lines

The name trim is an idea for a name, not a decision that has been made. This method could also take an argument to indicate how much leading space to remove. It would be good to have a version that operates on and returns a param string so that the s2 case above is supported.

buddha314 commented 6 years ago

Python has three strip() routines, including lstrip() and rstrip(), which makes programming a bit more racey.

daviditen commented 6 years ago

What this issue is meaning to request is actually less like strip() and more like textwrap.dedent().

cassella commented 6 years ago

I'm curious about the param version -- I can't think how one would write a chapel function that takes a param string as input and returns a param string based on it like this. Given param p = "hello", I can't even write param q = p[3].

daviditen commented 6 years ago

I'm curious about the param version -- I can't think how one would write a chapel function that takes a param string as input and returns a param string based on it like this.

It would require expanding what the compiler can do with param strings. I don't think a pure Chapel param version could be done right now without compiler changes.

ben-albrecht commented 4 years ago

I would really like to see this feature get added to the standard library. Multi-line strings are pretty ugly to use today because they require you to drop your indentation:

module M {
  record R {
    proc printHelp() {
      writeln("""Multi-line strings
do not look great when you have
to drop your indentation like this""");
     }
  }
}

For the purposes of moving forward, I propose the following answers to the open design questions:

(1) Where should this method/function live?

It should live in String.chpl. Python creates a TextWrap module for this functionality, but most of the functions from that module feel like things that belong in a package module / mason package in the Chapel world. I would argue trim() / dedent() belongs in the standard library though, because multi-line strings are significantly less useful without them (see motivation above).

@e-kayrakli mentioned that we may create a StringUtil.chpl module at some point down the road (I can link the issue when I find it). It may make sense to migrate this function over to StringUtil when it exists, but I also wouldn't want the design and development of StringUtil to hold this feature up.

(2) Should this be a procedure or method?

A method seems most natural to me. It's easier to keep track of the open/closed parentheses, and you lose less horizontal white space for your multi-line string. Example below:

```chpl // method module M { record R { proc printHelp() { writeln("""Multi-line strings look great when don't have to drop your indentation""".trim()); } } } ``` vs. ```chpl // procedure module M { record R { proc printHelp() { writeln(trim("""Multi-line strings look great when don't have to drop your indentation""")); } } } ```

(3) What should this be called?

I would say trim() or dedent(). I slightly lean towards dedent() because there is precedent for it in Python. It is a pretty arbitrary decision so we could settle it via polling the dev team.

(4) How to handle param strings?

As @daviditen points out above, this depends on some new compiler features. This shouldn't hold up the introduction of this feature, though.

bradcray commented 4 years ago

I would really like to see this feature get added to that standard library.

Do you want to take it on?

ben-albrecht commented 4 years ago

Do you want to take it on?

I am happy to take this on. I think the implementation is fairly straightforward (~40 lines of code in python, e.g.). The main blocker is reaching consensus on the design decisions.

e-kayrakli commented 4 years ago

Quick comments (to some extent discussed with Ben offline)

(1) Where should this method/function live?

String.chpl is fine in the beginning, but I think this is borderline "package" functionality. If there is a StringUtil in the future as a mason package, I'd be for moving it there.

Relevant discussion: https://github.com/chapel-lang/chapel/issues/15095#issuecomment-598712463 and to some extent my comment below

(2) Should this be a procedure or method?

Can't see a good motivation for a proc. I think it should be a method.

(3) What should this be called?

Strongly leaning towards dedent or something else other than trim. I used Java's trim before and it was doing what our strip is doing. So trim would be confusing for someone with similar background

(4) How to handle param strings?

I don't think this is a prerequisite for this. It shouldn't be very difficult to support that in our compiler, either.

bradcray commented 4 years ago

Going backwards:

(4) I agree the param case can be deferred; params will still work with a non-param version of the function, they just won't have the advantage of being processed at compile-time, so will incur execution-time overhead. That seems acceptable given the convenience-oriented nature of the function.

(3) I think dedent is more descriptive than trim

(2) I tend to agree with method over standalone

(1) No strong preference, though a downside of putting it into String.chpl is that it makes it always-available / documented in a weird place. To me it feels more standard library-ish, possibly package-ish. I'd probably use it as an excuse to kick off StringUtil, though that'll raise the question of what else in String.chpl should be moved there (which is a good question to ask).

bradcray commented 4 years ago

The main question I have about this is whether:

writeln("""hi
         there
         ben""".dedent());

works as expected or not, or whether the initial line has to have the same number of spaces as the subsequent ones to work properly. The Python example I'm looking at is unclear on that point. Also, what is the behavior of a case where each line has a differing number of spaces:

writeln("""hi
          there
           ben
            albrecht""".dedent());

ben-albrecht commented 4 years ago

(1) Where should this method/function live?

String.chpl is fine in the beginning, but I think this is borderline "package" functionality. If there is a StringUtil in the future as a mason package, I'd be for moving it there.

Relevant discussion: https://github.com/chapel-lang/chapel/issues/15095#issuecomment-598712463 and to some extent my comment below

Ah, I thought StringUtil would be a standard library module.

I would prefer dedent be made available in the standard library (whether that be through String.chpl or something else), since multi-line strings is far less ergonomic without it.

(3) What should this be called?

Strongly leaning towards dedent or something else other than trim. I used Java's trim before and it was doing what our strip is doing. So trim would be confusing for someone with similar background

Good point. I did not realize Java had a trim method. I am in favor of dedent as well then.

ben-albrecht commented 4 years ago

The main question I have about this is whether:
writeln("""hi
         there
         ben""".dedent());
works as expected or not, or whether the initial line has to have the same number of spaces as the subsequent ones to work properly. The Python example I'm looking at is unclear on that point.

In Python, dedent doesn't do anything special for the first line, so the minimum indentation would be 0 in that example (from the first line), and you'd get this as the output:

hi
                        there
                        ben

That is why they suggest doing the following in the python documentation:

print(dedent("""\
                hi
                there
                ben"""))

We could special-case the first line to give your example the expected output (all indentation removed). However, this would break the case where a user does want indentation only on the first line:

writeln("""
            hi
          there
        brad!""".dedent());

User wants:

    hi
  there
brad!

Special-casing first line would make it:

hi
  there
brad!

Maybe this could be controlled via an argument in dedent()?

bradcray commented 4 years ago

My intuition would be to only special-case the first line in the event that it did not start with a space.

(I'd also be open to an optional argument saying whether or not the first line should be special-cased or not, defaulting to "yes"; along those lines, I liked the proposal above to have an optional "how much space to dedent by" argument, where the default (of 0 maybe?) would essentially be "autodetect").

ben-albrecht commented 4 years ago

My intuition would be to only special-case the first line in the event that it did not start with a space.

I think that could work.

(I'd also be open to an optional argument saying whether or not the first line should be special-cased or not, defaulting to "yes"; along those lines, I liked the proposal above to have an optional "how much space to dedent by" argument, where the default (of 0 maybe?) would essentially be "autodetect").

I am also open to that.

ben-albrecht commented 4 years ago

Any ideas for the arg name? I'll propose level for now (level of dedentation).

Here's the current straw-function signature:

proc string.dedent(level=0) { ... }

bradcray commented 4 years ago

proc string.dedent(columns=0, ignoreFirst=true);

???

ben-albrecht commented 4 years ago

Iterating a bit further before we request input from the core dev team:

/* Removes indentation from a (multi-line) string and returns the resulting string.

      :arg columns: Maximum number of columns of indentation to remove. If ``0``, the number of columns will be determined automatically from the line with the minimum indentation.
      :arg ignoreFirst:  If `true`, ignores the first line if the first line is not indented when automatically determining the number of columns to remove (when ``columns == 0``). 

*/
proc string.dedent(columns=0, ignoreFirst=true) { }

Here are a few test cases I've whipped up to demonstrate some of the expected behaviors:

```chpl // // Default behavior (no arguments) // { const s = """1 2 3""".dedent(); writeln(s); /* 1 2 3 */ } { const s = """ 1 2 3""".dedent(columns=10); writeln(s); /* 1 2 3 */ } { const s = """ 1 2 3 """.dedent(); writeln(s); /* 1 2 3 */ } { // Lines with only whitespace are ignored and treated as a single newline // character. Note there is trailing whitespace in this example: const s = """ 1 2 3 """.dedent(); writeln(s); /* 1 2 3 */ } // // Arguments // { // Recognize the indentation level of the first line const s = """1 2 3""".dedent(ignoreFirst=false); writeln(s); /* 1 2 3 */ } { // Remove 10 columns of indentation (there are 11 columns before 2 and 3) const s = """1 2 3""".dedent(columns=10); writeln(s); /* 1 2 3 */ } ```

A few more questions that came up as I was writing this. I'll give some proposed answers as well:

(Q1) How should ignoreFirst=false and columns!=0 interact?

Since ignoreFirst impacts whether the first line is counted when inferring the number of columns to remove, I think ignoreFirst should do nothing when columns!=0. It is admittedly a little confusing because a user might expect ignoreFirst to mean "don't de-indent the first line". Perhaps we want to emit a warning or error if a user changes ignoreFirst from its default value and also sets columns != 0.

(Q2) How should tab characters be treated?

Python's dedent does the following:

Note that tabs and spaces are both treated as whitespace, but they are not equal: the lines " hello" and "\thello" are considered to have no common leading whitespace.

I toyed around with this behavior in Python's dedent, and agree with their take on this. This seems much more reasonable than trying to assign a whitespace value to tabs, e.g. 1 tab = 2 whitespaces.

bradcray commented 4 years ago

I would've expected this case:

{
const s = """ 1
           2
           3""".dedent(columns=10);

writeln(s);

to have given

 1
2
3

due to the columns=10 (which I interpret as "get rid of (up to) 10 columns of initial spaces).

For this:

// Lines with only whitespace are ignored and treated as a single newline
// character. Note there is trailing whitespace in this example:

is there precedent for this? It seems odd to me. Like, why would I have put a blank line with 10 spaces if I didn't want a blank line in there?

How should ignoreFirst=false and columns!=0 interact?

I think of ignoreFirst as meaning "do no processing of the first line", so I'd expect ignoreFirst=true and columns = 3 for " hi" to give " hi" and ignoreFirst=false and columns = 3 to give "hi".

(Q2) How should tab characters be treated?

I don't disagree with your conclusion, but it gives me some pause as to whether "columns" is the right choice for the argument anymore...

ben-albrecht commented 4 years ago

due to the columns=10 (which I interpret as "get rid of (up to) 10 columns of initial spaces).

If we take your interpretation of ignoreFirst below, wouldn't the current output be correct? That is, the first line should not be processed by default, and therefore no indentation should be removed. (I realize I've contradicted myself in this example, and that's a think-o on my part).

is there precedent for this? It seems odd to me. Like, why would I have put a blank line with 10 spaces if I didn't want a blank line in there?

This is how Python's dedent works. If these blank lines weren't ignored, should they interpreted as levels of indentation when inferring the level indentation to remove? I suspect Python chose to ignore these, to avoid this potential confusion for users.

I don't disagree with your conclusion, but it gives me some pause as to whether "columns" is the right choice for the argument anymore...

True. Throwing out a few alternative ideas:

margin
- this is what the python implementation calls the longest level of indentation, but that's not user-facing in any way
whitespaces
spaces
level
n

bradcray commented 4 years ago

wouldn't the current output be correct? That is, the first line should not be processed by default, and therefore no indentation should be removed. (I realize I've contradicted myself in this example, and that's a think-o on my part).

I ate up one space that I shouldn't have (and have edited my comment to restore it), but it seems like your output ignored not just the first, but the second and third lines as well (i.e., they are still intended 10 columns).

Maybe a related question: I would have expected:

"""a
b
 c
  d
   e
""".dedent(columns=5)

to give:

a
b
c
d
e

(i.e., that columns gives a maximum, not required number of whitespace characters to gobble up). But this may not agree with the "common leading whitespace" viewpoint of Python (because it might also suggest having columns=1 gobble up one space or one tab or ... Unless the "common leading whitespace" only fires when columns=0 and we're in "auto-detect" mode.

If these blank lines weren't ignored, should they interpreted as levels of indentation when inferring the level indentation to remove?

Doesn't the inference only occur based on the initial line? (in which case, only a blank initial line would affect the level of indentation which seems acceptable... or we could change ignoreFirst to be ignoreInitial where it would ignore the first line and any subsequent blank lines). Or does Python try to examine all the leading space of all the lines and make a "smart" choice about how much to dedent by?

ben-albrecht commented 4 years ago

Python try to examine all the leading space of all the lines and make a "smart" choice about how much to dedent by?

This is what Python's dedent does - find the minimum matching indentation (accounting for both tabs and spaces), and dedent by that much.

bradcray commented 4 years ago

Here's my mental algorithm (which may or may not be appropriate / match Python's):

treat the string as a series of lines, separated by \n
if columns=0
- find the "model" line
- if ignoreFirst/ignoreInitial=false, this will be the first line
- if ignoreFirst=true, this will be the the second line
- if we use ignoreInitial=true, this will be the first non-blank line after the first
- count the number of leading whitespace characters in the model line and use this as the columns value
- (alternatively, we could: capture the leading whitespace pattern up to the first non-whitespace or \n character, and use this as the whitespace-to-be-stripped pattern, but this seems more complex to me)
now, we have a non-zero columns value
- if ignoreFirst/ignoreInitial=true, skip over the first / initial all-whitespace lines
- if ignoreFirst/ignoreInitial=false, use all lines
- for each line defined by the above, remove the first columns whitespace characters from it, if they exist (if they don't, stop early)

bradcray commented 4 years ago

This is what Python's dedent does - find the minimum matching indentation (accounting for both tabs and spaces), and dedent by that much.

Ah, OK. In that case, my "find the model line" step above would need to be "find the set of model lines" (defined as the model line and all subsequent lines) and the "count the number of leading whitespace characters" step would need to be a "find the minimum number of leading whitespace characters across all these lines" step.

Given that definition, I think I agree that an all-whitespace line wouldn't contribute to the minimum leading whitespace computation, but I still don't think it should be dropped from the output; I think its \n should be preserved, at the very least.

e-kayrakli commented 4 years ago

Given that definition, I think I agree that an all-whitespace line wouldn't contribute to the minimum leading whitespace computation, but I still don't think it should be dropped from the output; I think its \n should be preserved, at the very least.

I second this notion. Removing those newlines seems to defeat the purpose that dedent should make multiline string formatting easier.

bradcray commented 4 years ago

The more I think about it, the more I think the "scan all candidate lines" approach is best to handle cases like:

   a
  b
c
  d
    e

So here's an updated mental algorithm:

treat the string as a series of lines, separated by \n
if columns=0
- find the set of model lines
- if ignoreFirst=false, this will be all lines containing non-whitespace characters
- if ignoreFirst=true, this will be all lines other than the first containing non-whitespace characters
- set columns to the minimum number of leading whitespace characters in any of the model lines
given our non-zero columns value:
- find the set of lines to modify
- if ignoreFirst=true, this is all lines but the first
- if ignoreFirst=false, this is all lines
- for each line to modify, remove the first columns whitespace characters from it, stopping at the first \n or non-whitespace character

An implication of this is that:

"""a
  b
\t\tc
\t d
 \te
""".dedent()

would produce:

a
b
c
d
e

(i.e., tabs and spaces would be treated interchangeably). I don't have any insights into how to treat them differently and reasonably / portably.

ben-albrecht commented 4 years ago

Given that definition, I think I agree that an all-whitespace line wouldn't contribute to the minimum leading whitespace computation, but I still don't think it should be dropped from the output; I think its \n should be preserved, at the very least.

I second this notion. Removing those newlines seems to defeat the purpose that dedent should make multiline string formatting easier

I think we've come back to the original proposed behavior, which matches that of python: lines with only whitespace are treated as a single newline and ignored w.r.t. determining level of indentation.

ben-albrecht commented 4 years ago

(i.e., tabs and spaces would be treated interchangeably). I don't have any insights into how to treat them differently and reasonably / portably.

In the Python implementation, they don't treat tabs and spaces interchangeably. They find the longest common string of tabs and spaces and use that as the string to remove from each line. I think I prefer this behavior, because treating '\t' and ' ' interchangeably is essentially treating tabs as a single whitespace.

If we went with that behavior, your example above would have no common string of tabs and spaces so the dedent would essentially be a no-op:

"""a
  b
\t\tc
\t d
 \te
""".dedent()

would produce:

a
  b
\t\tc
\t d
 \te

bradcray commented 4 years ago

I think we've come back to the original proposed behavior

I disagree, or else your example has a bug. My concern wasn't that a whitespace-only line would be ignored when determining the dedent level so much as that your output dropped the whitespace-only line from the output (i.e., the blank line between 1 and 2 went away in the output; this seems extremely counterintuitive and undesirable to me).

ben-albrecht commented 4 years ago

or else your example has a bug

Looking back at it - my example had a bug. The blank line between 1 and 2 should still be there. Sorry for the confusion!

Let me put these examples into a Gist that I will keep updated as the discussion evolves.

bradcray commented 4 years ago

Let me put these examples into a Gist that I will keep updated as the discussion evolves.

Or put them in tests/futures as a PR since that would be easy to comment on.

ben-albrecht commented 4 years ago

Created a draft PR here for discussion: https://github.com/chapel-lang/chapel/pull/15664

See if you agree with those cases (I'm sure I've made a mistake somewhere 😄 ), or if there are other cases I should include.

ben-albrecht commented 3 years ago

It should live in String.chpl. Python creates a TextWrap module for this functionality, but most of the functions from that module feel like things that belong in a package module / mason package in the Chapel world. I would argue trim() / dedent() belongs in the standard library though, because multi-line strings are significantly less useful without them (see motivation above).

After implementing this in a standalone file and migrating it to the String module, I noticed (1) the String module does not directly use the Regexp module anywhere else and (2) using the Regexp module in the String module introduces a bunch of errors in ByteBufferHelpers (likely due to the module-level code of Regexp).

Before working through the Regexp use errors, I wanted to raise the question: Is it OK for a String method to utilize Regexp? If not, should we consider putting this in a different module or try to develop a non-regexp implementation?

The regular expressions used are:

      // Replace all lines containing only white space (spaces or tabs) with empty string
      const reWhitespaceOnly = compile('''(?m:^[ \t]+$)''');
      var text = reWhitespaceOnly.sub('', this);

      // Match all indentations (leading white space)
      const reLeadingWhitespace = compile('''(?m:^[ \t]+)''');
      var indents = reLeadingWhitespace.matches(text);

bradcray commented 3 years ago

Is it OK for a String method to utilize Regexp?

I've generally been trying to fight against the tide of having "hello world" compile everything (and work towards reversing that trend), so having String.chpl (which is included by default, right?) automatically bring in Regexp seems like a step in the wrong direction to me. (Also, isn't Regexp something that isn't guaranteed to be available in all builds of Chapel? So it seems like things might break badly in a CHPL_REGEXP=none build?).

I guess my next questions would be:

is implementing this without using Regexp particularly difficult? (I haven't thought about it in awhile)?
would it be weird to have it in a separate library module rather than being auto-available? (e.g., StringUtil.chpl, not that I like that name)

I'll also be curious what @e-kayrakli thinks about this when he's back in the office, of course.

ben-albrecht commented 3 years ago

I've generally been trying to fight against the tide of having "hello world" compile everything (and work towards reversing that trend), so having String.chpl (which is included by default, right?)

Agreed.

is implementing this without using Regexp particularly difficult? (I haven't thought about it in awhile)?

Not terribly difficult AFAICT, but probably not nearly as computational efficient as Regexp without some optimization effort - which may be acceptable for now.

would it be weird to have it in a separate library module rather than being auto-available? (e.g., StringUtil.chpl, not that I like that name)

I'd be open to this idea, if we intended to add more features to this module in the future (@e-kayrakli mentioned there was plans for adding other string utils). Maybe this could even be a String submodule?

I'll also be curious what @e-kayrakli thinks about this when he's back in the office, of course.

Agreed.

bradcray commented 3 years ago

but probably not nearly as computational efficient as Regexp without some optimization effort - which may be acceptable for now.

I wouldn't think that this is a routine whose performance we'd need to worry about too much. That said, it also seems plausible to me that the simple (and typically short) patterns that I think we care most about for this routine may actually be faster to do without leveraging a general-purpose regular expression evaluator (?).

Maybe this could even be a String submodule?

I don't think that would fix the dependence issue, as I don't think we have any form of conditional compilation for sub-modules. Also, since the user isn't meant to have any notion that there is a String module (since it's part of the language), I'm not sure how we'd present such a submodule to them.

Based on this exchange, I think I'd implement it without regexp and just keep it in String.chpl for simplicity, but I think Engin's back tomorrow?

ben-albrecht commented 3 years ago

Based on this exchange, I think I'd implement it without regexp and just keep it in String.chpl for simplicity, but I think Engin's back tomorrow?

👍 Already working on it. We'll see what @e-kayrakli says when he's back. I can do some sanity check to see how comparable performance is with Regexp too.

Thanks for the input.

e-kayrakli commented 3 years ago

I've generally been trying to fight against the tide of having "hello world" compile everything (and work towards reversing that trend), so having String.chpl (which is included by default, right?) automatically bring in Regexp seems like a step in the wrong direction to me. (Also, isn't Regexp something that isn't guaranteed to be available in all builds of Chapel? So it seems like things might break badly in a CHPL_REGEXP=none build?).

Completely agree with this. I am against using standard modules in internals and against using package modules in standard modules as much as possible philosophically. But using Regexp is even worse because of its dependency.

I still think that this function should reside outside of the String module. Glancing over the past discussion here, I think we can put it in a StringUtils module or something. Having something like that in a Standard module can take some more time and thinking, so, maybe add this in a package module StringUtils, then we think more about what to add and move that to the standard modules? I'd be fine to just adding it as an unstable standard module, too.

(Very mildly related: A similar discussion about string utilities came up as part of adding a string.findAll: https://github.com/chapel-lang/chapel/issues/15095. However, I think that particular function can be part of the string type)

ben-albrecht commented 3 years ago

I can do some sanity check to see how comparable performance is with Regexp too.

Here are some crudely collected timings (100k trials on a small test set) of some regexp-free implementations I threw together. There's definitely room for improvement, but I'd be OK proceeding with the less performant regexp-free implementations for now.

_	regexp	no regexp
whitespace removal	0.131741	0.24877
leading whitespace matching	0.225953	0.888304

I still think that this function should reside outside of the String module.

@e-kayrakli - earlier you said:

String.chpl is fine in the beginning, but I think this is borderline "package" functionality. If there is a StringUtil in the future as a mason package, I'd be for moving it there.

which made me think we had a consensus on putting it inString.chpl for now. Do you oppose putting it in String, even if the implementation doesn't rely on regexp?

I am OK not putting it in String.chpl. I do think it should be available by default (via standard library or package module) because the language feature of multiline strings is significantly more useful with this feature.

e-kayrakli commented 3 years ago

which made me think we had a consensus on putting it inString.chpl for now. Do you oppose putting it in String, even if the implementation doesn't rely on regexp?

I am OK not putting it in String.chpl. I do think it should be available by default (via standard library or package module) because the language feature of multiline strings is significantly more useful with this feature.

Right. I do not think much more differently at a high level, however, putting it in String worries me more than before because it would be a breaking change if we were to remove it later. Whether we see the interface in String.chpl as part of language (relatively stable) or standard library (relatively unstable) is a factor here.

If we put it in String.chpl with this understanding at least, I'd be more comfortable. And I don't know what that means exactly... a warning with --warn-unstable? A warning in docs? Or just a consensus that we want it to be in String.chpl for good? (If we are all for the last option, I don't feel strongly enough to argue further against it myself :) )

bradcray commented 3 years ago

I feel generally inclined to make this "standard" / auto-available with strings since:

we don't have a pre-existing StringUtils module
this is a method on a string
it feels pretty useful/crucial for certain patterns of triple-quoted strings which are part of the language

I'm open to arguments that it's somehow different from other string methods we currently support by default (e.g., startsWith(), endsWith(), strip()), I'm just not able to formulate those arguments myself.

e-kayrakli commented 3 years ago

Maybe my reluctance is coming from not using multiline strings that much. My concerns are a bit philosophical rather than concrete.

I see it different from strip, because it feels more like an advanced text processing (remove some whitespaces at some parts of the string but not others) rather than a fundamental operation that we may use for every string.

Another perspective that makes me question the need for this in String.chpl is because this is a functionality that would allow us to adjust whitespace while writing multiline strings which is a cosmetic concern that should be external to the implementation of a primitive type in a whitespace-agnostic language.

ben-albrecht commented 3 years ago

@e-kayrakli && @bradcray -- you both make good points.

Ultimately, what's important to me is that dedent is available in standard/package modules.

Some options we could consider in order to move forward for now:

1) Add dedent() to String.chpl, but make it emit a warning for --warn-unstable until we have had more time to live with it and determine if it's a good fit or not. 2) Use this as an opportunity to start a StringUtils (or whatever we want to call it) package module that contains this secondary string method. Maybe this gets inducted into the standard library in the future.

Any opinions or other options worth considering?

bradcray commented 3 years ago

Option 1 seems inherently less impactful to me in that we warn of something known to be unstable (like other such things) and don't have the overhead of creating a new one-routine module and documentation for it that might go away later.

ben-albrecht commented 3 years ago

OK, proceeding with option 1 for now. Thanks for the input.

ben-albrecht commented 3 years ago

Here are a few design elements that came up while implementing, and the answers I chose for the current implementation:

Should dedent(columns > 0) consider tabs as whitespace?
- - current implementation: No, because a tab is ambiguous in terms of how many characters (columns) it would render
Should dedent(columns > 0) leave lines with only whitespace alone?
- current implementation: Yes, for consistency with columns == 0 behavior

Also, while writing tests, I realized an advantage of Python's decision to replace whitespace-only lines with empty strings. Putting the closing triple-quote on a new line and keeping it aligned with the opening triple quote will generate some trailing whitespace in the output if we leave whitespace-only lines untouched, e.g.

var s = """
            a
            b
            c
        """.dedent();
writeln(s);

output:


a
b
c
        # <-- trailing whitespace here

Not a huge deal, but slightly annoying IMO.

Interested in thoughts here or on the PR (https://github.com/chapel-lang/chapel/pull/15664).

bradcray commented 3 years ago

if we leave whitespace-only lines untouched

To clarify, are we always leaving them untouched, or are we only leaving them untouched if they don't have the same leading prefix that we're stripping from all other lines? That is, in your example:

var s = """
            a
            b
            c
        """.dedent();
writeln(s);

If that last blank line had been:

            """.dedent();

I would've expected the prefix to be stripped off like any other line since it shares the same prefix. I tried scanning this issue to refresh my memory of what led us here, but didn't anything suggesting ignoring them completely (but was looking quickly).

ben-albrecht commented 3 years ago

I tried scanning this issue to refresh my memory of what led us here, but didn't anything suggesting ignoring them completely (but was looking quickly).

Looking back, I think I confused your request for not doing what Python does as leave them untouched completely. I think what you're actually suggesting makes a lot more sense. I can switch the behavior pretty easily.

Also, that answers question (2) above:

Should dedent(columns > 0) leave lines with only whitespace alone?

No, dedent(columns > 0) should remove up to columns spaces from lines containing only whitespace.

bradcray commented 3 years ago

OK, cool. Though note that (I think?) your most recent example would still be unchanged even under that interpretation (right?) in that we'd choose the a, b, c lines' whitespace as the amount to strip but the final line doesn't have that much (?). Or would we choose the final line's whitespace as the amount to strip which would leave some space remaining before the a, b, c lines?

If that's right then it makes me wonder "What is the normal way of indenting a """ that's on its own line?" If you're in a Python-savvy editor, does it do any auto-indentation of a closing triple quote? (emacs is either not indenting it or I've got some other problem in my sample that's throwing it off...)

ben-albrecht commented 3 years ago

OK, cool. Though note that (I think?) your most recent example would still be unchanged even under that interpretation (right?) in that we'd choose the a, b, c lines' whitespace as the amount to strip but the final line doesn't have that much (?).

Yeah that's right. 😞

Or would we choose the final line's whitespace as the amount to strip which would leave some space remaining before the a, b, c lines?

I don't think so. I think this would introduce more problems / annoyances.

If that's right then it makes me wonder "What is the normal way of indenting a """ that's on its own line?" If you're in a Python-savvy editor, does it do any auto-indentation of a closing triple quote? (emacs is either not indenting it or I've got some other problem in my sample that's throwing it off...)

I can't find a good source on the preferred style, and I don't see any fancy tricks from any editors I have tried. However, I will note that doc strings are supposed to close the triple quote on a newline:

def complex(real=0.0, imag=0.0):
    """Form a complex number.

    Keyword arguments:
    real -- the real part (default 0.0)
    imag -- the imaginary part (default 0.0)
    """

source: PEP 257

I think it'd be nice to support both styles in Chapel.

At the cost of a little more complexity in the rules, we could find the matching common leading whitespace (margin) across all non-empty lines, and then remove that margin from all lines (still respecting ignoreFirst argument though), including empty lines. For empty lines that do not have the full margin or don't entirely match due to tab/space combinations, we could only remove the matching portion of margin.

var s1 = """
  a // <-- 2 leading spaces
 // <-- 1 leading space, 1 space will get removed
""";

var s2 = """
  a # <-- 2 leading spaces
  # <--  2 leading spaces, 2 spaces will get removed
""";

var s3 = """
  a # <-- 2 leading spaces
   # <--  3 leading spaces, 2 spaces will get removed, 1 space will remain
""";

var s4 = """
  a # <-- 2 leading spaces
 \t# <--  1 leading space followed by 1 tab, 1 space will be removed and 1 tab will remain
""";

// All leading whitespace will be removed from this example now:
var s5 = """
               a
               b
               """;

// All leading whitespace will be removed from this example now:
var s6 = """
         a
         b
         """;

bradcray commented 3 years ago

I'm open to that proposal. Though your latest Python example also gives me less concern than your previous one since the amount of blank space in the final blank line is the same as on the previous ones (i.e., as far as I'm concerned, we could also stick to the simpler interpretation of only eating the whitespace if it's a full prefix match).

chapel-lang / chapel

Add library functions for stripping leading whitespace from strings #8281