dart-lang / language

Design of the Dart language
Other
2.65k stars 202 forks source link

Deprecate and remove adjacent strings #1982

Open munificent opened 2 years ago

munificent commented 2 years ago

Early in Dart's history, the + operator was removed on strings. You could not write:

var hello = 'Hello';
print(hello + 'world!');

The argument at the time was that in other languages, users would build up long strings using a series of + concatenations which could involve allocating many many intermediate strings and was much less efficient than using StringBuffer. To prevent users from accidentally writing this suboptimal code, the core library outright forbid concatenating strings with + at all.

This presented a problem. Sometimes, a string literal does not fit in a single line. Without +, there is no easy way to break that literal into multiple lines:

// Long line. :(
var someQuiteLongVariableName = 'This string literal contains no newlines but does not fit on one line.';

To alleviate that, Dart added support for adjacent strings. Like C/C++ (well, really, the C preprocessor that both share), you are allowed to have multiple string literals one after each other. When you do, the language implicitly concatenates them at compile-time into a single string literal.

Some time later, the user-hostile removal of + was reverted and you could use + on strings again like almost every other language. That left two ways to concatenate string literals. Since adjacent strings are shorter, idiomatic Dart code continues to use those.

However, adjacent strings are prone to subtle errors. Consider:

var fruits = [
  'apple',
  'banana'
  'cherry',
  'date'
];

Spot the bug? The compiler won't. This is perfectly valid Dart code. It's just valid Dart code that happens to think "bananacherry" is some kind of exotic fruit. When adjacent strings appear in argument lists, the Dart formatter tries to insert indentation to make bugs like more obvious, but it's still a footgun.

It's also a redundant feature now that we have + one strings. (And any reasonable Dart implementation can evaluate the + at compile time when it sees that the two operands are string literals.)

We are considering simplifying the language and cleaning up some cruft in a future Dart 3.0 release. I propose that we:

  1. Change the style guide to recommend avoiding adjacent strings.
  2. Add support in dart fix to automatically insert + between adjacent strings.
  3. Some time later, mark them deprecated.
  4. Remove adjacent strings in Dart 3.0.
jakemac53 commented 2 years ago

It's also a redundant feature now that we have + one strings. (And any reasonable Dart implementation can evaluate the + at compile time when it sees that the two operands are string literals.)

The primary justifications I have seen for using adjacent strings over + is performance. So if we can remove that as a barrier, this proposal sounds like a win to me. But we should ensure this can consistently be done without affecting any semantics. That includes when strings contain interpolations etc (I can't think of an issue myself though).

I worry slightly that people will not understand that + is actually performant in this case and still avoid using it or make false claims about code being slow in reviews, etc. Or they might think it is efficient for more cases than just concatenating string literals.

munificent commented 2 years ago

The primary justifications I have seen for using adjacent strings over + is performance.

Even a braindead compiler should be able to recognize that a + with literals on both sides is exactly semantically equivalent to how adjacent strings behave today. If we can't implement that right, I'll buy you a beverage of your choice. :)

mateusfccp commented 2 years ago

Particularly, I always use adjacent Strings instead of +.

The first reason is simply that it's cleaner, and I never heard of bugs cause by them. Although it's clear that they can happen, I don't think they are frequent enough to justify it's removal.

The second reason is that +, IMO, gives the impression that I am adding something, instead of concatenating.

A very common case where I use adjacent Strings is as the following example:

// Consider that var1, var2 and var3 are non-Strings
final s = '$var1' '$var2' '$var3';
final s2 = '$var1$var2$var3'; // I think it is a little confusing
final s3 = '$var1' + '$var2' + '$var3'; // More verbose and gives the idea that we are adding string, and not concatenating.
final s4 = var1.toString() + var2.toString() + var3.toString(); // Some coworkers prefer this one, for me it's the worse
lrhn commented 2 years ago

The String.+ was actually removed, not for performance, but to avoid a Java puzzler. At that point it took Object as argument and did .toString() on it, like in Java and JavaScript. String juxtaposition was introduced instead of the removed +. The puzzler was basically:

 print("2 + 2 = " + 
         2 + 2);

which printed 2 + 2 = 22 (still does in Java and JavaScript). We then, much later, after much lobbying, re-added String.+ taking a String as argument and being perfectly safe.

The place where I use adjacent string literals is for strings spanning more than one line, or where I need to change quotes between different parts of the string (mainly between raw and interpolated strings).

I can definitely use + for those.

(If we allowed automatically removing shared indentation from multiline strings, it would be even less common to need adjacent strings).

@mateusfccp The '$var1$var2$var3' is the idiomatic and recommended way. No other approach would get past a code review for me.

munificent commented 2 years ago

A related data point: an automated analysis of Python repositories found several cases of missing commas leading unintended concatenation of adjacent strings. Python did consider removing support for adjacent strings but ultimately decided to leave them in.

mit-mit commented 2 years ago

@pq do we have a lint for this? We recently triaged prefer_interpolation_to_compose_strings, but that is for combining with plus, not just adjacency, I think?

pq commented 2 years ago

We have a few lints in the vicinity:

no_adjacent_strings_in_list

and

prefer_adjacent_string_concatenation

dave26199 commented 2 years ago

It just occurred to me that this could be very expensive to land.

Today, long strings are broken by hand so they wrap to 80 cols; we don't have a tool for it, it's one of the few manual formatting tasks left. This change would force that manual work to happen again in any case where it causes the strings to exceed 80 cols. So for example

  test(
      'this is a very long test description, which does '
      'happen in practice as people like to describe their '
      'tests!', () {
    // Test goes here.
  });

becomes

  test(
      'this is a very long test description, which does ' +
          'happen in practice as people like to describe their ' +
          'tests!', () {
    // Test goes here.
  });

which needs manually fixing to (approximately, I didn't actually count cols):

  test(
      'this is a very long test description, which does ' +
          'happen in practice as people like to describe ' +
          'their tests!', () {
    // Test goes here.
  });

--even if dartfmt is changed to not ident the continuations, there are still 2 extra chars per line; given average word length ~5 chars, +1 for spaces, that's about a 1/3 chance that each wrapped-at-80-cols-with-word-break line now exceeds 80 cols and needs manually rewrapping.

--> huge amounts of churn. I think I would struggle after fixing one or two long test files which use long descriptions.

tl;dr: I don't think this is feasible to change before dartfmt handles joining and splitting string constants to reflow to 80 cols; and if we add that to dartfmt, then actually we don't really need to do anything else, it's no longer the user's problem :)

FWIW, I also like the way it is today; manually splitting strings is painful, I appreciate the extra 2 chars so I have to do it slightly less. I guess I would not care either way if the tooling did it for me.

Thanks.

rakudrama commented 2 years ago

I think the proposed cure of removing adjacent strings is worse than the problems it is trying to solve.

We already have lints that are effective at preventing some of the pitfalls like the exoticfruit example. The programmers UI could also help in various ways, e.g. hovering highlights the whole compound string or there is a little (+) to show the string continues on the next line.

What has not been discussed above is that long strings that are broken into several adjacent strings often contain interpolated values. The proposal would force a combination of + and interpolation where some of the fragments no longer need interpolation, inviting bike-shedding and making reflowing harder.

Lets say you have

//---------------------|
var longname = '$foo.$bar';

but it is too long so you break the string:

//---------------------|
var longname = '$foo.'
    '$bar';

If you are forced to use + the extra space for the operator forces you to move the '.':

//---------------------|
var longname = '$foo' +
    '.$bar';

Now the reviewer (human or robot) points out that foo is a string and you should instead write

var longname = foo +
    '.$bar';

Leaving '$foo' is confusing, they say, because it invites the question of why a singleton interpolation is necessary.

One of the pleasant things about interpolation is that it starts with a quote, so you know without thinking that the result is a string, and that has been lost.

VM generates a worse result for foo + '.$bar' than it does for '$foo.$bar'. This is not insurmountable, but part of the cost of removing adjacent strings would be improving the compilers to be smarter about mixed concatenation and interpolation.

@lrhn I have always thought that this puzzler was disingenuous since it relies on deliberately misleading formatting where there is more whitespace at operators that bind more tightly. dart format would not produces output that is misleading in this way, it would produce something like the following:

print('2 + 2 = ' +
    2 +
    2);
lrhn commented 2 years ago

The Java puzzler was probably based on real issues for Java programmers. That was a long time ago, and I'm not sure anyone doing manual formatting would put the 2 + 2 on two lines. Still, it's only a problem because Java's String + operator coerces the argument to string. Pure string addition is associative, integer addition too, but mixing them with implicit string coercion wasn't. So, forget the puzzler, Dart's current String.+ is not affected by it.

The VM shouldn't need to be less efficient doing foo + '$bar' than '$foo.$bar'. Well, at least not when we remove unsound null safety. Until then, the former needs a null check, and the latter needs a .toString() call. That makes a difference (can't say which one should be more efficient, though, so it's probably just a lack of optimization).

rrousselGit commented 2 years ago

However, adjacent strings are prone to subtle errors. Consider:

var fruits = [
  'apple',
  'banana'
  'cherry',
  'date'
];

About this code, I think one important thing to consider is, it isn't formatted correctly. A CI that checks dart format would fail with this code.

Instead the correctly formatted code is:

var fruits = [
  'apple',
  'banana'
      'cherry',
  'date'
];

This makes the issue a lot more explicit

edit: seems like this was mentioned and I missed it. I would argue that this is a pretty important point though

Is there a benefit other than "avoiding potential mistake" for this? Because if formatting the code makes the issue obvious, I don't like this proposal.

munificent commented 2 years ago

Is there a benefit other than "avoiding potential mistake" for this? Because if formatting the code makes the issue obvious, I don't like this proposal.

It would potentially make it easier to make semicolons optional.

Hixie commented 9 months ago

FWIW, while I agree that concatenated strings in a list are a footgun, in every other situation, I find the extra indenting of the second line to be quite unfortunate (it makes it hard to see what you're actually going to be printing), and I find the + to be less clean than the current style (plus the issues @rakudrama mentions). With the existing lints, the concatenation in lists thing becomes a non-issue, except for https://github.com/dart-lang/linter/issues/4076#issuecomment-1433210394.

munificent commented 9 months ago

FWIW, while I agree that concatenated strings in a list are a footgun, in every other situation, I find the extra indenting of the second line to be quite unfortunate (it makes it hard to see what you're actually going to be printing),

The formatter has a handful of heuristics around when it does and doesn't indent subsequent lines with adjacent strings. It's trying to balance indenting them in places where otherwise it could mask a missing comma mistake while not indenting them elsewhere because as you say it's easier to read when they line up.

If you run into a context where you feel like the heuristic gets it wrong, I would appreciate a bug for it: https://github.com/dart-lang/dart_style/issues. It might be something fairly easy to tweak. (Though probably not until the rewrite is done.)

lrhn commented 9 months ago

An advantage of adjacent strings is that I trust that they are concatenated at compile-time.

If I saw "abc{$expr1}def" + "ghi${expr2}jkl", I wouldn't trust that the implementation would not create two strings and concatenate them at runtime. After all, those two string epxressions are not constant expressions, so the + is not a constant +. We may be able to promise that we can optimize this to not be visibly or performance-wise different from "abc{$expr1}defghi${expr2}jkl". (That is particularly relevant to https://github.com/dart-lang/sdk/issues/34223, where the specification states that each expression in an interpolation is evaluated and converted to a string in order, which means that the visible evaluation order of the "${e1}${e2}" is the same as "${e1}" + "${e2}" (eval e1->v1, v1.toString, eval e2->v2, v2.toString). However, the way the VM implements it the first expression is: eval e1->v1, eval e2->v2, v1.toString, v2.toString. What would the VM do for the + string?)

So, if we remove string juxtaposition and require you to write "stringLiteral '+' stringLiteral", then that should be specified as treating the addition as a single string expression for anything where that matters. And that's confusing, possibly fragile, because, "interpolation1" + "interpolation2" + variable has two different +s with two different meanings.

munificent commented 6 months ago

An advantage of adjacent strings is that I trust that they are concatenated at compile-time.

I guess so, I just don't think that's particularly valuable.

And that's confusing, possibly fragile, because, "interpolation1" + "interpolation2" + variable has two different +s with two different meanings.

By that same token, ["interpolation $constant", "interpolation $variable"] is confusing and possibly fragile because the $ is a constant evaluation in the first element but a runtime evaluation in the second.