JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Taking string concatenation seriously, or, a proposal to deprecate *, ^ for string concatenation and repetition #11030

Closed quinnj closed 7 years ago

quinnj commented 9 years ago

The frequency and vehemence of discussions around this subject beg for a change. * and ^ were introduced for strings back when the language wasn't as strict on operator punning and overall meaning.

As a first step, I propose we deprecate these two methods for string operations.

As a next discussion, we can talk about the possibility of using a different operator(s) for concatenation/repetition. Just using repeat, with no operator, has been suggested, as well as the following for string concatenation:

Things to consider:

ScottPJones commented 9 years ago

@stevengj The languages that I have used (that are also heavily used in string processing) tend not to have a separate character type (M[UMPS], Pick, JavaScript, Lua, Python, and many more...), so it never even comes up as an issue, and ones that do (Java, C++) handle concatenation of strings with characters and characters with characters just fine. I don't know what existing code you've been looking at, but in a large part of the code I dealt with (both internal code and code at 1000s of customers), concatenating characters with characters, and with strings, was done heavily (and specially optimized because of how frequently it was done) [and just 1 of those customers is responsible for around 54% of the medical records within the US].

This isn't a "pointless spelling change"; quite a lot of people have brought up a number of good objections to using * for string concatenation... if it didn't aggravate people seriously, you wouldn't see this conversation continuously come up. If you aren't using it that much, and don't particularly care that much, and the people who want to do string processing in Julia see this as a serious issue, why do you object so much?

I've never advocated changing * to something else based on taste, or because some other language does it a different way. My arguments have been about: confusability (people are more inclined to think of repetition... * means make multiples of something!), lack of consistency with vectors (which is an issue for people who do a lot of string processing, as we tend to think of strings as being vectors of characters), and issues with using Char consistently (Char * Char, which has been removed, but what about other numeric operators with Char? Sometimes Char acts like a UInt32, sometimes not...)

ScottPJones commented 9 years ago

Pick, for example, uses single-character "system delimiters", denoted by things like @RM, @VM, @FM, @SVM... (record, value, file, subvalue marks), so you'd see "Scott":@SVM:"Jones":@VM:1:@SVM:"Memorial":@SVM:"Drive", to build up a record, that in JSON would probably look like [["Scott","Jones"],["1","Memorial","Drive"]]. In Mumps, that would be done like this: "Scott"_$c(1)_"Jones"_$c(0)_1_$c(1)_"Memorial"_$c(1)_"Drive" (but in practice, you'd use a macro for the $c(0) and $c(1), such as $$$subdlm and $$$valdlm... these are the same as Char(0) and Char(1) in Julia). And yes, there are also a lot of places where it is concatenating multiple characters... in M[umps], there is the syntax $char(codepoint,...), so you can have $c(1,a,1,b,2,c)... (where a,b,c are evaluated as integers, i.e. code points).

ScottPJones commented 9 years ago

@simonbyrne I don't even understand why somebody said a * b * c * d should be O(n)... That's crazy! (but maybe it is that way in Julia, another indication that string handling simply has not been taken seriously)

stevengj commented 9 years ago

String-concatenation data from some of the most popular programming languages:

Conclusion: a string-concatenation infix operator is useful, and if we change to anything, it should be + — this is the dominant convention by far, though by no means universal. At least backwards/forwards compatibility will be easy to implement (two lines in Compat.jl). But it is a mere spelling change that adds zero functionality and saves zero characters of code, at the price of a lot of code churn, hence I question its utility.

Also, the data indicate that overloading the meaning of concatenation with an arithmetic operation is not a serious problem: people seem to easily get used to it, and continue to adopt it in new languages.

And the data indicate that syntax for character concatenation is not an important problem — many languages commonly used for string processing do not even have a character type, and many languages that do have a character type either do not have character+string or char+char concatenation, or if they do they don't bother to document it well. If you are doing a lot of character concatenation in inner loops you probably need more specialized code anyway, and if it is not in performance-critical code you can always use string or length-1 strings.

(In any case, since Char is no longer an Integer subtype in 0.4, if we want to we can always add a character concatenation operator at some future time. Or you can add this yourself in your own program if it is an important operation in some specialized use-case. The whole point of Julia's design is that "built-in" functions typically have no particular performance advantage over user code.)
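As a rough illustration of the "add this yourself" point, here is a minimal sketch of a user-defined concatenation operator. It assumes Julia 1.x, where `++` parses as an infix operator but has no methods in Base; the alias `StrOrChar` and the choice of `++` are hypothetical, picked only for this example, and nothing here is part of Julia itself.

```julia
# Hypothetical user-defined operator: `++` has no methods in Base.
# The alias below is an illustration-only name for "string or character".
const StrOrChar = Union{AbstractString,AbstractChar}

# A varargs method, so that chained uses such as a ++ b ++ c are handled
# whether they arrive as one call or as nested two-argument calls.
++(xs::StrOrChar...) = string(xs...)

greeting = "abc" ++ 'd' ++ "ef"   # mixes strings and a character
println(greeting)
```

Because the method simply forwards to `string`, it should cost about the same as calling `string` directly.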

stevengj commented 9 years ago

@ScottPJones, if it allocates a new contiguous string, a * b * c * d is necessarily Ω(n), where n is the total number of chars, because it needs to touch all of the data. You can only do better if it produces a "rope string" or similar data structure that doesn't actually copy the data, but that has its own drawbacks (subsequent operations may be slower). Defaulting to allocating a new contiguous string is hardly "crazy", and in fact it seems to be what every other language defaults to as well.

carlobaldassi commented 9 years ago

I don't think there's any issue at all with a * b * c, at least not any more than there is with string(a,b,c), since it's translated to that:

julia-0.4> @which "a" * "b" * "c"
*(s::AbstractString...) at string.jl:77

That said, I agree 100% with @stevengj so I'll leave it at that.
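A quick check of the equivalence described above, assuming Julia 1.x. It only compares the resulting values and says nothing about how either call is implemented; the variable names are illustrative.

```julia
a, b, c = "foo", "bar", "baz"

via_operator = a * b * c          # chained infix `*`
via_function = string(a, b, c)    # explicit call to `string`

println(via_operator == via_function)
```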

simonbyrne commented 9 years ago

That said, I agree 100% with @stevengj so I'll leave it at that.

:+1:

StefanKarpinski commented 9 years ago

In case it was unclear, while others may object to * on the grounds that it's unusual or unexpected for programmers coming from other languages, I am NOT concerned about that – this is not the reason I don't like *. I am only concerned that it is an abuse of the meaning of the * generic function. In other words, is string concatenation really a form of multiplication? Accordingly, I'm completely against using + for string concatenation, no matter how many languages may use it: this is a clear and flagrant abuse of the meaning of the + function – in no sense is string concatenation a form of addition.

ScottPJones commented 9 years ago

@stevengj That's your conclusion... I would strongly disagree that it should be +, if anything, for many of the same reasons that * is a problem in Julia, in that it creates a lot of problems when dealing with Chars and being able to act the same on strings and Vectors. Your data is incorrect though: + for Java works just fine with a character, and VB does also (go try them if you don't believe me! [and it is documented]). In Haskell, just do string++[ch] or [ch]++string. Pascal also works just fine (I can send you an example). So Python, Perl, Ruby, C++, Java, C#, JavaScript, VB, Fortran, Haskell, Pascal all have NO problem concatenating characters... (also, you see that languages that are really focused more on string processing usually don't have a separate character type, and just use length-1 strings, like the languages I worked on, where everything was a string...) You are left with Go & Objective C that make it a little bit harder... I think you are totally misinterpreting the "data"... it is not that character concatenation is not an important problem, it is simply that for the vast majority of languages, it's handled, works as expected, and so you don't get all the complaints that you are getting about Julia.

StefanKarpinski commented 9 years ago

Python, Perl and Ruby don't have a character type.

ScottPJones commented 9 years ago

@simonbyrne Sorry, I misread that... I was thinking about O(n^2), which is what I've been seeing in Julia lately... (because strings are immutable, and if you have a loop building up a string, it allocates a ton of memory, and spends a lot of time doing GC...)

StefanKarpinski commented 9 years ago

so you don't get all the complaints that you are getting about Julia.

This comes up on the bikeshed mailing lists every few months and sparks a mild discussion, usually just a few mild messages. I've worked with a lot of people doing day-to-day Julia programming, much of it with strings, and this literally never comes up.

ScottPJones commented 9 years ago

@StefanKarpinski That was part of my point, that languages that focus on string processing don't have a special character type, a character is simply a string of length 1... (or Haskell, where strings aren't really special, they are simply [char]).

StefanKarpinski commented 9 years ago

because strings are immutable, and if you have a loop building up a string, it allocates a ton of memory, and spends a lot of time doing GC

You don't want to be doing this. You should print to an IOBuffer object instead and then take the string at the end. This is similar to the StringBuilder pattern in Java.
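A minimal sketch of the IOBuffer pattern, assuming Julia 1.x, where the accumulated bytes are extracted with `String(take!(io))` (the 0.4-era spelling in use when this thread was written was different). The function name `build_with_buffer` and the sample input are hypothetical.

```julia
# Build a string by printing each piece into an in-memory buffer,
# then convert the buffer contents to a String once at the end.
function build_with_buffer(pieces)
    io = IOBuffer()
    for p in pieces
        print(io, p)            # appends the printed form of p to the buffer
    end
    return String(take!(io))    # take! returns the bytes and resets the buffer
end

result = build_with_buffer(["alpha", '-', "beta", '-', 42])
println(result)
```

The buffer grows in place, so the loop avoids allocating a new intermediate string on every iteration.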

FrancoisFayard commented 9 years ago

Stefan might end up like all the C++ leaders: they claim loudly that using unsigned integers for std::vector subscripting was a major mistake they made, but most C++ programmers still believe that it's what makes C++ so nice. Fame is coming to Julia ;-)

I vote for ++ by the way.

stevengj commented 9 years ago

@ScottPJones, it is poorly documented then, if you can't find documentation in 10 minutes of Googling on "concatenate character language X". (In general, if you search for "concatenate strings" vs. "concatenate characters" it is immediately obvious which one people care about more.)

ScottPJones commented 9 years ago

@stevengj Umm... for people who care about it, they generally just treat everything as strings, so you won't see "concatenate characters" come up on a search. Doesn't mean that it isn't important, or heavily done... (by that argument, multiply is by definition commutative, as that's what a Google search will tell you!)

StefanKarpinski commented 9 years ago

It's a bit of a hack but to allow Compat to handle ++ it could look for this kind of AST pattern:

julia> :('x'++'y') |> dump
Expr
  head: Symbol call
  args: Array(Any,(3,))
    1: Symbol +
    2: Char x
    3: Expr
      head: Symbol call
      args: Array(Any,(2,))
        1: Symbol +
        2: Char y
      typ: Any
  typ: Any

The .. operator is also available. No one seems to like juxtaposition / "" for concatenation.

jiahao commented 9 years ago

[Image: screenshot, 2015-04-28]

stevengj commented 9 years ago

@StefanKarpinski, if you aren't worried about the question of familiarity, I question the philosophical purism of the "+ and * are only for arithmetic" viewpoint. This is a language question, hence a question of convention, not correctness, and the vast majority of computer languages use an arithmetic symbol for string concatenation with no apparent distress. Human beings are used to this.

JeffBezanson commented 9 years ago

The character discussion is a tangent. If we had a string concatenation operator, I absolutely agree it should handle characters too.

ScottPJones commented 9 years ago

You don't want to be doing this. You should print to an IOBuffer object instead and then take the string at the end. This is similar to the StringBuilder pattern in Java.

@StefanKarpinski That's just what you have to do in Java or Julia for performance, because of the immutable strings... it doesn't mean that it is easy to use, or that people would understand at first just why Julia is so slow compared to Python doing something like building up a string... (I don't like Java for string processing either, for that reason)

stevengj commented 9 years ago

@StefanKarpinski, Compat cannot look for that AST pattern, because then it will screw up x + +y expressions where x and y are numbers. Granted, that doesn't come up very often, but I'd hate to see @compat perform a transformation that potentially produces incorrect code.

JeffBezanson commented 9 years ago

Aren't python strings also immutable?

FrancoisFayard commented 9 years ago

@ScottPJones I have been teaching math for years. Google says stupid things.

Plus + : In maths, the convention is that + is always a commutative operator.
Times * : In maths, the convention is that * can be either commutative (as with numbers) or non-commutative (as with matrices). That's why we have commutative and non-commutative rings.

Now, if we go back to this kind of argument, you should stop using + to add floating points because + is not associative with floating points. And + is always associative in mathematics. This is just to show you that this "non-commutative" argument to prevent using + to concatenate strings does not hold. I tend to prefer ++ but I think that algebra arguments should not enter this game.

pygy commented 9 years ago

No one seems to like juxtaposition / "" for concatenation.

White space is already overloaded. It would conflict with macro/hcat contexts. Which brings me to a point I raised on the ML:

Julia already has two concatenating operators, namely h- and vcat. Why not use hcat for string concatenation, MATLAB-style? Does it make any sense to build a matrix of strings?

Whatever it turns out to be (.., ++, [ ... ]), I'm in favour of an explicit, distinct concatenation operator for strings and chars.

mbauman commented 9 years ago

We could add ++ as an operator to 0.4, and then make the deprecation occur once we're on 0.5-dev. Is there a real hurry here?

vtjnash commented 9 years ago

Fortran: // (chars and length-1 strings are treated interchangeably)

seems like a good "rational" option for Julia[1].

I would expect that building a string with */string is roughly O(n*m) in the number of strings being joined (n) and the total number of characters being joined (m). Some sort of string builder object (cStringIO in python, StringBuilder/StringBuffer in Java, IOBuffer in Julia) is essential for good performance when building anything large.


[1] for context:

> typeof(1//2)
Rational

stevengj commented 9 years ago

The purist argument here reminds me of the .+ debacle for array+scalar. When philosophical purism collides with linguistic convention and practicality, purism loses.

JeffBezanson commented 9 years ago

Ah, good discussion of O(n^2) string building and such re: python: http://stackoverflow.com/questions/4435169/good-way-to-append-to-a-string

I like the solution of building a list or array and then calling join if necessary.

I know people build strings all the time, but it seems like something to avoid. If it's for I/O (usually the case) you will obviously get even better performance doing the I/O directly, rather than building a string first and then sending it out. Even things like cPython's optimized append are only amortized O(n), and you're likely to give up a factor of 2 or so.
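The collect-then-join approach can be sketched as follows, assuming Julia 1.x; the names `words`, `joined`, and `with_sep` are illustrative only.

```julia
# Collect the pieces in a vector first, then concatenate them in one pass.
words = String[]
for i in 1:5
    push!(words, string("item", i))   # push! grows the vector in place
end

joined   = join(words)          # no separator
with_sep = join(words, ", ")    # separator between elements

println(joined)
println(with_sep)
```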

ScottPJones commented 9 years ago

@mbauman That seems reasonable to me, at least.

jdlangs commented 9 years ago

+1 to @mbauman 's proposal. Everyone seems fine with ++ and having a generic sequence concat operator is quite appealing.

Assuming ++ is being introduced, I think @stevengj 's points on Compat pretty conclusively indicate * can't be deprecated in 0.4.

ScottPJones commented 9 years ago

@vtjnash // has the same problem of already having a meaning in Julia as a binary operator... something that I think it would be nice to avoid...

JeffBezanson commented 9 years ago

Since we're designing by mailing list complaints, keep in mind we haven't yet heard from all those who expect ++ to be an increment operator.

ScottPJones commented 9 years ago

I don't think I ever said that * for concat had to be deprecated immediately, just that it should be, sometime after .. or ++ or whatever is introduced

@JeffBezanson That's why I preferred the Lua .., but it seems most people would rather see ++. I also think that it's not a big deal, because one is unary, and the other binary. (just like DataFrames overloaded ~)

ScottPJones commented 9 years ago

@JeffBezanson When I ran a simple string building test on Python and Julia, right after I first downloaded Julia last month, I saw that it was >800x slower than Python, and made the mistaken assumption that it meant that strings were mutable.. In Mumps, mutable vs. immutable never would come up, because there are no references to things, you just have values stored in associative arrays, in memory or on disk, nothing else. I did the same sorts of optimizations that it seems CPython has done (with reference counts on large strings, copy on write under the hood).

stevengj commented 9 years ago

@ScottPJones, making S=""; for s in list; S*=s; end an O(n) operation for immutable strings rather than O(n^2) is quite different from the discussion here; can I make a plea for focus?

ScottPJones commented 9 years ago

@stevengj It was @timholy and @johnmyleswhite who brought up O(n).. and I unfortunately misread that as O(n^2), which is what I'd seen for building up strings compared to other languages such as Python... my mistake!

tpapp commented 9 years ago

I may be misreading things, but it seems that this issue was started not because many people have problems with string concatenation in Julia as it is now, but because many are fed up with the recurring discussions, especially the last one which went on for a while. However, in this case I am not sure that this issue is a problem with Julia per se in the technical sense.

I have re-read a few of the previous discussions and none of them share the tone of the most recent one. Most of them are very friendly: people can't figure out string concat, ask on the list, learn about *, maybe ask about its history/justification, and then generally move on.

If it's the discussions, but not the choice of operator itself that is a concern, maybe it could be handled without changing the language, at least for now. The FAQ could explain the situation, and suggest that new users are kindly asked to refrain from opening the issue up for a while.

IMO this would be the least costly solution, in terms of broken code and programmer hours.

IainNZ commented 9 years ago

This thread raises my blood pressure, which is no easy task, but I feel since it is a design-by-mailing-list type discussion, I just want to throw my (day-to-day Julia-using) weight behind pretty much everything @stevengj says. I think changing the operator to ++ is OK, but will be just triggering a new round of moaning that it isn't just + (and that it looks like increment), as @JeffBezanson points out. I don't understand the issue with string * char given that chars aren't integers anymore - is there any problem there?

If the operator is to change, @mbauman has the right idea of the timeline. I think it'd need to be:

  1. Add as an alternative to * now. No one who is supporting 0.3 and 0.4 can use it. Adding a deprecation warning for *(string,string) would be inappropriate as anyone using 0.4 with a package trying to support 0.3 would be hammered with them.
  2. 0.4 comes out. Packages start shifting from 0.3/0.4 to 0.4/0.5
  3. Deprecation warning added for 0.5, so packages that support 0.4 and 0.5 only can shift to ++.
  4. 0.5 comes out. Remove * from 0.6.

Is that really worth it?

JeffBezanson commented 9 years ago

Adding a sequence concatenation operator, which we simply don't have, is a legitimate idea. Obviously * won't be used for appending lists or arrays. However perhaps having an operator invites O(n^2) x ++= y loops too much. Technically the syntax and performance are orthogonal, but in practice syntax can have this kind of effect.

ScottPJones commented 9 years ago

@IainNZ I don't think it is really true that Chars aren't integers anymore, it is just that certain operations have been removed (*, /, and ^), but I think that just creates other inconsistency issues... (i.e. why do + and - work, but not *...)

pao commented 9 years ago

@ScottPJones

julia> Char <: Integer
false

EDIT: ...and I'm with @IainNZ on the blood pressure thing.

ScottPJones commented 9 years ago

@JeffBezanson ++= y wouldn't invite O(n^2) loops any more than *= y does currently, and might give the compiler more of a chance to optimize that maybe?

ScottPJones commented 9 years ago

@pao, correct, I'd noticed that * stopped working between 0.3 and 0.4, and that +/- still did... so those must have been added specifically for the Char type?

JeffBezanson commented 9 years ago

I'm more worried about the problem spreading to other types, assuming ++ supports arrays and lists.

We now treat Char as an ordinal type.
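A small sketch of what "ordinal" means in practice, assuming Julia 1.x semantics (which may differ in detail from the 0.4 development version under discussion); the variable names are illustrative.

```julia
# Char is not an Integer subtype, but it supports offset arithmetic:
# Char + Integer gives a Char, and Char - Char gives an integer distance.
is_integer_subtype = Char <: Integer
next_char = 'a' + 1
distance  = 'd' - 'a'
ordered   = 'a' < 'b'

println((is_integer_subtype, next_char, distance, ordered))
```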

ScottPJones commented 9 years ago

@JeffBezanson I'd say you can't stop people from doing stupid things... try to give them good tools to make their life easier, and try to warn them yes, but don't not change because some people might misuse it...

JeffBezanson commented 9 years ago

I think it would allay my fears almost entirely if we don't have a ++= operator. Many people assume, and I really can't blame them, that an operator like += is mutating.

ScottPJones commented 9 years ago

Having the operator might give you the chance of optimizing things the way CPython did more easily...

stevengj commented 9 years ago

It's not really practical to optimize S=""; for s in list; S*=s; end like CPython, regardless of the spelling. Julia doesn't have reference counts, so there's no way for the *(S,s) function to know that it has the only reference to S and hence that S is safe to mutate. The compiler could figure this out in some cases, of course, but the main point of Julia's design is that it doesn't privilege built-in types over user-defined types. Moreover, magical compiler optimizations make it much more difficult to reason about code performance and to predictably design code for good performance (c.f. CPython vs. PyPy). A ++= operator would change nothing about this, and as @JeffBezanson says it would probably encourage people to over-use array concatenation too. See also the discussion in #7052.

In any case, I don't see it as a big problem that the optimal code structure in Julia is slightly different from the optimal code structure in CPython. It would be a big problem if it were much harder to get good performance in Julia than in CPython, but join(list) is not particularly hard/convoluted, nor is writing to an IOBuffer().

Note also that Julia is not particularly unusual in repeated concatenation being O(n^2); see e.g. Ruby or Go. The fact that concatenation (as opposed to e.g. append!) is never mutating is an easy to understand, predictable, common behavior.
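To illustrate the distinction between non-mutating concatenation and a mutating append!, here is a minimal sketch assuming Julia 1.x; the variable names are illustrative.

```julia
# Concatenation returns a new string; the operands are left untouched.
s = "abc"
t = s * "def"

# append! modifies its first argument in place and returns it.
v = [1, 2]
w = append!(v, [3, 4])

println((s, t))
println((v, w === v))   # === tests whether both names refer to the same object
```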