JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Taking string concatenation seriously, or, a proposal to deprecate *, ^ for string concatenation and repetition #11030

Closed quinnj closed 7 years ago

quinnj commented 9 years ago

The frequency and vehemence of discussions around this subject beg for a change. * and ^ were introduced for strings back when the language wasn't as strict on operator punning and overall meaning.

As a first step, I propose we deprecate these two methods for string operations.
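
For readers who want to see what the proposal would leave behind, here is a minimal sketch of the function spellings. It is not from the thread and is written against current Julia 1.x, which postdates this discussion; it assumes only the Base functions `string` and `repeat`.

```julia
# Function spellings of the two operations under discussion.
s = string("foo", "bar")   # concatenation without `*`
r = repeat("ab", 3)        # repetition without `^`

println(s)
println(r)

# In current Julia the operator forms are still defined and agree with these.
println(s == "foo" * "bar")
println(r == "ab"^3)
```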

As a next discussion, we can talk about the possibility of using a different operator(s) for concatenation/repetition. Just using repeat, with no operator, has been suggested, as well as the following for string concatenation:

Things to consider:

timholy commented 9 years ago

+1 for no infix operators at all. This subject attracts too much noise, and O(n) for a * b * c * d ... concatenation isn't good.

If there is discussion about alternatives, then +100 for moving it to the julia-infix-operator-debates mailing list.

johnmyleswhite commented 9 years ago

+1 for no infix operators at all. This subject attracts too much noise, and O(n) for a * b * c * d ... concatenation isn't good.

+1 to that

staticfloat commented 9 years ago

If there is discussion about alternatives, then +100 for moving it to the julia-infix-operator-debates mailing list.

:laughing:

kmsquire commented 9 years ago

LOL. +1 on the julia-infix-operator-debates.

(I'll personally feel sad to see this use of * and ^ go...)

staticfloat commented 9 years ago

Stefan recently gave a nice, succinct explanation for why he wants to see them go; I'm going to quote it here:

My problem with * for string concatenation is not that people find it unexpected but that it's an inappropriate use of the * generic function, which is agreed upon to mean numerical multiplication. The argument that strings form a monoid is kind of thin since lots of things form a monoid and we're generally not using * for them. At the time I introduced * for strings, we were a lot less strict about operator punning – recall | and & for shell commands – we've gotten much stricter over time, which is a good thing. This is one of the last puns left in the standard library. The reason ++ would be better is not because it would be easier to learn (depends on where you're coming from), but because the ++ operator in Julia would unequivocally mean sequence concatenation.

Note that the operator punning is not an entirely academic concern. The Char corner case shows where the punning can cause problems: people might reasonably expect 'x' * 'y' to produce either "xy" or 241. Because of this, we just make both of these operations no method errors, but it would be perfectly reasonable to allow 'x' ++ 'y' to produce "xy". There's a lot less of a case for having 'x' * 'y' produce 241 or 'ñ', but the sequence concatenation operation does actually make sense.
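
The two readings of 'x' * 'y' mentioned above can be checked directly. This is a small sketch in current Julia 1.x, not code from the thread; the variable names are only illustrative.

```julia
x, y = 'x', 'y'

code_sum = Int(x) + Int(y)   # numeric reading: add the code points (120 + 121)
as_char  = Char(code_sum)    # the character at that code point, U+00F1
as_text  = string(x, y)      # sequence reading: concatenate the two characters

println(code_sum)
println(as_char)
println(as_text)
```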

ssfrr commented 9 years ago

References to prior discussions: https://groups.google.com/forum/#!msg/julia-dev/4K6S7tWnuEs/RF6x-f59IaoJ https://groups.google.com/d/msg/julia-users/nQg_d_n0t1Q/9PSt5aya5TsJ https://groups.google.com/d/msg/julia-users/JnTy-XcfLF8/JeeHREk2TvwJ JuliaLang/julia#1771 JuliaLang/julia#2301

ssfrr commented 9 years ago

I for one agree that ++ as a general sequence concat operator is clear and explicit, and agree that the Char example brought up by Stefan is a good example of where this simplifies things by disambiguating the user's intent.

pao commented 9 years ago

I didn't see this before unleashing a rant on the unsuspecting latest derailed mailing list thread, where I suggested julia-stringconcat in lieu of the (better) julia-infix-operator-debates. +INTMAX. Kill the infix operators.

ScottPJones commented 9 years ago

I really think you should avoid using anything that already has a meaning (other than string concatenation) in other major languages in the world. I spent too much time seeing bugs because of developers going back and forth between multiple languages that overused simple operators, which meant different things in different languages or in different contexts.

1) You need something that does not have another meaning for vectors, because people who do string processing for a living expect to be able to use strings as vectors of characters (and vice-versa). That would rule out +, *, ^, &, and ||.

2) You need something that is not confusable to most programmers (not just the numerical computing world). That rules out [] (empty array), <> (SQL and other languages). I think ++ would be a little confusable, but as it is a unary operator in C/C++/Java/etc., and this would be a binary operator, I think that it would be fine.

3) You need a simple infix operator, at least for concatenate, otherwise you'll get pasted with tons of virtual tomatoes by all of us who are doing string processing.

I'd vote for ++: it is used for concatenation in a reasonably popular language, i.e. Haskell; it does evoke the idea of adding two strings together, i.e. concatenating them; and it does not have any other meaning for vectors/arrays, so it could be used as a general vector/array concatenation operator, which is also good (per point 1 above).

simonster commented 9 years ago

I don't think ++ as a general sequence concatenation operator is particularly clear. Does "abc"++[1, 2, 3] return:

If we're going to have a string concatenation operator, I'd rather it just be a string concatenation operator and nothing else. (Has anyone complained about the lack of an infix operator for other sequence concatenation operations?)

I'm also fine with not having a string concatenation operator, but the presence of such an operator in most other languages makes me wonder if I'd miss it if I were doing more string-heavy projects like web stuff. I'm fine with not having an infix operator if we decide we don't need it because interpolation tends to be more useful than concatenation, but if it's because numerical workflows don't do too much concatenation, I'd think twice.

pao commented 9 years ago

Whether there should be a replacement is a decision that can be deferred. For once, can we keep a string concatenation-related issue narrowly defined?

simonster commented 9 years ago

If we're going to introduce a replacement, I think it makes the most sense to deprecate * and introduce the replacement at the same time, so that people can actually use the replacement when they update their code.

ScottPJones commented 9 years ago

@StefanKarpinski you'd also get the nice behavior of "mystring" ++ '\u2000', which very annoyingly doesn't work now with "mystring" * '\u2000'.

@simonster, it makes sense to me, as somebody who spends most of their time with string processing...

a = UInt8[1,2,3]
"abc" ++ a
[97, 98, 99, 1, 2, 3]

(If you combine a Vector with a string, which is immutable, you'd much rather get back another mutable vector; you can always convert it to an immutable string with UTF8String later.)
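
A rough sketch of that behavior in current Julia 1.x follows. `concat_bytes` is a hypothetical helper standing in for the proposed `"abc" ++ a`; it is not an existing Base function, and the thread's `UTF8String` is replaced by today's `String`.

```julia
# Hypothetical stand-in for the proposed `"abc" ++ a`:
# take the string's bytes and append the vector, returning a mutable Vector{UInt8}.
concat_bytes(s::String, v::Vector{UInt8}) = vcat(collect(codeunits(s)), v)

a = UInt8[1, 2, 3]
out = concat_bytes("abc", a)
println(out)

# Converting back to an immutable string later; `copy` keeps `out` intact.
back = String(copy(out))
println(length(back))
```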

pao commented 9 years ago

Then this issue will devolve into every other discussion about this ever. The community has already established that it is unable to handle the topic. It's the ultimate bikeshed and there are a lot of colors to choose from.

If I sound irritated by this, it's because I am. Here's my experience. "Hey, you can't glue strings together with +?" "Yeah, that's because we use *." "Oh, okay then." At which point I moved on with my life.

So no, I don't think we should discuss alternative infix operators in this issue, because we'll never make progress if we do.

simonbyrne commented 9 years ago

Does "abc"++[1, 2, 3] return?

Obviously a NaN:

https://www.destroyallsoftware.com/talks/wat

Keno commented 9 years ago

To voice my opinion on the matter, I have used languages whose string concatenation operator was ., +, (space), and ++. When I started Julia and learned that * was the concat operator, my first thought was "cool, that makes sense", because I never really liked +. The one argument in favor of not using * that I like is the one given by @StefanKarpinski about the ambiguity between Char as an integer and Char as a 1-character string. As such, it seems ++ as a concat operator is reasonable, though in that case we should give it clear semantics. The three options for generic ++ (what it should do if the types are equal seems clear) that seem reasonable to me are:

++(x,y) = ++(string(x),string(y))
++(x,y) = #MethodError
++(x,y) = ++(promote(x,y)...)

Where promote promotes an appropriate container type. The last option would imply

x = Uint8[1,2,3]
"abc"++x == Uint8['a','b','c',1,2,3]

ScottPJones commented 9 years ago

@keno, I think that's not correct, because 'a' is a Char, a 32-bit type. So the answer would need to be either UInt8[97, 98, 99, 1, 2, 3] or Char['a','b','c','\x01','\x02','\x03'].

FrancoisFayard commented 9 years ago

I vote for ++

ScottPJones commented 9 years ago

Actually, if you have an ASCIIString, it could promote to just UInt8[], but a UTF8String (as well as UTF16String and UTF32String) would need to promote to Char[].
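
One way to picture this promotion rule is the sketch below, written in current Julia 1.x. `concat_promoted` is a hypothetical name, `isascii` stands in for the old ASCIIString/UTF8String type distinction, and the return type depends on the string's content, which a real design would probably want to avoid.

```julia
# Hypothetical promotion rule: ASCII text joins a byte vector as bytes,
# any other text widens both sides to a vector of characters.
function concat_promoted(s::String, v::Vector{UInt8})
    if isascii(s)
        return vcat(collect(codeunits(s)), v)   # Vector{UInt8}
    else
        return vcat(collect(s), Char.(v))       # Vector{Char}
    end
end

println(concat_promoted("abc", UInt8[1, 2, 3]))
println(concat_promoted("a\u00f1", UInt8[1]))
```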

ScottPJones commented 9 years ago

(and that sort of promotion would be very useful for my string processing...)

jiahao commented 9 years ago

This issue could be titled "Taking string concatenation seriously".

carlobaldassi commented 9 years ago

the ambiguity between Char as an integer and Char as a 1 character string.

I'll just note that:

julia-0.4> Char <: Integer
false

julia-0.4> 'a' * 'b'
ERROR: MethodError: `*` has no method matching *(::Char, ::Char)
Closest candidates are:
  *(::Any, ::Any, ::Any)
  *(::Any, ::Any, ::Any, ::Any...)

so no, Char is not an integer, and hasn't been since a while in the 0.4 series, and therefore there's no ambiguity whatsoever. String * Char could perfectly well return the concatenated string, etc. That argument is just obsolete.
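
For comparison, the same checks in current Julia 1.x (which postdates this thread) are sketched below. To the best of my knowledge `*` is now defined for any mix of strings and characters, so the MethodError shown above no longer occurs.

```julia
# Char is still not an Integer subtype.
println(Char <: Integer)

# `*` on characters and strings concatenates.
println('a' * 'b')
println("abc" * 'd')
println("mystring" * '\u2000' == string("mystring", '\u2000'))
```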

mbauman commented 9 years ago

Please let's not subject ourselves to 200+ comments before we feel like it's been taken seriously enough.

Can someone just make a PR? I think everyone is in favor of deprecating *, ^ (if only to remove the mailing list bug). The ++ operator seems to be getting decent traction, but making it general is clearly tricky. There are tricky semantics (similar to push! vs. append!), poor algorithmic complexity, and no clear need for other iterables. So let's just make it work well for strings (and maybe chars) and call it a day.

Keno commented 9 years ago

@ScottPJones Sure, I was writing it that way for illustrative purposes, since Chars can convert to Uint8s if they are in range. Agreed on the UTF8String promotion problem.

StefanKarpinski commented 9 years ago

@jiahao: This issue could be titled "Taking string concatenation seriously".

LOL.

carnaval commented 9 years ago

Anyone in for a batch order?

staticfloat commented 9 years ago

I think I'd want one, but can I get it with ++ instead of *?

Okay, sorry. Continuing the injokes is fun, but let's stay focused. Let's try to come up with a bare minimum set of features that a PR could reasonably implement:

Anything that generalizes to other containers I think we can hash out inside the PR.

ScottPJones commented 9 years ago

I want one with ++! :grinning:

ScottPJones commented 9 years ago

@staticfloat :100: :+1:

ScottPJones commented 9 years ago

If we want to have a real "taking strings seriously" discussion, for example, like performance issues related to trying to make strings be \0 terminated, where can we do that? (think about the very common substring or slice operation on a string... with Julia you have to create a new string every time)

vtjnash commented 9 years ago

if we're incurring string breakage anyways, it seems like as good a time as any to eliminate $ too.

my next-best-favorite alternative to not causing breakage is probably the operator-free version (https://github.com/JuliaLang/julia/tree/jb/strjuxtapose)

JeffBezanson commented 9 years ago

+1 to deprecating * and ^ for strings.

I sense a lot of obscurity around the ++ operator. Right now it's nice, for example, that "$a$b" and string(a,b) do exactly the same thing. It would be easy to confuse this with a++b. How often do you need to concatenate a string with an array? That's a strange operation, since it's not clear what the array elements refer to --- could be code points, or raw data.
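
The equivalence between interpolation and `string` mentioned here is easy to check. A minimal sketch in current Julia 1.x, with arbitrary example values:

```julia
a, b = 1.5, :sym

interpolated = "$a$b"        # interpolation stringifies each value
called       = string(a, b)  # so does the function form

println(interpolated)
println(interpolated == called)
```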

StefanKarpinski commented 9 years ago

I'm reluctant to even engage in this discussion, but I feel compelled to mention one possibility that has come up in the past (there was even a PR implementing it at one point): using juxtaposition for string concatenation. You would write the following:

"foo"  "bar" # "foobar"
"foo"   bar  # "foo$bar"
 foo   "bar" # "$(foo)bar"
 foo "" bar  # "$foo$bar"

Before, this had the drawback that there was no operator form of it, e.g. that you could pass to reduce, but that's not true anymore since you can use call overloading to make ""(args...) do string concatenation. Thus, you could write reduce("", objs) and get a concatenation of the stringifications of a collection of objects. This could be generalized by this:

julia> call{S<:String}(str::S, args...) = join(args, str)
call (generic function with 934 methods)

julia> reduce("", [1,"foo",1.23])
"1foo1.23"

julia> reduce(",", [1,"foo",1.23])
"1,foo,1.23"

pao commented 9 years ago

If you're about to comment on what @StefanKarpinski just wrote, please read JuliaLang/julia#2301 first.
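
The `call{S<:String}(...)` definition in @StefanKarpinski's comment uses the 0.4-era call-overloading syntax. A rough modern counterpart, sketched under the assumption that a small wrapper type is acceptable, is a callable struct. `Joiner` is a hypothetical name; wrapping avoids adding call methods to Base's own string types.

```julia
# Hypothetical callable wrapper: calling it joins its arguments with `sep`.
struct Joiner
    sep::String
end

(j::Joiner)(args...) = join(args, j.sep)

println(reduce(Joiner(""), [1, "foo", 1.23]))
println(reduce(Joiner(","), [1, "foo", 1.23]))
```

Because `reduce` calls the operator on two arguments at a time, this still builds the result pairwise; `join(objs, sep)` does the same job in one pass.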

ScottPJones commented 9 years ago

@stefankarpinski Ugh!!! Had no end of errors in code from Multivalue/Pick applications, because they used juxtaposition... hard to tell just what the code was really doing. Also, what happens with macro arguments... whitespace is significant in Julia, so @foo "Scott" "Paul" "Jones" to a macro expecting 3 arguments just starts breaking, right?
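
The macro concern can be made concrete with a small sketch in current Julia 1.x. `@nargs` is a hypothetical macro that just reports how many arguments it was given; today the three space-separated literals arrive as three arguments, which is the behavior juxtaposition-as-concatenation would put in question.

```julia
# Hypothetical macro: expands to the number of arguments it received.
macro nargs(args...)
    return length(args)
end

n = @nargs "Scott" "Paul" "Jones"
println(n)
```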

ScottPJones commented 9 years ago

@JeffBezanson If I have to use a Vector{UInt8} or Vector{Char} for mutable strings, to do my string processing, then I really would like to be able to concatenate an immutable string to one of them... just like people complain about not being able to concatenate strings and Chars now, those are both operations that are frequently done.

JeffBezanson commented 9 years ago

But what does concatenating a string with a Vector{UInt8} do? What if the vector contains UTF-8?

ScottPJones commented 9 years ago

@JeffBezanson Concatenating a Vector{UInt8} with a UTF8String should probably be an error. Concatenation with an ASCIIString would be fine (returning a Vector{UInt8}). Concatenation of a Vector{Char} with a UTF8String should return a Vector{Char} (i.e. do the UTF8->UTF32 conversion first). For performance, I'd first check how many logical characters the UTF8String has, create an output buffer big enough for both, then copy the Vector{Char} in, and convert the UTF8String right into the buffer.
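
The buffer strategy described above might look roughly like this in current Julia 1.x. `concat_chars` is a hypothetical name, and `String` takes the place of the thread's UTF8String; this is a sketch of the idea, not a tuned implementation.

```julia
# Hypothetical sketch: count characters, allocate once, then fill the buffer.
function concat_chars(v::Vector{Char}, s::AbstractString)
    n = length(s)                              # characters, not bytes
    out = Vector{Char}(undef, length(v) + n)   # one allocation for both parts
    copyto!(out, 1, v, 1, length(v))           # copy the existing characters
    i = length(v)
    for c in s                                 # decode the string into the buffer
        i += 1
        out[i] = c
    end
    return out
end

println(concat_chars(['a', 'b'], "\u00f1\u00e9"))
```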

ScottPJones commented 9 years ago

Actually, it probably would be better to punt on any concatenations with Vectors, except maybe Vector{Char}, and have a mutable string package, and add methods for ++ there... A lot cleaner, IMO.

JeffBezanson commented 9 years ago

Yes, I agree, it gets a bit complicated otherwise.

stevengj commented 9 years ago

I think it would be a terrible decision not to have any infix operators at all for string concatenation. It should be a clue that nearly every modern general-purpose language has opted to define some infix operator for this operation. And the fact that other languages make many different choices for the operator indicates that there is no ironclad convention that we stray from to our peril.

I agree with @pao that the bikeshed over this is counterproductive, and I find it hard to understand why people care so much about the spelling of this. * is easy to get used to, is not that weird, and Char*Char does not come up often enough to be worth worrying about.

elextr commented 9 years ago

The sequence a * b is an alias for string(a, b) except in the special case where a and b are numerical, oh yeah, or numerical arrays, then it means multiply.

It would be better to give string catenation its own operator so that de-sugaring is always true. And if it's not used in any other language then it is fair to all by making everybody equally unhappy :)

That would also make it easier to make a op b op c op d mean string(a,b,c,d), with the obvious performance implications. Then only string() needs performance optimisations (since at the moment it's a very general function).
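
As far as I know, current Julia already parses a chain of `*` into a single n-ary call, so the de-sugaring described here can be inspected directly. The sketch below is written for Julia 1.x and is not code from the thread.

```julia
# A chain of `*` becomes one call expression with all operands as arguments.
ex = Meta.parse("a * b * c * d")
println(ex.head)
println(ex.args)

# For strings, that single call agrees with the `string` form.
a, b, c, d = "w", "x", "y", "z"
println(a * b * c * d == string(a, b, c, d))
```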

mauro3 commented 9 years ago

++ is good. What it does for non-strings can be worked out later.

ScottPJones commented 9 years ago

@stevengj 1) Why do you assume that Char ++ Char does not come up often enough to worry about? This is something that bugs me about the discussions here... I see a lot of “this just isn’t important”... but that is just an opinion, and you have people with experience in string processing telling you that it is important.

2) * is rather confusable for lots of people, as I’d say that most people doing string processing would first think of repetition, never concatenation. I’ve seen many people bring that up.

3) Maybe the amount of negative comments about * as concatenation operator, going back years from what I’ve seen, should have been a clue that it wasn’t the best decision, and it should have been reconsidered back in version 0.1 or 0.2, not when people want to get 0.4 released...

mschauer commented 9 years ago

@simonster Regarding "abc"++[1, 2, 3]. This is a nice example of how the "operator with dot" notation inherited from MATLAB bites us from time to time. For comparison, the concatenation operator in J/APL is , and it comes with a "family of dot operators" distinguished by the slices the operator should work on.

   'abc' , '123'
abc123

   'abc' ,"0 '123'
a1
b2
c3

or even

   'abc' ,"1 0 '123'
abc1
abc2
abc3
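
For readers who do not know J, the three results above correspond roughly to the following Julia 1.x expressions. This is only an analogy (J returns character tables, the sketch returns vectors of strings), and the variable names are illustrative.

```julia
whole    = string("abc", "123")                                # like  'abc' , '123'
pairwise = [string(a, b) for (a, b) in zip("abc", "123")]      # like  ,"0
each_row = [string("abc", c) for c in "123"]                   # like  ,"1 0

println(whole)
println(pairwise)
println(each_row)
```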

This doesn't address the question of type promotion you raised.

Edit: Argh, I wasted the chance to say nothing

stevengj commented 9 years ago

@ScottPJones, plenty of other languages seem to have string-concatenation infix operators but not char-concatenation operators. I don't see a clamor of complaints. You can still concatenate chars by doing string(char1, char2) (or use length-1 strings as in Python), so there is no missing functionality. If you look at existing code in any widespread language, the number of uses of string concatenation vastly outnumber the number of instances of concatenation of two chars.

Claims that char concatenation is anywhere near as important or useful as string concatenation are simply not plausible.

stevengj commented 9 years ago

There will always be negative comments about spelling choices. (People coming from Python will always complain that we need end rather than using indentation.) Tastes differ, and a few people with strong feelings can make a lot of noise. If we choose ++, I guarantee you that newcomers will still complain — "Why didn't you use +? + is so much more discoverable and intuitive because I am used to it from language X."

It's not so much that I particularly like *; I simply don't care that much. My feeling is that continual code churn over pointless spelling changes is more detrimental to Julia than any benefit we will get from substituting one character for another.

stevengj commented 9 years ago

Aside from all of that, ++ will be extremely painful from an upgrade standpoint. Because ++ does not currently parse as an infix operator, there will be no clean way to maintain backward compatibility with Compat — it will be a flag day upgrade, requiring every package using string concatenation to fork into 0.3 and 0.4 versions (or use string(a,b), giving up on infix concatenation entirely).

elextr commented 9 years ago

The fact is that continual code churn over pointless spelling changes is more detrimental to Julia than any benefit we will get from substituting one character for another.

Yes, it should only ever be changed once, from what it is now to the final state (or no change if that's the decision). Deprecating now and adding an operator later, when everyone has changed their code to string(a,b) or "$a$b", is just being mean to the users.

simonbyrne commented 9 years ago

and O(n) for a * b * c * d ... concatenation isn't good.

Can you do better than O(n) for string concatenation?
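
One reading of the exchange: a single concatenation cannot beat time linear in the output length, since every byte has to be written once. The cost that matters is building one string from many pieces two at a time, where each step recopies everything accumulated so far. The sketch below contrasts the two approaches in current Julia 1.x; the function names are hypothetical and no timings are claimed.

```julia
# Pairwise accumulation: each step copies the whole accumulated string again,
# so total copying grows roughly quadratically with the number of pieces.
function concat_pairwise(parts)
    acc = ""
    for p in parts
        acc = string(acc, p)
    end
    return acc
end

# Buffered accumulation: each piece is written once into a growing buffer.
function concat_buffered(parts)
    io = IOBuffer()
    for p in parts
        print(io, p)
    end
    return String(take!(io))
end

parts = [string(i) for i in 1:1000]
println(concat_pairwise(parts) == concat_buffered(parts))
```

Both functions return the same string as `join(parts)`; they differ only in how much intermediate copying they do.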