lucatrv opened this issue 6 years ago:

The following code fails:

```julia
a = ["aa", "bb", "cc", "dd"]
splice!(a, 2, "zz")
```

with message:

```
ERROR: MethodError: Cannot `convert` an object of type Char to an object of type String
```

while `insert!(a, 2, "zz")` works. With `splice!` I need to write `splice!(a, 2, ["zz"])`.
For some added fun, this actually inserts an `#undef` at index 2, even though the splice failed:
```julia
julia> a
4-element Array{String,1}:
 "aa"
 "bb"
 "cc"
 "dd"

julia> splice!(a, 2, "zz")
ERROR: MethodError: Cannot `convert` an object of type Char to an object of type String
Stacktrace:
 [1] setindex!(::Array{String,1}, ::Char, ::Int64) at ./array.jl:684
 [2] splice!(::Array{String,1}, ::Int64, ::String) at ./array.jl:1248
 [3] top-level scope

julia> a
5-element Array{String,1}:
 "aa"
 #undef
 "bb"
 "cc"
 "dd"
```
The behavior reported in the OP is confusing and I can't speak to whether it's intended (I suspect not), but I'm going to call this a bug, since if nothing else, inserting an `#undef` when the operation fails is almost certainly wrong.
This happens because `length(ins)` is used to check whether the replacement `ins` is a single item or a collection. The same problem happens with `splice!([[1], [2]], 2, [1])`. Reading the docstring again, this seems like the intended behavior: the replacement should be a collection, not a single item. The docs could be made more explicit, though, and the last example should be fixed, since it uses a single value (which only works because numbers are iterable).
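To make the ambiguity concrete, here is the nested-array case spelled out (my own minimal example, expanding on the call above):

```julia
a = [[1], [2], [3]]

# Intended as "replace a[2] with the single element [1]", but [1] is
# iterated, so this attempts a[2] = 1 and throws a MethodError because
# convert(Vector{Int}, 1) does not exist -- and, per the bug above, it
# can leave `a` in a corrupted state.
splice!(a, 2, [1])

# Wrapping the replacement expresses "a collection containing [1]":
a = [[1], [2], [3]]
splice!(a, 2, [[1]])   # returns [2]; a == [[1], [1], [3]]
```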
> but I'm going to call this a bug, since if nothing else, inserting an `#undef` when the operation fails is almost certainly wrong.
What is the correct behavior then? Failing mutating operations can have pretty much arbitrary side effects. I don't think it is possible to enforce that any mutation must be resilient to any operation potentially failing.
I think we've had that discussion a few years ago, I believe about `copy!` failing in the middle of an operation because an element cannot be converted to the destination array, but I can't find it now.

The problem is basically that once you've replaced some data, it's too late to recover it if the operation fails. With `splice!`, it would be cheap to keep a copy of the original entry when only one entry is replaced, but when several are replaced that would require making a copy of the sub-array.
Relying on the state of your data after an operation to mutate it failed seems fragile, so I don't think it is worth spending too much effort thinking about it.
It's mainly annoying at the REPL. Anyway, the problem isn't so much the effort, IMHO: if there were a way of handling this without a performance penalty, I think it would be worth it, but so far we haven't found a solution.
I understand that this happens only with arrays of strings. In fact, the following code (taken from the manual) works:

```julia
A = [6, 5, 4, 3, 2, 1]
splice!(A, 5, -1)   # returns 2; A == [6, 5, 4, 3, -1, 1]
```
IMHO it would be preferable for `splice!` to behave the same way for arrays of numbers and arrays of strings. Couldn't `splice!` check whether the provided replacement is of type `String`, and in that case automatically convert it into a one-element string array?
It already treats all types of arguments the same; it always iterates over the argument. We should not add a special case for one type.
> I understand that this happens only with arrays of strings. In fact, the following code works (taken from the manual):
@lucatrv as I noted above, this works because numbers are iterable: they act as one-element collections. Non-number types other than strings have the same problem, for example:

```julia
A = [:a, :b, :c]
splice!(A, 2, :e)   # errors, since a Symbol is not a collection
```
The fact that `splice!` works for numbers without wrapping them in an array shouldn't be a problem; we should just change that example in the manual so that it doesn't confuse people who try it with different types.
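For reference, wrapping the replacement in a collection works for any element type (my own example of the workaround):

```julia
A = [:a, :b, :c]
splice!(A, 2, [:e])   # returns :b; A == [:a, :e, :c]
```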
I've said a few times that I don't think `AbstractString` should be directly iterable. This issue is an example of one of the practical problems that leads me to that belief. The problem is that much of the time we think of a string as just a chunk of text. When we're thinking about a collection of names, or a collection of messages, `Char` is not front-of-mind, and it is easy to forget that generic collection-processing functions will treat a name or a message as a collection of `Char`s.
The solution I've proposed before is to have a `chars(::AbstractString)` iterator as a peer of `codeunits(::AbstractString)` and `graphemes(::AbstractString)`. This seems symmetrical and unbiased, and it hides the awkwardness of byte-indexing into a UTF-8 `String` (and `prevind`, `nextind`, etc.).
But maybe I've been thinking about this backwards. If `AbstractString` is unavoidably a string of `Char`s, let it be that, but let's consider a new type, `AbstractText`.
`AbstractText` is just like `AbstractString`, except that it is not indexable or iterable; instead it has methods for `chars`, `graphemes`, `words`, `lines`, etc. The simplest implementation would be:

```julia
struct Text{T<:AbstractString} <: AbstractText
    s::T
end
```
I wonder: if all the `String`s in `Base` were replaced by `Text`, how many places would be affected by the loss of indexing and iteration? There would be some. But there are also many uses of `String` where indexing and iteration are never used. If `"foobar"` were syntax for creating a `Text`, how many people would notice?
(In many text-processing systems, text is stored as a vector of word or phrase codes. `AbstractText` would work as an interface to this kind of representation as well as being a wrapper for a plain old `String`.)
In such a scheme, should the `chars(::AbstractString)` iterator type have O(n) indexing by character? If you view it as a collection of characters then it seems like it should. That's an OK behavior, since you explicitly asked for characters. But in what mode does one get efficient code-unit-based character indexing? `codeunits(s)` is a collection of code units indexed by code unit, but you don't get characters out of it. Unless you're proposing that `codeunits(s)`, `chars(s)` and `graphemes(s)` all be indexed by code units. That's a more reasonable take, IMO, since code-unit indexing is what they can all implement efficiently. Forcing people to write string slices as `chars(s)[1:n]` seems unnecessarily verbose, however, when slicing by character is the obvious thing to do.
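As a concrete reading of the first option, here is a minimal sketch of a character-indexed wrapper with O(n) `getindex` (my own illustration; `Chars` and its methods are assumptions, not an existing API):

```julia
# A character-indexed view: iteration is as cheap as on the underlying
# string, but random access walks from the start, so getindex is O(n).
struct Chars{S<:AbstractString}
    s::S
end

Base.iterate(c::Chars) = iterate(c.s)
Base.iterate(c::Chars, st) = iterate(c.s, st)
Base.length(c::Chars) = length(c.s)

function Base.getindex(c::Chars, i::Integer)
    idx = firstindex(c.s)
    for _ in 1:i-1              # walk forward i-1 characters
        idx = nextind(c.s, idx)
    end
    return c.s[idx]
end

Chars("αβγ")[2]   # 'β', found in O(n) time
```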
> In such a scheme, should the chars(::AbstractString) iterator type have O(n) indexing by character?

I guess so (with the possible exception of `AbstractString`s that are not efficiently indexable at all, because they are, e.g., unescaped just-in-time during iteration).

> If you view it as a collection of characters then it seems like it should. That's an ok behavior since you explicitly asked for characters.

Agreed.
> But in what mode does one get byte-based character indexing? codeunits(s) is a collection of code units indexed by code unit, but you don't get characters out of it.

How often is byte-based character indexing useful? The stuff I've written lately using byte-based indexing is only concerned with finding 7-bit syntax characters in a UTF-8 stream; all the multi-byte characters are just passed over in this kind of code. That's kind of the beauty of UTF-8: there is a whole class of problems where you can blissfully pretend that the whole world is still ASCII.

I suspect that if you're doing natural-language work in LOTEs (languages other than English) you'll probably be working in the `chars` or `graphemes` domain anyway (or using a higher-level text-processing system with a coded lexicon, stemming, etc.).
I guess if indexing `Char`s or `Grapheme`s by byte index is a common thing, you could just have `char(::String, i)` and `grapheme(::String, i)`.
> Unless you're proposing that codeunits(s), chars(s) and graphemes(s) all be indexed by code units.

That wasn't my intention; I'd find it least surprising for each of these to be indexed in units of its item type. I'd also be happy if these were all efficient as iterators, even if random-access indexing has some degenerate cases. You can always `collect(some_iterator)` if you know you need to do lots of random access.
> ... forcing people to write string slices as chars(s)[1:n]

What about saying `SubString(s, 1:n)` or `SubString(s, 1, n)` instead of `chars(s)[1:n]`?
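Worth noting (my own aside): `SubString` takes code-unit indices, so selecting "the first n characters" of non-ASCII data still needs an index computation:

```julia
s = "αβγδ"                           # every character here is 2 bytes in UTF-8
n = 2
SubString(s, 1, nextind(s, 0, n))    # "αβ": nextind finds the code-unit
                                     # index of the n-th character
```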
> How often is byte-based character indexing useful?

It's the only kind of indexing into variable-width encodings that can be done in constant time, so it's kind of essential.

> The stuff I've written lately using byte-based indexing is only concerned with finding 7-bit syntax characters in a UTF-8 stream, all the multi-byte characters are just passed over in this kind of code. That's kind of the beauty of UTF-8. There are a whole class of problems where you can blissfully pretend that the whole world is still ASCII.

In that case, you were doing byte-based indexing. If you were forced to use character-based indexing, then every indexing operation would be O(n) and all your string code would be slow.

> I suspect that if you're doing natural language stuff in LOTEs you'll probably be working in the chars or graphemes domains anyway (or using a higher-level text processing system with a coded lexicon, stemming etc).

Yes, but all of that has to be defined in terms of some lower-level efficient interface, which has to be based on byte indexing, since otherwise it would not be efficient.
> I guess if indexing Chars or Graphemes by byte index is a common thing you could just have char(::String, i) and grapheme(::String, i).

It seems very confusing that `char(::String, i)` would be different from `chars(::String)[i]`.
I think we're in agreement on the essentialness of byte-based indexing. It's just that I think of what I was doing as byte-based indexing of bytes, not `Char`s (because all the characters I cared about also happen to be single bytes). The new `codeunits` is great for this.
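For instance, the byte-scanning pattern described above looks like this (my own example):

```julia
s = "key: värde"                     # UTF-8 data containing a multi-byte character
cu = codeunits(s)                    # zero-copy byte view of the string
i = findfirst(==(UInt8(':')), cu)    # scan for the 7-bit ASCII colon
@assert s[i] == ':'                  # the byte index is also a valid string index
```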
Re confusion: right now `s::String[i]` is different from `collect(c for c in s::String)[i]`.

I think some amount of confusion is unavoidable in an API that exposes the UTF-8-ness of the implementation in the way that `String` does (`nextind`, etc.).

We could have `char(::String, i)` be char-indexed (and slow) and `char(::CodeUnits, i)` be byte-indexed (and sometimes throw a bad-encoding error).

Or, we could have `utf8(::String)` that returns something where byte-index -> `Char`, and `chars(::String)` that returns something where char-index -> `Char`.
Let's not lock `String` 1.0 into a particular iteration/indexing model. Let's separate iteration and indexing into `char`, `chars`, `graphemes`, `codeunits`, `utf8`, etc. `String` could always be made directly iterable/indexable again later, but after 1.0 it's harder to alter if some scheme is already in place.
.... or, we could have a more opaque `Text` type with no indexing or iteration and try to use it as the default. Presumably `Text` and `String` could have identical memory layouts and differ only in the type bits, so performance shouldn't suffer.
This is an interesting discussion but, frankly, it's way too late. `char(codeunits(s), i)` is a really awkward way to get a single character from a string. It is also not even close to something we can optimize to be efficient right now. So far none of the proposed ideas are practical or efficient with our current compiler technology, and proposing a basic API that requires fancy compiler technology in the first place is a bit questionable.

The basic thing people want from strings is to be able to search for and extract characters and substrings in them. There needs to be some kind of index type to represent locations in strings. For the most part, it is opaque: people don't care what index a `,` is at; they just want a value that represents where the `,` is so that they can get things before and after it. These indices need to be code-unit-based in order to be efficient. (This has nothing to do with UTF-8 in particular, so I'm a bit unclear on why you keep mentioning that; the `AbstractString` model is not specific to UTF-8 at all.)
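That workflow in a small example of my own:

```julia
s = "αβ,γ"
i = findfirst(==(','), s)   # an opaque, code-unit-based index (here 5)
s[i]                        # ',': the Char extracted via that index
s[1:prevind(s, i)]          # "αβ": everything before the comma
```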
Even though the indices must be in terms of code units to be efficient, what people want to extract is not code units; it's characters and substrings. So that's how strings work by default: code-unit indexing of characters and substrings. The logic is fairly inexorable. It's a somewhat mixed mode, but there's no getting around it: the mixed mode is the only thing that's efficient and does what people want. Certainly there are other modes in which one may want to interact with a string.

We already have `codeunits(s)` and `graphemes(s)`, and it makes perfect sense to me to add `chars(s)` as a zero-copy wrapper around a string that provides O(n) character-based indexing of characters and substrings. The fact that the "good mode" of interacting with strings is a mix of code-unit indices and character values is all the more reason why it should be the default, since if I ask for `chars` I would expect character-based indexing and values, and similarly for code units and graphemes. Unless they're all code-unit-based, in which case `chars` gives exactly what we currently default to.
Obligatory cross-ref: https://github.com/JuliaLang/julia/issues/9297
Oh, one more point: yes, we could require writing `chars(s)` in order to interact with a string as a collection of characters. But then it becomes a bit odd that `s[i]` extracts a character from `s`. Why can I get characters from `s` by indexing into it, but not by iterating it?
Getting back to the original `splice!` issue: allocating space as a result of a failed function call is not a desirable state of affairs. It's unavoidable when something interrupts you mid-operation, but a user passing a bad argument should preferably be caught early in the function, without leaving their data in a weird state.

`insert!` deals with this problem by doing the conversion first, before doing any allocation. I've attempted that idea on the `splice!` functions here. Implementing some version of this idea would avoid the allocation and freeing of memory during a type-mismatched `splice!` call.
(The error messages currently produced are still not ideal, though, as they expose and depend on the internals of the function. They're less useful than they could be in diagnosing the issue: running `splice!(a, 3, "xyz")` and getting an error about `Char` when there's no `Char` in your call would be pretty confusing to a newbie. But I couldn't think of a way to avoid that without sacrificing generality and/or efficiency.)
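A minimal sketch of that convert-first ordering, using a hypothetical `checked_splice!` (my illustration, not the actual code from the linked attempt):

```julia
# Materialize the replacement with the destination element type *before*
# mutating, so a failed conversion throws while the array is still intact.
function checked_splice!(a::Vector{T}, i::Integer, ins) where {T}
    repl = T[x for x in ins]   # conversion happens here and may throw
    removed = a[i]
    deleteat!(a, i)
    for (k, x) in enumerate(repl)
        insert!(a, i + k - 1, x)
    end
    return removed
end
```

With this ordering, `checked_splice!(["aa", "bb"], 2, "zz")` still throws (a `String` iterates as `Char`s), but the array is left untouched.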