String search inconsistency

bkamins commented 7 years ago

The following searches return unexpected results:

julia> search("1", Char[], 2) # is bounds error for start=3
2

julia> search("", 'a', 1) # is bounds error for start=2
0

julia> search("1", 'a', 2) # is bounds error for start=3
0

julia> search("1", "", 2) # probably should be bounds error, 2:1 is not allowed by the documentation
2:1

julia> search("1", "", 20) # probably should be bounds error - similar case
0:-1

julia> search("∀", "1", 4) # should be bounds error, it is an error for start equal to 3 and 5
0:-1

In general the reason is that the functions handle index out of range by one in a special way. But maybe it is intended?

Additionally there is an inconsistency:

julia> search("123", "") # no match indicator
1:0

julia> search("123", Char[]) # match indicator
1

Is it intended?

In summary I have two questions:

if start passed to search is out of bounds should the function always throw an error (or rather return no match result) - I would say always throw an error;
how matching of empty structs like "" and Char[] should be handled - I would say it should be no match in both cases;

booleanings commented 7 years ago

if start passed to search is out of bounds should the function always throw an error (or rather return no match result) - I would say always throw an error;

I'm not so sure about this. Most languages would just return no match. The purpose (totally assuming) being that you might want to check if a string is contained in a string you don't know the length of and don't care if it's out of bounds. I think the functionality is more in substring when you try to substring further than the start index.

Definitely +1 to the Char[] + "" handling though.

bkamins commented 7 years ago

I agree. I simply did not want to change the existing functionality too much (which throws errors).

If we are flexible I would propose the following rules (applied in this order):

if string is empty or chars is empty return no match (0 or 0:-1) respectively
never throw an error but properly handle start values:
- if start<1 then assume start=1
- if start>endof(str) then return no match
- if start is a valid index work as current search works
- if start is an invalid index use nextind(str, start) to start matching (currently search throws an error in this case)

In this way string would never throw an error (if there are any comments to those rules I am open to change them - the key issue for me is to be consistent).

Additionally I would clean up the code not to use string as a name of a variable and rename it to str as string conflicts with a library function. If this would be agreed I can make a proper PR.

nalimilan commented 7 years ago

I'm in favor of throwing an error when an out of bounds index is passed. I don't see the use case for passing an invalid index: if you don't know, don't pass an index, or pass 1. That approach is safer as it helps catching bugs, and we can always change it later; OTC if we accept invalid indices in 1.0 we'll have to retain that behavior in all 1.x releases. It will also be more consistent with recent changes to getindex and SubString.

I'm less sure about what to do when looking for the empty string/empty list of chars. I think the reasoning is that the empty string is found everywhere between two subsequent characters, but it has a zero length, so the first match i 1:0, the second match 2:1 (in ASCII at least), etc. But that doesn't make much sense for an empty list of characters. Either throwing an error or returning 0 sounds fine to me.

Cc: @stevengj

bkamins commented 7 years ago

@ararslan When there is a consensus about the correct behavior it would be good to have 7d244be10431718b3393d8fd97a359017c556a67 merged as it reimplements rsearch to update it accordingly in the PR following this discussion.

ararslan commented 7 years ago

There are still test failures in that PR that I haven't had time to figure out but yeah it'd be good to have that merged soon.

JeffBezanson commented 7 years ago

that the empty string is found everywhere between two subsequent characters, but it has a zero length

:+1:

For an empty list/set of characters, I think the result should be no-match, since we did not find a character that's in the set.

bkamins commented 7 years ago

In #24121 I have made a proposal of a consistent implementation following the discussion. The docstring given there gives all relevant cases that are covered.

In short:

search into an empty string without start index always returns 0 or 0:-1;
search into an empty string with start index given always throws an error;
passed index has to be a valid one unless we search String for Union{Int8, UInt8} in which case it only has to be within bounds (does not have to point to a valid character index);
when a string is passed as the second argument the returned range must meet a predicate s[search(s,x)] == x or 0:-1 is returned;
when a collection of characters is passed as the second argument the returned range must meet a predicate s[search(s,x)] in x or 0 is returned (earlier docstring defined a different predicate here);

nalimilan commented 7 years ago

What do people think about the particular case of search("", "abc") and search("", "abc", 1)? 1 is not a valid index for "", so we could raise an error in the second case, and return 0:-1 in the first case. But that will probably be problematic in practice since that would mean there's no index that works with any string, which is likely to create bugs that can go unnoticed until you test with an empty string. Since start("") == 1, I'd be inclined to accept 1 as a special case instead of throwing an error. See https://github.com/JuliaLang/julia/pull/24116#issuecomment-336801146 and following comments.

bkamins commented 7 years ago

Another option is to remove a third argument (idx) from all search functions altogether. This would simplify the logic of those functions (and would make them easier to understand by users).

I am not sure how important is the case of using idx is, but the consequences of removing this third argument are:

in the case when it is removed we do not have to check if it is valid (and I assume this is the most common use case so we actually improve the performance a bit here);
if user wants to make a search from an index idx it is simple to pass SubString(s, idx) as an argument to a search function and it should have low overhead (and it might be acceptable if this scenario is not used very often in practice).

nalimilan commented 7 years ago

As I noted elsewhere, these functions are going to be merged with find* soon, so they should be consistent with them. So far we always support passing an index. It's been discussed to replace them with a findeach iterator, but even if we add that it's not clear that we wouldn't want to keep the functions accepting an index since they can be useful (e.g. if you need to pass the index around). Also using SubString doesn't sound very intuitive. In general, until the new unified API is implemented, I'd rather avoid changing things too much.

bkamins commented 6 years ago

@nalimilan This is the state of this issue after removal of search:

julia> findfirst("", "123")
1:0

julia> findfirst(equalto(Char[]), "123")

julia>

Different return value, but this is probably intended.

julia> findnext("", "123", 5)
0:-1

julia> findnext("4", "123", 5)
ERROR: BoundsError: attempt to access "123"
  at index [5]
Stacktrace:
 [1] findnext(::Base.EqualTo{Char}, ::String, ::Int64) at .\strings\search.jl:8
 [2] _searchindex(::String, ::String, ::Int64) at .\strings\search.jl:165
 [3] _search at .\strings\search.jl:233 [inlined]
 [4] findnext(::String, ::String, ::Int64) at .\strings\search.jl:267
 [5] top-level scope

I am not sure if this is intended.

julia> findnext("", "", 10)
0:-1

julia> findnext(r"", "", 10)
ERROR: BoundsError
Stacktrace:
 [1] findnext(::Regex, ::String, ::Int64) at .\regex.jl:271
 [2] top-level scope

julia> findnext(equalto(Char[]), "", 10)
ERROR: BoundsError: attempt to access ""
  at index [10]
Stacktrace:
 [1] findnext(::Base.EqualTo{Array{Char,1}}, ::String, ::Int64) at .\strings\search.jl:107
 [2] top-level scope

I am not sure if this is intended

nalimilan commented 6 years ago

Actually I haven't really changed the underlying code when merging search into find* functions, so it's not surprising that the same issues remain. It still looks like we could only allow index 1 for the empty string, and throw an error for other indices.

@StefanKarpinski Opinions about these behaviors?

StefanKarpinski commented 6 years ago

We should probably allow 1 through ncodeunits(s)+1 returning nothing and throw a bounds error otherwise. The the default start index of 1 is always allowed and returns nothing if there are no occurrences of the needle.

nalimilan commented 6 years ago

Why +1? Just so that 1 is valid for "" without special-casing it?

StefanKarpinski commented 6 years ago

Yes, that was my thinking. Usually not allowing any valid index for the empty string case gets ugly.

StefanKarpinski commented 6 years ago

Another way to look at it: empty matches are a possibility for regex search and the input value is the start of an empty match in an empty string, so ncodeunits(s)+1 is a valid output index and should therefore also be a valid input index. In general, I've found that you want to throw an error only if the input couldn't possibly lead to an valid output. For example, nextind(s, codeunits(s)+1) is invalid because it's starting out of bounds in the direction that nextind advances. In this case, since findnext(r"", "") == 1:0 it seems clear (to me) that 1 is a valid starting point since it is the starting point of the match that is returned.

JuliaLang / julia

String search inconsistency #24103