Open bkamins opened 7 years ago
if start passed to search is out of bounds should the function always throw an error (or rather return no match result) - I would say always throw an error;
I'm not so sure about this. Most languages would just return no match. The purpose (totally assuming) being that you might want to check if a string is contained in a string you don't know the length of and don't care if it's out of bounds. I think the functionality is more in substring when you try to substring further than the start index.
Definitely +1 to the Char[]
+ ""
handling though.
I agree. I simply did not want to change the existing functionality too much (which throws errors).
If we are flexible I would propose the following rules (applied in this order):
string
is empty or chars
is empty return no match (0
or 0:-1
) respectivelystart
values:
start<1
then assume start=1
start>endof(str)
then return no matchstart
is a valid index work as current search
worksstart
is an invalid index use nextind(str, start)
to start matching (currently search
throws an error in this case)In this way string
would never throw an error (if there are any comments to those rules I am open to change them - the key issue for me is to be consistent).
Additionally I would clean up the code not to use string
as a name of a variable and rename it to str
as string
conflicts with a library function.
If this would be agreed I can make a proper PR.
I'm in favor of throwing an error when an out of bounds index is passed. I don't see the use case for passing an invalid index: if you don't know, don't pass an index, or pass 1
. That approach is safer as it helps catching bugs, and we can always change it later; OTC if we accept invalid indices in 1.0 we'll have to retain that behavior in all 1.x releases. It will also be more consistent with recent changes to getindex
and SubString
.
I'm less sure about what to do when looking for the empty string/empty list of chars. I think the reasoning is that the empty string is found everywhere between two subsequent characters, but it has a zero length, so the first match i 1:0
, the second match 2:1
(in ASCII at least), etc. But that doesn't make much sense for an empty list of characters. Either throwing an error or returning 0
sounds fine to me.
Cc: @stevengj
@ararslan When there is a consensus about the correct behavior it would be good to have 7d244be10431718b3393d8fd97a359017c556a67 merged as it reimplements rsearch
to update it accordingly in the PR following this discussion.
There are still test failures in that PR that I haven't had time to figure out but yeah it'd be good to have that merged soon.
that the empty string is found everywhere between two subsequent characters, but it has a zero length
:+1:
For an empty list/set of characters, I think the result should be no-match, since we did not find a character that's in the set.
In #24121 I have made a proposal of a consistent implementation following the discussion. The docstring given there gives all relevant cases that are covered.
In short:
0
or 0:-1
;String
for Union{Int8, UInt8}
in which case it only has to be within bounds (does not have to point to a valid character index);s[search(s,x)] == x
or 0:-1
is returned;s[search(s,x)] in x
or 0
is returned (earlier docstring defined a different predicate here);What do people think about the particular case of search("", "abc")
and search("", "abc", 1)
? 1
is not a valid index for ""
, so we could raise an error in the second case, and return 0:-1
in the first case. But that will probably be problematic in practice since that would mean there's no index that works with any string, which is likely to create bugs that can go unnoticed until you test with an empty string. Since start("") == 1
, I'd be inclined to accept 1
as a special case instead of throwing an error. See https://github.com/JuliaLang/julia/pull/24116#issuecomment-336801146 and following comments.
Another option is to remove a third argument (idx
) from all search functions altogether. This would simplify the logic of those functions (and would make them easier to understand by users).
I am not sure how important is the case of using idx
is, but the consequences of removing this third argument are:
idx
it is simple to pass SubString(s, idx)
as an argument to a search function and it should have low overhead (and it might be acceptable if this scenario is not used very often in practice).As I noted elsewhere, these functions are going to be merged with find*
soon, so they should be consistent with them. So far we always support passing an index. It's been discussed to replace them with a findeach
iterator, but even if we add that it's not clear that we wouldn't want to keep the functions accepting an index since they can be useful (e.g. if you need to pass the index around). Also using SubString
doesn't sound very intuitive. In general, until the new unified API is implemented, I'd rather avoid changing things too much.
@nalimilan This is the state of this issue after removal of search
:
julia> findfirst("", "123")
1:0
julia> findfirst(equalto(Char[]), "123")
julia>
Different return value, but this is probably intended.
julia> findnext("", "123", 5)
0:-1
julia> findnext("4", "123", 5)
ERROR: BoundsError: attempt to access "123"
at index [5]
Stacktrace:
[1] findnext(::Base.EqualTo{Char}, ::String, ::Int64) at .\strings\search.jl:8
[2] _searchindex(::String, ::String, ::Int64) at .\strings\search.jl:165
[3] _search at .\strings\search.jl:233 [inlined]
[4] findnext(::String, ::String, ::Int64) at .\strings\search.jl:267
[5] top-level scope
I am not sure if this is intended.
julia> findnext("", "", 10)
0:-1
julia> findnext(r"", "", 10)
ERROR: BoundsError
Stacktrace:
[1] findnext(::Regex, ::String, ::Int64) at .\regex.jl:271
[2] top-level scope
julia> findnext(equalto(Char[]), "", 10)
ERROR: BoundsError: attempt to access ""
at index [10]
Stacktrace:
[1] findnext(::Base.EqualTo{Array{Char,1}}, ::String, ::Int64) at .\strings\search.jl:107
[2] top-level scope
I am not sure if this is intended
Actually I haven't really changed the underlying code when merging search
into find*
functions, so it's not surprising that the same issues remain. It still looks like we could only allow index 1
for the empty string, and throw an error for other indices.
@StefanKarpinski Opinions about these behaviors?
We should probably allow 1
through ncodeunits(s)+1
returning nothing
and throw a bounds error otherwise. The the default start index of 1
is always allowed and returns nothing
if there are no occurrences of the needle.
Why +1
? Just so that 1
is valid for ""
without special-casing it?
Yes, that was my thinking. Usually not allowing any valid index for the empty string case gets ugly.
Another way to look at it: empty matches are a possibility for regex search and the input value is the start of an empty match in an empty string, so ncodeunits(s)+1
is a valid output index and should therefore also be a valid input index. In general, I've found that you want to throw an error only if the input couldn't possibly lead to an valid output. For example, nextind(s, codeunits(s)+1)
is invalid because it's starting out of bounds in the direction that nextind
advances. In this case, since findnext(r"", "") == 1:0
it seems clear (to me) that 1
is a valid starting point since it is the starting point of the match that is returned.
The following searches return unexpected results:
In general the reason is that the functions handle index out of range by one in a special way. But maybe it is intended?
Additionally there is an inconsistency:
Is it intended?
In summary I have two questions:
start
passed tosearch
is out of bounds should the function always throw an error (or rather return no match result) - I would say always throw an error;""
andChar[]
should be handled - I would say it should be no match in both cases;