Closed GoogleCodeExporter closed 9 years ago
[deleted comment]
In case anyone also runs into this, and is using trimResults() and
omitEmptyStrings(), you can work around it by just appending a space to your
input.
Original comment by travis.d...@gmail.com
on 4 Nov 2012 at 10:27
Relevant: issue 936, which was our first attempt to make sensible behavior
happen on zero-length regexes.
Original comment by wasserman.louis
on 4 Nov 2012 at 5:26
Right, mentioned above also, but I didn't know how to make the fancy link. It's
worth noting that this issue only happens in the special case that the
lookaround match is one character long. That is, the example above, but just
doubling the characters in the input and lookaround:
String input = "FfOoOo";
String regex = "(?=Oo)|(?<=Oo)";
works correctly in Guava. That's because offset >= toSplit.length() logic
doesn't get triggered above.
Original comment by travis.d...@gmail.com
on 4 Nov 2012 at 9:08
Louis did some work on this internally, but we got stuck on what to do in some
odd cases. The perlfunc docs list a few:
"""
Note that splitting an EXPR that evaluates to the empty string
always returns the empty list, regardless of the LIMIT specified.
"""
Guava does not follow this precedent, and I think we're right to make that
decision.
There are plenty of other quirks that I'll bypass, such as:
"""
A PATTERN of "/^/" is treated as if it were "/^/m", since it isn't much use otherwise.
"""
Then we get to the interesting ones:
"""
Empty leading fields are produced when there are positive-width matches at the
beginning of the string; a zero-width match at the beginning of the string does
not
produce an empty field.
...
Empty trailing fields, on the other hand, are produced when there is a match at
the end of the string (and when LIMIT is given
and is not 0), regardless of the length of the match.
"""
We need to decide what Guava should do with zero-width matches at the beginning
and end of string. An additional complication is that String.split does not
completely follow Perl's precedent. We can see this from eagle.xiao's example
on issue 936. In slightly simplified form:
System.out.println(Arrays.toString("a".split("(?<=a)|(?=a)", -1)));
[, a, ]
perl -e 'for (split(/(?<=a)|(?=a)/, "a", -1)) { print "$_\n"; }' | sed -e 's/^$/<blank>/'
a
<blank>
String.split produces a leading empty string; Perl does not.
But String.split usually avoids leading empty strings:
System.out.println(Arrays.toString("a".split("(?=a)", -1)));
[a]
System.out.println(Arrays.toString("a".split("^", -1)));
[a]
Both of these match Perl's behavior:
perl -e 'for (split(/(?=a)/, "a", -1)) { print "$_\n"; }' | sed -e 's/^$/<blank>/'
a
perl -e 'for (split(/^/, "a", -1)) { print "$_\n"; }' | sed -e 's/^$/<blank>/'
a
I wonder if there's a principled reason for the policy difference between
trailing leading and trailing empty strings or whether it's an accident of
historical implementation.
We'll have to decide one way or the other. If we're lucky, no one cares either
way :) As for the rare differences between String.split and Perl, I'm not too
concerned with what we do there. String.split doesn't appear to document its
behavior for this case clearly, so if anything, I'd call its behavior a bug.
But I wouldn't go to great lengths to behave like Perl if the problem lies in
Pattern and Matcher themselves and thus affects Splitter. (I assume that it
does.)
Here's a bonus difference between Perl and String.split:
perl -e 'for (split(/(?<=b)|x/, "fobxx", -1)) { print "$_\n"; }' | sed -e 's/^$/<blank>/'
fob
<blank>
<blank>
<blank>
System.out.println(Arrays.toString("fobxx".split("(?<=b)|x", -1)));
[fob, x, ]
Original comment by cpov...@google.com
on 18 Jan 2013 at 9:54
Here's a justification for Perl's "empty string always returns the empty list"
behavior:
http://grokbase.com/t/perl/perl5-porters/0152aed0vc/docs-on-split#20010503im3myf
qpcr36yeb5ixjk5dmkly
My read on that is that the caller should replace...
next unless @f;
...with...
next unless $_;
...at which point the special case is no longer necessary. But perhaps I'm
misinterpreting some Perl conversion rule.
Original comment by cpov...@google.com
on 18 Jan 2013 at 9:57
Issue 1262 has been merged into this issue.
Original comment by cpov...@google.com
on 18 Jan 2013 at 11:04
1) I just posted to a Perl list asking about the behavior of zero-width matches
at the beginning and end of string:
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2013-01/msg00473.html
2) I should look at Louis's CL, which was intended to fix the original issue
report but was derailed by my additional questions. I know we decided that some
of it was hopeless, but I'm hoping it wasn't *all* hopeless... really hoping....
Original comment by cpov...@google.com
on 18 Jan 2013 at 11:06
If splitting ",a" on /,/ yields ["", "a"] then splitting it on /(?=,)/ should
yield ["", ",a"], should it not? If the separator pattern occurs N times in the
input, the returned iterable should always be of size N+1.
Original comment by kevinb@google.com
on 18 Jan 2013 at 11:37
In a vacuum, I think I agree, though there is at least one minor downside: It's
arguably nice to be able to split on // in order to extract its characters. On
the other hand, we have Lists.charactersOf for this, and Lists.charactersOf
actually gives you Characters, rather than Strings.
Splitting on zero-width matches is just kind of odd to begin with, and aside
from //, I'm not familiar with the use cases. I wonder if some of them would be
better served by high-level interfaces for "extract all matches of this regex"
or similar.
Original comment by cpov...@google.com
on 18 Jan 2013 at 11:51
I went looking for split() implementations in RE2 (no split method?) and Chrome
(stringProtoFuncSplit). Chrome cites the ECMAScript spec, which says to ignore
both leading *and* trailing empty string created by zero-width matches:
"The value of separator may be an empty String, an empty regular expression, or
a regular expression that can match an empty String. In this case, separator
does not match the empty substring at the beginning or end of the input String,
nor does it match the empty substring at the end of the previous separator
match. (For example, if separator is the empty String, the String is split up
into individual characters; the length of the result array equals the length of
the String, and each substring contains one character.)"
ECMAScript also does the questionable thing in which splitting the empty string
returns 0 elements (regardless of the value of |limit|, as far as I can tell),
but we can look past that. At least it doesn't have magic trailing-empty
behavior for negative |limit| :)
It's possible that the ECMAScript spec's detailed algorithm could be of help to
us in our implementation. Then again, I thought I was already more or less
happy with Louis's reimplementation, at least setting aside these weird issues
that I don't think are worth worrying about anymore.
http://www.ecma-international.org/ecma-262/5.1/#sec-15.5.4.14
Finally, I thought a bit more about Kevin's comment #9. In one respect, we have
four options:
1. ignore both empty leading and trailing strings created by zero-width matches
2. preserve both ...
3. ignore trailing ... but not leading
4. ignore leading ... but not trailing
(1) is what ECMAScript does. I think I prefer this, too.
(2) is what nothing does, at least from what I've seen (Perl+Java+ECMAScript).
(3) is what Perl does, but the inconsistency is difficult to explain. I assume
it was more of an accidental choice or a result of implementation convenience.
(4) is what nothing does, and it's inconsistent.
(1) has a precedent, and it avoids a leading/trailing inconsistency. I say we
do that.
Original comment by cpov...@google.com
on 2 Feb 2013 at 8:41
(I see an in-JavaScript implementation of String.split (StringSplit) in
Chrome's v8 engine. I am too lazy to examine its behavior in detail (nor did I
look at the C++ Chrome implementation beyond seeing that it had
sentence-by-sentence comments from the spec), but it cites the same spec
section as the C++ Chrome implementation. Additionally, I tested a few simple
regular expressions in Chrome, and I see the behavior that the spec promises.)
Original comment by cpov...@google.com
on 2 Feb 2013 at 8:46
Not that it matters from a general perspective, but for a single data point
from the original reporter...
My usage assumes (1) and it would definitely be preferred in that case. My
specific use case is splitting a string, without dropping the delimiters - for
example, splitting on arithmentic operators in a string like abc*(1+2) would
return ["abc","*","(","+",")"]. The canonical (only?) way to achieve this kind
of split which preserves all the characters using regex is apparently these
zero width matches.
So there is a real-world use case as cpov wondered about above. In this case,
leading and trailing empty strings are useless (but I'm trimming whitespace and
dropping empty strings anyway, so it isn't a huge burden if they appear).
A split of "" will always produce [""], even with (1) and zero-width matches,
right?
Original comment by travis.d...@gmail.com
on 2 Feb 2013 at 9:19
Sorry, the correct split should be:
["abc", "*", "(", "1", "+", "2", ")"]
dropping the #s was not intentional
Original comment by travis.d...@gmail.com
on 2 Feb 2013 at 9:20
Thanks for the use case. The example of preserving delimiters is a good reason
for preferring (1) or (2) over (3) and (4):
$ perl -e 'for (split(/(?=,)|(?<=,)/, ",a,b,c,", -1)) { print "$_\n"; }' | sed
-e 's/^$/<blank>/'
,
a
,
b
,
c
,
<blank>
Surely nobody wants that, right? Either they want empties at both ends or
empties at neither. (Obviously the example is dumb as I've written it, but
imagine that comma is replaced with comma/semicolon/colon/....)
As you point out, the difference between (1) and (2) goes away in the case of
omitting empty strings, so maybe it's no big deal either way. In the absence of
other use cases, I kind of like following the ECMAScript precedent, but I
wouldn't be shocked if the leading and trailing empties were significant to
someone.
> A split of "" will always produce [""], even with (1) and zero-width
> matches, right?
Yes. I think the way we would look at it, regardless of the (1)/(2)/(3)/(4)
choice, is similar to what Kevin said above: If the separator occurs n times,
we produce an output with n+1 items. The difference between the policies would
be in whether we consider a separator to ever occur before the first character
or after the last.
Now that you mention it, that's a fun question for the other policies. If we
allow a zero-width match to create empty strings at the beginning but not the
end, then I would say that we would likewise have to return a single string
when asked to split "" on //, since that match would be occurring both at the
beginning and at the end, and matches at the end are forbidden. The same would
go for the reverse policy. This ensures the following pattern: If you split
"foo" on //, you get 4 results. If you split "fo" on //, you get 3. "f" gives
2, and "" gives 1. Now suppose that we permit zero-width matches at both
beginning and end. "foo" gives 5 results, "fo" 4," "f" 3, and "" 2. I guess
that that's reasonable, too. By contrast, if we permit zero-width matches at
neither beginning nor end, we get "foo" 3, "fo" 2, "f" 1, "" 1. That's arguably
"worse," but it's more consistent with how normal delimiters work: Splitting
"f,o,o" on "," produces 3 results, "f,o" 2, "f" 1, and "" 1 again. That's maybe
too much of an edge case to base a position on, but I'll offer it as support,
anyway, since it agrees with the way I'm already leaning :)
Original comment by cpov...@google.com
on 2 Feb 2013 at 9:48
Python doesn't allow splitting on empty matches:
http://docs.python.org/2/library/re.html#re.split
For example:
python -c "import re; print re.split(r'$', 'foo')"
Original comment by cpov...@google.com
on 12 Feb 2013 at 1:46
I just got a very helpful response on the Perl list:
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2013-03/msg00390.html
Original comment by cpov...@google.com
on 11 Mar 2013 at 2:38
If we follow the ECMAScript implementation, we should keep in mind that,
according to a post I just saw:
TIL:
Javascript: '1 2 3'.split(/ /, 2) = ['1', '2']
Java: "1 2 3".split(" ", 2) = ["1", "2 3"]
Original comment by cpov...@google.com
on 13 May 2014 at 7:15
This issue has been migrated to GitHub.
It can be found at https://github.com/google/guava/issues/<id>
Original comment by cgdecker@google.com
on 1 Nov 2014 at 4:13
Original comment by cgdecker@google.com
on 1 Nov 2014 at 4:18
Original comment by cgdecker@google.com
on 3 Nov 2014 at 9:08
Original issue reported on code.google.com by
travis.d...@gmail.com
on 4 Nov 2012 at 10:17