Splitter.onPattern drops last token when zero width matches are used

GoogleCodeExporter commented 9 years ago

If you split on a zero-width regex, the last element may be dropped.

String input = "foo";
String regex = "(?=o)|(?<=o)";
Splitter splitter = Splitter.onPattern(regex);
System.out.println(Arrays.asList(input.split(regex)));
System.out.println(Arrays.asList(Iterables.toArray(splitter.split(input),String.
class)));

This does zero-width lookaround for 'o', so the string will be split before and 
after any o, but the o characters will also be returned as individual items.

Note that String.split works correctly here, but Guava splitter drops the last 
o.

Examining the code, this was probably introduced in this fix (granted, it 
didn't work at all before that):

http://code.google.com/p/guava-libraries/issues/detail?id=936

In particular, this bit of logic in Splitter.java:

        if (offset == nextStart) {
          /*
           * (ommit comment)
           */
          offset++;
          if (offset >= toSplit.length()) {
            offset = -1;
          }
          continue;
        }

neglects to the the "last element" handling that the general logic above does 
(setting end to toSplit.lenth() if no more separators are found).

Original issue reported on code.google.com by travis.d...@gmail.com on 4 Nov 2012 at 10:17

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

In case anyone also runs into this, and is using trimResults() and 
omitEmptyStrings(), you can work around it by just appending a space to your 
input.

Original comment by travis.d...@gmail.com on 4 Nov 2012 at 10:27

GoogleCodeExporter commented 9 years ago

Relevant: issue 936, which was our first attempt to make sensible behavior 
happen on zero-length regexes.

Original comment by wasserman.louis on 4 Nov 2012 at 5:26

Added labels: Package-Base, Type-Defect

GoogleCodeExporter commented 9 years ago

Right, mentioned above also, but I didn't know how to make the fancy link. It's 
worth noting that this issue only happens in the special case that the 
lookaround match is one character long.  That is, the example above, but just 
doubling the characters in the input and lookaround:

        String input = "FfOoOo";
        String regex = "(?=Oo)|(?<=Oo)";

works correctly in Guava. That's because offset >= toSplit.length() logic 
doesn't get triggered above.

Original comment by travis.d...@gmail.com on 4 Nov 2012 at 9:08

GoogleCodeExporter commented 9 years ago

Louis did some work on this internally, but we got stuck on what to do in some 
odd cases. The perlfunc docs list a few:

"""
Note that splitting an EXPR that evaluates to the empty string
     always returns the empty list, regardless of the LIMIT specified.
"""

Guava does not follow this precedent, and I think we're right to make that 
decision.

There are plenty of other quirks that I'll bypass, such as:

"""
     A PATTERN of "/^/" is treated as if it were "/^/m", since it isn't much use otherwise.
"""

Then we get to the interesting ones:

"""
Empty leading fields are produced when there are positive-width matches at the 
beginning of the string; a zero-width match at the beginning of the string does 
not
     produce an empty field.
...
Empty trailing fields, on the other hand, are produced when there is a match at 
the end of the string (and when LIMIT is given
     and is not 0), regardless of the length of the match.
"""

We need to decide what Guava should do with zero-width matches at the beginning 
and end of string. An additional complication is that String.split does not 
completely follow Perl's precedent. We can see this from eagle.xiao's example 
on issue 936. In slightly simplified form:

  System.out.println(Arrays.toString("a".split("(?<=a)|(?=a)", -1)));
[, a, ]

  perl -e 'for (split(/(?<=a)|(?=a)/, "a", -1)) { print "$_\n"; }' | sed -e 's/^$/<blank>/'
a
<blank>

String.split produces a leading empty string; Perl does not.

But String.split usually avoids leading empty strings:

    System.out.println(Arrays.toString("a".split("(?=a)", -1)));
[a]

    System.out.println(Arrays.toString("a".split("^", -1)));
[a]

Both of these match Perl's behavior:

  perl -e 'for (split(/(?=a)/, "a", -1)) { print "$_\n"; }' | sed -e 's/^$/<blank>/'
a

  perl -e 'for (split(/^/, "a", -1)) { print "$_\n"; }' | sed -e 's/^$/<blank>/'
a

I wonder if there's a principled reason for the policy difference between 
trailing leading and trailing empty strings or whether it's an accident of 
historical implementation.

We'll have to decide one way or the other. If we're lucky, no one cares either 
way :) As for the rare differences between String.split and Perl, I'm not too 
concerned with what we do there. String.split doesn't appear to document its 
behavior for this case clearly, so if anything, I'd call its behavior a bug. 
But I wouldn't go to great lengths to behave like Perl if the problem lies in 
Pattern and Matcher themselves and thus affects Splitter. (I assume that it 
does.)

Here's a bonus difference between Perl and String.split:

  perl -e 'for (split(/(?<=b)|x/, "fobxx", -1)) { print "$_\n"; }' | sed -e 's/^$/<blank>/'
fob
<blank>
<blank>
<blank>

  System.out.println(Arrays.toString("fobxx".split("(?<=b)|x", -1)));
[fob, x, ]

Original comment by cpov...@google.com on 18 Jan 2013 at 9:54

GoogleCodeExporter commented 9 years ago

Here's a justification for Perl's "empty string always returns the empty list" 
behavior:
http://grokbase.com/t/perl/perl5-porters/0152aed0vc/docs-on-split#20010503im3myf
qpcr36yeb5ixjk5dmkly

My read on that is that the caller should replace...

next unless @f;

...with...

next unless $_;

...at which point the special case is no longer necessary. But perhaps I'm 
misinterpreting some Perl conversion rule.

Original comment by cpov...@google.com on 18 Jan 2013 at 9:57

GoogleCodeExporter commented 9 years ago

Issue 1262 has been merged into this issue.

Original comment by cpov...@google.com on 18 Jan 2013 at 11:04

GoogleCodeExporter commented 9 years ago

1) I just posted to a Perl list asking about the behavior of zero-width matches 
at the beginning and end of string:

http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2013-01/msg00473.html

2) I should look at Louis's CL, which was intended to fix the original issue 
report but was derailed by my additional questions. I know we decided that some 
of it was hopeless, but I'm hoping it wasn't *all* hopeless... really hoping....

Original comment by cpov...@google.com on 18 Jan 2013 at 11:06

GoogleCodeExporter commented 9 years ago

If splitting ",a" on /,/ yields ["", "a"] then splitting it on /(?=,)/ should 
yield ["", ",a"], should it not? If the separator pattern occurs N times in the 
input, the returned iterable should always be of size N+1.

Original comment by kevinb@google.com on 18 Jan 2013 at 11:37

GoogleCodeExporter commented 9 years ago

In a vacuum, I think I agree, though there is at least one minor downside: It's 
arguably nice to be able to split on // in order to extract its characters. On 
the other hand, we have Lists.charactersOf for this, and Lists.charactersOf 
actually gives you Characters, rather than Strings.

Splitting on zero-width matches is just kind of odd to begin with, and aside 
from //, I'm not familiar with the use cases. I wonder if some of them would be 
better served by high-level interfaces for "extract all matches of this regex" 
or similar.

Original comment by cpov...@google.com on 18 Jan 2013 at 11:51

GoogleCodeExporter commented 9 years ago

I went looking for split() implementations in RE2 (no split method?) and Chrome 
(stringProtoFuncSplit). Chrome cites the ECMAScript spec, which says to ignore 
both leading *and* trailing empty string created by zero-width matches:

"The value of separator may be an empty String, an empty regular expression, or 
a regular expression that can match an empty String. In this case, separator 
does not match the empty substring at the beginning or end of the input String, 
nor does it match the empty substring at the end of the previous separator 
match. (For example, if separator is the empty String, the String is split up 
into individual characters; the length of the result array equals the length of 
the String, and each substring contains one character.)"

ECMAScript also does the questionable thing in which splitting the empty string 
returns 0 elements (regardless of the value of |limit|, as far as I can tell), 
but we can look past that. At least it doesn't have magic trailing-empty 
behavior for negative |limit| :)

It's possible that the ECMAScript spec's detailed algorithm could be of help to 
us in our implementation. Then again, I thought I was already more or less 
happy with Louis's reimplementation, at least setting aside these weird issues 
that I don't think are worth worrying about anymore.

http://www.ecma-international.org/ecma-262/5.1/#sec-15.5.4.14

Finally, I thought a bit more about Kevin's comment #9. In one respect, we have 
four options:

1. ignore both empty leading and trailing strings created by zero-width matches
2. preserve both ...
3. ignore trailing ... but not leading
4. ignore leading ... but not trailing

(1) is what ECMAScript does. I think I prefer this, too.
(2) is what nothing does, at least from what I've seen (Perl+Java+ECMAScript).
(3) is what Perl does, but the inconsistency is difficult to explain. I assume 
it was more of an accidental choice or a result of implementation convenience.
(4) is what nothing does, and it's inconsistent.

(1) has a precedent, and it avoids a leading/trailing inconsistency. I say we 
do that.

Original comment by cpov...@google.com on 2 Feb 2013 at 8:41

GoogleCodeExporter commented 9 years ago

(I see an in-JavaScript implementation of String.split (StringSplit) in 
Chrome's v8 engine. I am too lazy to examine its behavior in detail (nor did I 
look at the C++ Chrome implementation beyond seeing that it had 
sentence-by-sentence comments from the spec), but it cites the same spec 
section as the C++ Chrome implementation. Additionally, I tested a few simple 
regular expressions in Chrome, and I see the behavior that the spec promises.)

Original comment by cpov...@google.com on 2 Feb 2013 at 8:46

GoogleCodeExporter commented 9 years ago

Not that it matters from a general perspective, but for a single data point 
from the original reporter...

My usage assumes (1) and it would definitely be preferred in that case. My 
specific use case is splitting a string, without dropping the delimiters - for 
example, splitting on arithmentic operators in a string like abc*(1+2) would 
return ["abc","*","(","+",")"].  The canonical (only?) way to achieve this kind 
of split which preserves all the characters using regex is apparently these 
zero width matches.

So there is a real-world use case as cpov wondered about above.  In this case, 
leading and trailing empty strings are useless (but I'm trimming whitespace and 
dropping empty strings anyway, so it isn't a huge burden if they appear).

A split of "" will always produce [""], even with (1) and zero-width matches, 
right?

Original comment by travis.d...@gmail.com on 2 Feb 2013 at 9:19

GoogleCodeExporter commented 9 years ago

Sorry, the correct split should be: 

["abc", "*", "(", "1", "+", "2", ")"]

dropping the #s was not intentional

Original comment by travis.d...@gmail.com on 2 Feb 2013 at 9:20

GoogleCodeExporter commented 9 years ago

Thanks for the use case. The example of preserving delimiters is a good reason 
for preferring (1) or (2) over (3) and (4):

$ perl -e 'for (split(/(?=,)|(?<=,)/, ",a,b,c,", -1)) { print "$_\n"; }' | sed 
-e 's/^$/<blank>/'
,
a
,
b
,
c
,
<blank>

Surely nobody wants that, right? Either they want empties at both ends or 
empties at neither. (Obviously the example is dumb as I've written it, but 
imagine that comma is replaced with comma/semicolon/colon/....)

As you point out, the difference between (1) and (2) goes away in the case of 
omitting empty strings, so maybe it's no big deal either way. In the absence of 
other use cases, I kind of like following the ECMAScript precedent, but I 
wouldn't be shocked if the leading and trailing empties were significant to 
someone.

> A split of "" will always produce [""], even with (1) and zero-width
> matches, right?

Yes. I think the way we would look at it, regardless of the (1)/(2)/(3)/(4) 
choice, is similar to what Kevin said above: If the separator occurs n times, 
we produce an output with n+1 items. The difference between the policies would 
be in whether we consider a separator to ever occur before the first character 
or after the last.

Now that you mention it, that's a fun question for the other policies. If we 
allow a zero-width match to create empty strings at the beginning but not the 
end, then I would say that we would likewise have to return a single string 
when asked to split "" on //, since that match would be occurring both at the 
beginning and at the end, and matches at the end are forbidden. The same would 
go for the reverse policy. This ensures the following pattern: If you split 
"foo" on //, you get 4 results. If you split "fo" on //, you get 3. "f" gives 
2, and "" gives 1. Now suppose that we permit zero-width matches at both 
beginning and end. "foo" gives 5 results, "fo" 4," "f" 3, and "" 2. I guess 
that that's reasonable, too. By contrast, if we permit zero-width matches at 
neither beginning nor end, we get "foo" 3, "fo" 2, "f" 1, "" 1. That's arguably 
"worse," but it's more consistent with how normal delimiters work: Splitting 
"f,o,o" on "," produces 3 results, "f,o" 2, "f" 1, and "" 1 again. That's maybe 
too much of an edge case to base a position on, but I'll offer it as support, 
anyway, since it agrees with the way I'm already leaning :)

Original comment by cpov...@google.com on 2 Feb 2013 at 9:48

GoogleCodeExporter commented 9 years ago

Python doesn't allow splitting on empty matches:
http://docs.python.org/2/library/re.html#re.split

For example:
python -c "import re; print re.split(r'$', 'foo')"

Original comment by cpov...@google.com on 12 Feb 2013 at 1:46

GoogleCodeExporter commented 9 years ago

I just got a very helpful response on the Perl list:
http://www.xray.mpe.mpg.de/mailing-lists/perl5-porters/2013-03/msg00390.html

Original comment by cpov...@google.com on 11 Mar 2013 at 2:38

GoogleCodeExporter commented 9 years ago

If we follow the ECMAScript implementation, we should keep in mind that, 
according to a post I just saw:

TIL:
  Javascript: '1 2 3'.split(/ /, 2) = ['1', '2']
  Java: "1 2 3".split(" ", 2) = ["1", "2 3"]

Original comment by cpov...@google.com on 13 May 2014 at 7:15

GoogleCodeExporter commented 9 years ago

This issue has been migrated to GitHub.

It can be found at https://github.com/google/guava/issues/<id>

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:13

Added labels: MigratedToGitHub

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 1 Nov 2014 at 4:18

Changed state: Migrated

GoogleCodeExporter commented 9 years ago

Original comment by cgdecker@google.com on 3 Nov 2014 at 9:08

Added labels: Restrict-AddIssueComment-Commit

dnrajugade / guava-libraries

Splitter.onPattern drops last token when zero width matches are used #1190