MichaelChirico commented 4 years ago

The perl regex "[[:<:]]" makes a zero-length match at the beginning of a word ("[[:>:]]" means end-of-word). It behaves properly in gregexpr but not in strsplit:

gregexpr("[[:<:]]", "One, two; three!", perl=TRUE)[[1]]

[1] 1 6 11

attr(,"match.length")

[1] 0 0 0

attr(,"useBytes")

[1] TRUE

strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "O" "n" "e" ", " "t" "w" "o" "; " "t" "h" "r" "e" "e" "!"

# Expect c("One, ", "two; ", "three!"), breaks before chars 1, 6, and 11
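Until strsplit handles this case, the expected result can be rebuilt from the gregexpr match positions, which are reported correctly above. This is only a minimal sketch: split_before is a hypothetical helper name (not part of base R), it assumes a single input string, and it splits before each match without removing any matched text.

```r
# Hypothetical helper (not part of base R): split one string immediately
# before every position where `pattern` matches, using the match positions
# reported by gregexpr() instead of relying on strsplit().
split_before <- function(x, pattern) {
  stopifnot(is.character(x), length(x) == 1L)
  starts <- as.integer(gregexpr(pattern, x, perl = TRUE)[[1]])
  if (starts[1] == -1L) return(x)          # no match: nothing to split
  starts <- unique(c(1L, starts))          # the first piece always starts at char 1
  ends <- c(starts[-1] - 1L, nchar(x))     # each piece ends just before the next start
  substring(x, starts, ends)
}

split_before("One, two; three!", "[[:<:]]")
# [1] "One, " "two; " "three!"
```

Because the positions come from a single gregexpr call over the whole string, each match is evaluated with its real left-hand context.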

strsplit does act as expected for the zero-length look-ahead pattern "[[:>:]]":

gregexpr("[[:>:]]", "One, two; three!", perl=TRUE)[[1]]

[1] 4 9 16

attr(,"match.length")

[1] 0 0 0

attr(,"useBytes")

[1] TRUE

strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "One" ", two" "; three" "!"

Not all zero-length look-behind patterns show this problem. E.g.,

strsplit(split="(?<=[[:punct:]])", "One, two; three!", perl=TRUE)[[1]]

[1] "One," " two;" " three!"

It may be that strsplit is not using the startoffset argument to pcre_exec. From pcre/pcre/doc/html/pcreapi.html:

"A non-zero starting offset is useful when searching for another match in the same subject by calling pcre_exec() again after a previous success. Setting startoffset differs from just passing over a shortened string and setting PCRE_NOTBOL in the case of a pattern that begins with any kind of lookbehind."

Or it could be something else.
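To illustrate the hypothesis, here is a simplified R model of a splitter that re-runs the search on a shortened string instead of passing a start offset. It is a sketch under that assumption, not R's actual C implementation: naive_split is a hypothetical name, and emitting a single character when the match is empty at the start of the remainder is an assumed rule, chosen so that the loop always makes progress.

```r
# Hypothetical model (not R's real code): split by searching the *remaining*
# substring each time, so the pattern never sees the text already consumed.
naive_split <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)
    if (start == -1L) {                      # no further match: keep the rest
      out <- c(out, x)
      break
    }
    end <- start + attr(m, "match.length") - 1L  # index of last matched char
    if (end > 0L) {
      # Match ends after the first position: emit the text before it,
      # then drop that text and the match itself.
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, end + 1L, nchar(x))
    } else {
      # Empty match at the very start: emit one character (assumed rule).
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

naive_split("One, two; three!", "[[:<:]]")
# [1] "O"  "n"  "e"  ", " "t"  "w"  "o"  "; " "t"  "h"  "r"  "e"  "e"  "!"
naive_split("One, two; three!", "[[:>:]]")
# [1] "One"     ", two"   "; three" "!"
```

In this model every shortened remainder that begins with a letter looks like the start of a word, so "[[:<:]]" matches at its first position each time. That reproduces the character-by-character output reported above, while "[[:>:]]" is unaffected.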

METADATA

Bug author - Bill Dunlap
Creation time - 2016-03-03 03:05:24 UTC
Bugzilla link
Status - UNCONFIRMED
Alias - None
Component - Misc
Version - R 3.2.3
Hardware - x86_64/x64/amd64 (64-bit) Windows 64-bit
Importance - P5 normal
Assignee - R-core
URL -
Modification time - 2016-03-08 16:17 UTC

MichaelChirico commented 4 years ago

I noted that: Not all zero-length look-behind patterns show this problem. E.g.,

strsplit(split="(?<=[[:punct:]])", "One, two; three!", perl=TRUE)[[1]]

[1] "One," " two;" " three!"

However, if I expand that pattern to include the zero-length match at the beginning of the string the problem appears again:

strsplit(split="(?<=[[:punct:]])|^", "One, two; three!", perl=TRUE)[[1]]

[1] "O" "n" "e" "," " " "t" "w" "o" ";" " " "t" "h" "r" "e" "e" "!"

METADATA

Comment author - Bill Dunlap
Timestamp - 2016-03-03 04:08:23 UTC

MichaelChirico commented 4 years ago

Created attachment 2036: Patch to change how strsplit(perl=TRUE) works with zero length matches

Indeed, strsplit(perl = TRUE) doesn't use the start offset. Dealing with zero-length matches looks quite tricky, and it is not clear to me what the "proper" behavior is. Anyway, here is a quick, poorly tested patch that works almost as the original poster expected; the difference is an extra leading "" for the match at position 1.

I emphasize that the patch was quite a quick job. User beware.

strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "" "One, " "two; " "three!"

strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "One" ", two" "; three" "!"

Tested on Linux, R-devel revision 70276 (PCRE 8.38).

METADATA

Comment author - Mikko Korpela
Timestamp - 2016-03-04 17:07:26 UTC

INCLUDED PATCH

ID - 6
Author - Mikko Korpela
Link to download patch - https://bugs.r-project.org/bugzilla/attachment.cgi?id=2036
Timestamp - 2016-03-04 17:07 UTC
Extra info - (2.90 KB, patch)

MichaelChirico commented 4 years ago

Created attachment 2038: Updated patch

Here is another version of the patch with some problems fixed, maybe others introduced... Example output follows.

Original examples:

strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "" "One, " "two; " "three!"

strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "One" ", two" "; three" "!"

New examples:

strsplit(split="[[:<:]]|t", "One, two; three!", perl=TRUE)[[1]]

[1] "" "One, " "" "wo; " "" "hree!"

strsplit(split="[[:>:]]|t", "One, two; three!", perl=TRUE)[[1]]

[1] "One" ", " "wo" "; " "hree" "!"

Also, with split pattern "^", the patched output differs considerably from the unpatched output.

Current implementation:

strsplit("Foo", "^", perl=TRUE)[[1]]

[1] "F" "o" "o"

Patched version:

strsplit("Foo", "^", perl=TRUE)[[1]]

[1] "" "Foo"

METADATA

Comment author - Mikko Korpela
Timestamp - 2016-03-08 16:17:39 UTC

INCLUDED PATCH

ID - 8
Author - Mikko Korpela
Link to download patch - https://bugs.r-project.org/bugzilla/attachment.cgi?id=2038
Timestamp - 2016-03-08 16:17 UTC
Extra info - (2.97 KB, patch)

MichaelChirico / r-bugs

[BUGZILLA #16745] strsplit(perl=TRUE, pattern="[[:<:]]", ...) gives wrong result #6122
