MichaelChirico / r-bugs

A ⚠️read-only⚠️mirror of https://bugs.r-project.org/

[BUGZILLA #16745] strsplit(perl=TRUE, pattern="[[:<:]]", ...) gives wrong result #6122

Open MichaelChirico opened 4 years ago

MichaelChirico commented 4 years ago

The perl regex "[[:<:]]" makes a zero-length match at the beginning of a word ("[[:>:]]" means end-of-word). It acts properly in gregexpr but not in strsplit:

gregexpr("[[:<:]]", "One, two; three!", perl=TRUE)[[1]]

[1] 1 6 11

attr(,"match.length")

[1] 0 0 0

attr(,"useBytes")

[1] TRUE

strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]

# [1] "O" "n" "e" ", " "t" "w" "o" "; " "t" "h" "r" "e" "e" "!"

# Expect c("One, ", "two; ", "three!"), breaks before chars 1, 6, and 11

strsplit does act as expected for the zero-length look-ahead pattern "[[:>:]]":

gregexpr("[[:>:]]", "One, two; three!", perl=TRUE)[[1]]

[1] 4 9 16

attr(,"match.length")

[1] 0 0 0

attr(,"useBytes")

[1] TRUE

strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "One" ", two" "; three" "!"

Not all zero-length look-behind patterns show this problem. E.g.,

strsplit(split="(?<=[[:punct:]])", "One, two; three!", perl=TRUE)[[1]]

[1] "One," " two;" " three!"

It may be that strsplit is not using the startoffset argument to pcre_exec (see pcre/pcre/doc/html/pcreapi.html):

"A non-zero starting offset is useful when searching for another match in the same subject by calling pcre_exec() again after a previous success. Setting startoffset differs from just passing over a shortened string and setting PCRE_NOTBOL in the case of a pattern that begins with any kind of lookbehind."

or it could be something else.
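
The hypothesis can be illustrated with a small R-level sketch. The helper `split_by_shortening` below is hypothetical and is not R's actual implementation. It assumes that strsplit(perl=TRUE) searches a shortened copy of the string after each piece, and that an empty match at the very start of the remainder makes it emit a single character. Under those assumptions the loop gives the same pieces as the strsplit results reported above.

```r
# Hypothetical emulation: search a shortened copy of the string each time,
# so the regex engine never sees the characters already consumed.
split_by_shortening <- function(x, pattern) {
  out <- character(0)
  while (nzchar(x)) {
    m <- regexpr(pattern, x, perl = TRUE)
    start <- as.integer(m)                 # 1-based start of the match, or -1
    if (start == -1L) {                    # no further match: keep the rest
      out <- c(out, x)
      break
    }
    match_end <- start + attr(m, "match.length") - 1L
    if (match_end >= 1L) {
      # the match ends past the start: emit the text before it
      out <- c(out, substr(x, 1L, start - 1L))
      x <- substr(x, match_end + 1L, nchar(x))
    } else {
      # empty match at the very start: emit one character (assumed behaviour)
      out <- c(out, substr(x, 1L, 1L))
      x <- substr(x, 2L, nchar(x))
    }
  }
  out
}

split_by_shortening("One, two; three!", "[[:<:]]")
split_by_shortening("One, two; three!", "[[:>:]]")
split_by_shortening("One, two; three!", "(?<=[[:punct:]])")
```

After "O" is removed, the remainder "ne, two; three!" starts with a word character that has nothing before it, so "[[:<:]]" matches at its start again. That is the effect the pcreapi passage warns about when a shortened string is passed instead of a start offset.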



MichaelChirico commented 4 years ago

I noted that: Not all zero-length look-behind patterns show this problem. E.g.,

strsplit(split="(?<=[[:punct:]])", "One, two; three!", perl=TRUE)[[1]]

[1] "One," " two;" " three!"

However, if I expand that pattern to include the zero-length match at the beginning of the string, the problem appears again:

strsplit(split="(?<=[[:punct:]])|^", "One, two; three!", perl=TRUE)[[1]]

[1] "O" "n" "e" "," " " "t" "w" "o" ";" " " "t" "h" "r" "e" "e" "!"
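
One way to see why adding "^" brings the problem back, assuming strsplit searches the shortened remainder after each piece: "^" matches at the start of every remainder, so each search returns an empty match at position 1. The remainders below are written out by hand for illustration only.

```r
pattern <- "(?<=[[:punct:]])|^"

# Hand-written remainders after removing zero, one and two leading characters
remainders <- c("One, two; three!", "ne, two; three!", "e, two; three!")

m <- regexpr(pattern, remainders, perl = TRUE)
as.integer(m)             # 1 1 1: a match at the start of every remainder
attr(m, "match.length")   # 0 0 0: each of those matches is empty
```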



MichaelChirico commented 4 years ago

Created attachment 2036 [details] Patch to change how strsplit(perl=TRUE) works with zero length matches

Indeed, strsplit(perl = TRUE) doesn't use the start offset. Dealing with zero length matches looks quite tricky, and it is not clear to me what the "proper" behavior is. Anyway, here is a quick, poorly tested patch that appears to work almost as expected by the original poster.

I emphasize that the patch was quite a quick job. User beware.

strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "" "One, " "two; " "three!"

strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "One" ", two" "; three" "!"

Tested on Linux, R-devel revision 70276 (PCRE 8.38).
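
For comparison, similar output can be obtained at the R level without any patch by taking the match positions from gregexpr, which the original report shows handling these patterns correctly, and cutting the string with substring. This is only a sketch: `split_at_matches` is a hypothetical helper and not the attached patch. It assumes a single input string and does not try to drop a trailing empty piece the way strsplit does.

```r
# Hypothetical helper: split one string at the matches reported by gregexpr,
# which searches the full string and so sees the text before each match.
split_at_matches <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  starts <- as.integer(m)
  if (starts[1] == -1L) return(x)          # no match: nothing to split
  lens <- attr(m, "match.length")
  piece_start <- c(1L, starts + lens)      # each piece begins after a match
  piece_end <- c(starts - 1L, nchar(x))    # and ends just before the next one
  substring(x, piece_start, piece_end)
}

split_at_matches("One, two; three!", "[[:<:]]")
split_at_matches("One, two; three!", "[[:>:]]")
```

With the gregexpr positions reported at the top of this thread (1, 6, 11 and 4, 9, 16), the first call gives "" "One, " "two; " "three!" and the second gives "One" ", two" "; three" "!".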




MichaelChirico commented 4 years ago

Created attachment 2038 [details] Updated patch

Here is another version of the patch with some problems fixed, maybe others introduced... Example output follows.

Original examples:

strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "" "One, " "two; " "three!"

strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]

[1] "One" ", two" "; three" "!"

New examples:

strsplit(split="[[:<:]]|t", "One, two; three!", perl=TRUE)[[1]]

[1] "" "One, " "" "wo; " "" "hree!"

strsplit(split="[[:>:]]|t", "One, two; three!", perl=TRUE)[[1]]

[1] "One" ", " "wo" "; " "hree" "!"

Also, with split pattern "^", the output is quite different from the output without the patch.

Current implementation:

strsplit("Foo", "^", perl=TRUE)[[1]]

[1] "F" "o" "o"

Patched version:

strsplit("Foo", "^", perl=TRUE)[[1]]

[1] "" "Foo"


