Open MichaelChirico opened 4 years ago
I noted that: Not all zero-length look-behind patterns show this problem. E.g.,
strsplit(split="(?<=[[:punct:]])", "One, two; three!", perl=TRUE)[[1]]
However, if I expand that pattern to include the zero-length match at the beginning of the string the problem appears again:
strsplit(split="(?<=[[:punct:]])|^", "One, two; three!", perl=TRUE)[[1]]
Created attachment 2036: Patch to change how strsplit(perl=TRUE) works with zero-length matches
Indeed, strsplit(perl = TRUE) doesn't use the start offset. Dealing with zero length matches looks quite tricky, and it is not clear to me what the "proper" behavior is. Anyway, here is a quick, poorly tested patch that appears to work almost as expected by the original poster.
I emphasize that the patch was quite a quick job. User beware.
strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
[1] "" "One, " "two; " "three!"
strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]
[1] "One" ", two" "; three" "!"
Tested on Linux, R-devel revision 70276 (PCRE 8.38).
Created attachment 2038: Updated patch
Here is another version of the patch with some problems fixed, maybe others introduced... Example output follows.
Original examples:
strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
[1] "" "One, " "two; " "three!"
strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]
[1] "One" ", two" "; three" "!"
New examples:
strsplit(split="[[:<:]]|t", "One, two; three!", perl=TRUE)[[1]]
[1] "" "One, " "" "wo; " "" "hree!"
strsplit(split="[[:>:]]|t", "One, two; three!", perl=TRUE)[[1]]
[1] "One" ", " "wo" "; " "hree" "!"
Also, with the split pattern "^", the output is quite different from the output without the patch.
Current implementation:
strsplit("Foo", "^", perl=TRUE)[[1]]
[1] "F" "o" "o"
Patched version:
strsplit("Foo", "^", perl=TRUE)[[1]]
[1] "" "Foo"
The perl regex "[[:<:]]" makes a zero-length match at the beginning of a word ("[[:>:]]" means end-of-word). It acts properly in gregexpr but not in strsplit:
gregexpr("[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
[1] 1 6 11
attr(,"match.length")
[1] 0 0 0
attr(,"useBytes")
[1] TRUE
strsplit(split="[[:<:]]", "One, two; three!", perl=TRUE)[[1]]
# [1] "O" "n" "e" ", " "t" "w" "o" "; " "t" "h" "r" "e" "e" "!"
# Expect c("One, ", "two; ", "three!"), breaks before chars 1, 6, and 11
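The expected result can be rebuilt directly from the positions that gregexpr reports. The following is only a minimal sketch: split_before() is a hypothetical helper, not part of base R and not the proposed fix, that cuts the string before each reported match position.

```r
# Hypothetical helper, not part of base R: cut `x` before every position
# that gregexpr() reports for `pattern` (intended for zero-length patterns).
split_before <- function(x, pattern) {
  m <- gregexpr(pattern, x, perl = TRUE)[[1]]
  if (m[1] == -1L) return(x)              # no match: nothing to split
  pos <- as.integer(m)
  # Pieces start at position 1 and at every match position inside the string
  starts <- unique(c(1L, pos[pos > 1L & pos <= nchar(x)]))
  ends <- c(starts[-1] - 1L, nchar(x))
  substring(x, starts, ends)
}

split_before("One, two; three!", "[[:<:]]")
# c("One, ", "two; ", "three!")
split_before("One, two; three!", "[[:>:]]")
# c("One", ", two", "; three", "!")
```

Because the positions come from a single search over the full string, word-boundary context is never lost between pieces.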
strsplit does act as expected for the zero-length look-ahead pattern "[[:>:]]":
gregexpr("[[:>:]]", "One, two; three!", perl=TRUE)[[1]]
[1] 4 9 16
attr(,"match.length")
[1] 0 0 0
attr(,"useBytes")
[1] TRUE
strsplit(split="[[:>:]]", "One, two; three!", perl=TRUE)[[1]]
[1] "One" ", two" "; three" "!"
Not all zero-length look-behind patterns show this problem. E.g.,
strsplit(split="(?<=[[:punct:]])", "One, two; three!", perl=TRUE)[[1]]
[1] "One," " two;" " three!"
It may be that strsplit is not using the startoffset argument to pcre_exec. From pcre/pcre/doc/html/pcreapi.html:
A non-zero starting offset is useful when searching for another match in the same subject by calling pcre_exec() again after a previous success. Setting startoffset differs from just passing over a shortened string and setting PCRE_NOTBOL in the case of a pattern that begins with any kind of lookbehind.
Or it could be something else.
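To illustrate the hypothesis, here is a rough R emulation of a splitter that searches a progressively shortened copy of the string instead of passing a start offset. naive_split() is a hypothetical name, and the loop is an assumption about the behaviour, not a transcription of R's C code.

```r
# Rough emulation (an assumption, not R's actual C code) of a splitter that
# re-searches a shortened copy of the string instead of using a start offset.
naive_split <- function(x, pattern) {
  out <- character(0)
  while (nchar(x) > 0) {
    m <- regexpr(pattern, x, perl = TRUE)
    pos <- as.integer(m)
    len <- attr(m, "match.length")
    if (pos == -1L) {                  # no further match: keep the rest
      out <- c(out, x)
      break
    }
    if (pos == 1L && len == 0L) {      # empty match at the start: emit one character
      out <- c(out, substring(x, 1, 1))
      x <- substring(x, 2)
    } else {                           # emit the text before the match, drop the match
      out <- c(out, substring(x, 1, pos - 1L))
      x <- substring(x, pos + len)
    }
  }
  out
}

naive_split("Foo", "^")
# c("F", "o", "o")
naive_split("One, two; three!", "[[:<:]]")
# c("O", "n", "e", ", ", "t", "w", "o", "; ", "t", "h", "r", "e", "e", "!")
```

Each shortened copy begins at a fresh "start of string", so "^" and "[[:<:]]" match again at its first character. That reproduces the character-by-character output reported above, and it also reproduces the correct-looking result for "[[:>:]]".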