babel / babel

🐠 Babel is a compiler for writing next generation JavaScript.
https://babel.dev
MIT License
42.99k stars 5.59k forks source link

Support `y` sticky flag for regular expressions #904

Closed getify closed 9 years ago

getify commented 9 years ago

I did a quick glance at open and past-closed issues and didn't find this mentioned (sorry if I missed it)... would be great to know the plans for support of:

var re = /o+/y;

AFAICT, the sticky flag y makes the regular expression remember the last match's end and starts from there on subsequent matches (similar to the g flag with exec(..)).

FYI: I find the MDN page for this to be particularly unhelpful in explaining it, as the example cited there is basically the same as a g + exec(..) example.

sebmck commented 9 years ago

Hm, not sure how this would be implemented. It'd have to be used in conjunction with a polyfill/stdlib like core-js that would shim in the regex behaviour to the regex instance methods and Babel would then need a way to let it know that the regex is sticky. /cc @zloirock

Also yeah, your interpretation of sticky regexes is correct. MDN is a bit vague sometimes and not very user friendly.

developit commented 9 years ago

XRegExp has an implementation: http://xregexp.com/flags/ https://github.com/slevithan/xregexp/blob/master/src/xregexp.js#L60 (though it is not a polyfill)

sebmck commented 9 years ago

@developit Their sticky regex detection is bad and will never correctly work.

sebmck commented 9 years ago

Could probably just do something like:

var foo = /foo/y;
var _regex;
var foo = _regex = /foo/, (_regex.sticky = true), _regex; // might have to be `_sticky` or something, not sure how it'll play across environments

and then leave it up to core-js to deal with. What do you think @zloirock, worth it? Might be tricky to get it to work consistently across all the different scenarios.

neVERberleRfellerER commented 9 years ago

If efficiency is sacrificiable, we can do some small (mabe not so small) evil:

var _reExec = (function () {
  var _exec = RegExp.prototype.exec;
  return function exec(str) {
    str = str.substr(this.lastIndex || 0);
    var match = _exec.call(this, str);
    if (!match) return match;
    this.lastIndex += str.indexOf(match[0]) + match[0].length;
    return match;
  };
})();

var regex = (function () {
  var re = /(\S+) line\n?/; // remove y from modifiers
  re.exec = _reExec;
  return re;
})();
sebmck commented 9 years ago

@neVERberleRfellerER You can pass a regex to "".split, "".match etc so that only covers one use-case.

getify commented 9 years ago

FYI: the spec says that split(..) in particular ignores the sticky flag.

neVERberleRfellerER commented 9 years ago

@sebmck Indeed, there is always a catch. And it did not even crossed my mind. I should get a break. However, I do not get how split for example could work with this. Update split solved. Thank you, @getify

sebmck commented 9 years ago

@getify Ah great, haven't had a need to read that part yet. Still need to take into consideration match though which makes things kinda tricky :worried:

sebmck commented 9 years ago

@neVERberleRfellerER Neither do I, was just throwing around methods that can accept regexes.

getify commented 9 years ago

I'm not positive if my interpretations here are entirely correct, but I wrote up some details about how I think sticky works:

https://github.com/getify/You-Dont-Know-JS/blob/master/es6%20&%20beyond/ch2.md#sticky-flag

Help in vetting that info would be appreciated, as you explore this possible fix for Babel.

neVERberleRfellerER commented 9 years ago

I have just tested this in io.js (according to MDN V8 supports the 'y' flag when turned on by flag --- mindblown --- without comments) and Firefox (supports 'y', but has bugged '^' with it, does not matter here), both have same results:

var re = /foo/y;
var str = 'foofoofoo';

re.exec(str);
[ 'foo', index: 0, input: 'foofoofoo' ]

re.exec(str);
[ 'foo', index: 3, input: 'foofoofoo' ]

re.exec(str);
[ 'foo', index: 6, input: 'foofoofoo' ]

re.exec(str);
null

Neither Firefox nor io.js work with samples on @getify s page. I do not know why, but I am still testing.

Update: To make the first search work, I had to set lastIndex to 1 before first use. And even then it does not work with string methods. Only with RegExp methods.

var re = /o+./y;
var str = "foot book more";
re.lastIndex = 1;

re.exec(str);
> [ 'oot', index: 1, input: 'foot book more' ]

To make the next search working, i have to set last index again - to the index of next expected match,

Update 2: According to description on MDN this behavior is correct.

getify commented 9 years ago

@neVERberleRfellerER

I don't trust the MDN page at all. I also don't trust V8's implementation.

Also, I am not particularly interested in whether exec(..) works or not, because we already know it works with the g global flag.

What's interesting is whether it works with all the other regex-using methods. The spec says that all of them should funnel through the internal exec algorithm, so they should all work (as I was depicting -- I think). But I don't trust that engines are actually sharing that code.

neVERberleRfellerER commented 9 years ago

@getify It works with every RegExp method, not only exec. It does not work with string.match. This probably comes from the fact, that after intercepting RegExp methods, it does not call any of them (the only interesting method is RegExps match - this is the only one that should be really called) and works as if nothing have changed in both FF and V8.

neVERberleRfellerER commented 9 years ago

@getify I believe that implementations are correct with the behavior decribed in my comment with two updates.

15. Repeat, while matchSucceeded is false
    ...
    b. Let r be matcher(S, lastIndex).
    c. If r is failure, then
        i.  If sticky is true, then
             1. Let setStatus be Set(R, "lastIndex", 0, true)
             2. ReturnIfAbrupt(setStatus).
             3. Return null.
zloirock commented 9 years ago

I thought about it. Consider proposals. Please, open issue in core-js repo.

@developit XRegExp uses native in FF, but not adds implementation.

XRegExp('.', 'y')
SyntaxError: invalid regular expression flag y

@sebmck be better at first select a way. Maybe replace it to constructor call, /foo/y -> RegExp('foo', 'y'), be better.

sebmck commented 9 years ago

@zloirock Ah yeah, that's a much better way actually.

getify commented 9 years ago

@neVERberleRfellerER

I apologize if I'm dense and missing something, but my reading of the ES6 spec draft does not match (pun intended) your interpretations, if I understand them. Let me observe some things about how I'm reading it:

  1. 21.1.3.11: String.prototype.match(..) looks for, and finding it, delegates to the @@match on the regex.
  2. 21.2.5.6: RegExp.prototype[@@match] first looks for if the expression is global, and if not (21.2.5.6.7), it delegates to the abstract operation RegExpExec.
  3. 21.2.5.2: RegExpExec calls exec(..) on the regular expression, then delegates to RegExpBuiltinExec to process the results.
  4. 21.2.5.2.2: RegExpBuiltinExec of course is our main exec abstract operation, which checks (21.2.5.2.2.8) sticky and, skipping through your quoted step 15 because matchSucceeded gets set to true, proceeds to use sticky (21.2.5.2.2.18), which updates lastIndex. Rinse, repeat.

What am I missing here? I sure read that to be that String.prototype.match funnels into the sticky-respecting exec(..). No?

[Edit]: Just remembered I had, on a different topic entirely, recently asked this question of @allenwb, and he'd confirmed that String#match(..) does in fact delegate to exec(..).

sebmck commented 9 years ago

Added the sticky regex desugaring suggested by @zloirock so it can potentially be delegated to core-js etc. ie.

var foo = /bar/y;

turns into

var foo = new RegExp("bar", "y");
neVERberleRfellerER commented 9 years ago

@getify

What am I missing here?

Step 21.2.5.2.2.14 probably? matchSucceeded is set to false and following steps are as I have said.

String.prototype.match funnels into the sticky-respecting exec(..)

As stated in 21.1.3.11 String.prototype.match uses @ @ match which is defined in 6.1.5.1 as

A regular expression method that matches the regular expression against a string.

So yes, you are right with this. However:

Let matcher be GetMethod(regexp, @ @ match) // from 21.1.3.11

is not correct in FF and V8 (warning - source: own research).

Update Sorry for tagging user match.

Update 2 Everything was updated. Sorry for possible misunderstanding and confusion.

Update 3 For example, this works in both FF and V8

> 'kool'.match({ source: 'o+.' })
[ 'o', index: 1, input: 'kool' ]
developit commented 9 years ago

Swapping the literal for a constructor looks perfect, wish I'd thought of that.

getify commented 9 years ago

@neVERberleRfellerER

is not correct in FF and V8 (warning - source: own research).

Perhaps I've misunderstood even your premise of discussion. Is your position:

neVERberleRfellerER commented 9 years ago

@getify

Or... that I'm correct that by spec such delegation should ocur, but that FF and v8 currently do not and that they're wrong as a result?

This.

However, the most important part of discussion is the behavior of sticky regexpes (for example when using exec directly). My stance and review of discussed problem (to make sure that we are discussing the same thing):

Implementations are corrent, when they return null and set lastIndex to 0 when match does not coccur exactly at lastIndex. This is in conflict with your samples (https://github.com/getify/You-Dont-Know-JS/blob/master/es6%20&%20beyond/ch2.md#sticky-flag).

Your stance:

RegExpBuiltinExec of course is our main exec abstract operation, which checks (21.2.5.2.2.8) sticky and, skipping through your quoted step 15 because matchSucceeded gets set to true, proceeds to use sticky (21.2.5.2.2.18), which updates lastIndex. Rinse, repeat.

But matchSucceeded is set to false in step 14 and step 15 is evaluated. Then, if machSucceeded is set, it will get to step 18 (through step 16 and eventually 17). When match in step 15 results in failure, step 15.c.i is taken which results in null returned and lastIndex set to zero. Which is consistent with behvior of both tested implementations (FF, V8),

getify commented 9 years ago

@neVERberleRfellerER

I'm glad I clarified, because I thought you were saying something entirely different and I was arguing that diff point. :)

This is in conflict with your samples

Wait, how is it in conflict with my examples? Both my g+exec(..) and my y examples show the last failed match returning lastIndex to 0, as I quote myself:

...
re.exec( str );     // null -- no more matches!
re.lastIndex;       // 0 -- starts over now!

and

...
re.test( str );     // false -- no more matches!
re.lastIndex;       // 0 -- starts over now!

Isn't that not indeed in agreement both with you and the spec?

[Edit]: All the stuff in your previous message, where you quote "my stance"... none of that was arguing against lastIndex === 0 at the end of all matches, it was arguing FOR sticky when there's still matches left. :)

neVERberleRfellerER commented 9 years ago

@getify Problem is in this (taken from 'y' samples)

var re = /o+./y,    // <-- look, `y` now!
    str = "foot book more";

str.match( re );    // ["oot"]
re.lastIndex;       // 4

When match does not occur at lastIndex, it should set lastIndex to 0 (it already is in this case) and return null. So, I expect the following output:

var re = /o+./y,    // <-- look, `y` now!
    str = "foot book more";

str.match( re );    // null
re.lastIndex;       // 0

Because 15.c.i is taken in this case.

Update reaction to [edit]ed part:

No, but it was (I think at last) arguing against lastIndex === 0 and null result on first match call.

getify commented 9 years ago

@neVERberleRfellerER

Ahh, further clarification of which example you were concerned about! :)

When match does not occur at lastIndex, it should set lastIndex to 0 (it already is in this case) and return null. So, I expect the following output:

I don't interpret that at all. Let me explain. From 25.2.5.2.2, we get:

...
4. Let lastIndex be ToLength(Get(R,"lastIndex")). <-- `lastIndex` is initially `0` here, right?
...
14. Let matchSucceeded be false.
15. Repeat, while matchSucceeded is false
     a. If lastIndex > length, then               <-- nope
     b. Let r be matcher(S, lastIndex).           <-- should succeed, right? `o+.` : `"oot"`
     c. If r is failure, then                     <-- nope, `r` succeeded, right?
     d. else
          ...
          ii. Set matchSucceeded to true.
16. Let e be r's endIndex value.
...
18. If global is true or sticky is true,
     a. Let setStatus be Set(R, "lastIndex", e, true).

So I guess where we're differing is 15.b... i see it succeeding, since at that point lastIndex is 0, and it can indeed find "oot" (at position 1).

What am I missing?

neVERberleRfellerER commented 9 years ago

@getify

So I guess where we're differing is 15.b

Indeed. This is the only difference. It should not succeed and the reason why non-sticky regexps succeed is, that they start again from next position, since 15.c.ii is taken:

Let lastIndex = lastIndex+1.

Update Some samples (Update 2 warning: untested)

var str = "foot book more";

// this should work
var re = /.*?o+./y;
re.lastIndex = 0;
re.exec(str); // ['foot']

//this should work
var re = /.*?o+./y;
re.lastIndex = 1;
re.exec(str); // ['oot']

// this should not work
var re = /o+./y;
re.lastIndex = 0;
re.exec(str); // null

// this should work
var re = /o+./y;
re.lastIndex = 1;
re.exec(str); // ['oot']
getify commented 9 years ago

@neVERberleRfellerER

It should not succeed

I still don't understand why? matcher is the internal regexp engine [[RegExpMatcher]], and it's called the very first time through, in 15.b, essentially as matcher("foot book more",0).

My understanding is that a regular expression engine does its own internal forward and back-tracking stuff looking for suitable matches, so even though it doesn't match o+. at position 0, it does internally advance to position 1, and indeed find a sufficient match there, which is the "oot", and it returns that match result as r, which would be success.

I don't see that the looping (and incrementing of lastIndex) semantics implied in step (15) are performing the actual searching in the string for the first/next match.

Am I fundamentally misunderstanding how the internal regex matcher engine works?

neVERberleRfellerER commented 9 years ago

According to 21.2.3.2.2 (RegExpInitialize):

  1. Set obj’s [[RegExpMatcher]] internal slot to the internal procedure that evaluates the above parse of P by applying the semantics provided in 21.2.2

According to 21.2.2.2 (Pattern):

NOTE: A Pattern evaluates ("compiles") to an internal procedure value. RegExp.prototype.exec and other methods can then apply this procedure to a String and an offset within the String to determine whether the pattern would match starting at exactly that offset within the String

According to 21.2.5.2, the exec uses RegExpBuiltinExec only (21.2.5.2 point 7), and thus I believe, that the internal procedure mentioned in 21.2.2.2 is [[RegExpMatcher]].

Since the internal procedure from 21.2.2.2 (Pattern) matches exactly at specified offset, the 21.2.5.2.2 point 15.b should indeed fail.

Warning: this is a minefield.

getify commented 9 years ago

So if I'm getting an understanding of the intent of "sticky" (I guess I haven't yet, to this point), it takes a pattern like /foo/ and implies that for a match to be made, the "f" must be at ONLY the current value of lastIndex (0 by default, or whatever it's set to later), and no further forward (or backward) in the input string.

Is that correct?

That seems to imply that "sticky" is useless for repeatedly stepping through matches (unless they're always entirely contiguous/adjacent), as is typically done with g + exec(..).

Because wouldn't this then fail?

var re = /h./y, str = "hello, how are you?"

re.lastIndex;       // 0
re.exec( str );     // ["he"]

re.lastIndex;       // 2
re.exec( str );     // null

If that's how it should work, I'm completely lost on the use cases of it? Also every example (MDN, etc) I've found implies that it's closely related to the repeated g + exec(..) usage, but it clearly is not the same.

allenwb commented 9 years ago

First, NOTEs are non-normative, which means that they aren't supposed to add anything (except clarification) that isn't in the normative text.

The matcher "internal procedure"produced by 21.2.2.2 matches at the offset that is passed to it via its 'index parameter. a different index can be supplied each time it is called at 15.b in 21.2.5.22 but each such call matches only at the index supplied in the call.

allenwb commented 9 years ago

@getify

So if I'm getting an understanding of the intent of "sticky" (I guess I haven't yet, to this point), it takes a pattern like /foo/ and implies that for a match to be made, the "f" must be at ONLY the current value of lastIndex (0 by default, or whatever it's set to later), and no further forward (or backward) in the input string.

Is that correct?

yes

If that's how it should work, I'm completely lost on the use cases of it? Also every example (MDN, etc) I've found implies that it's closely related to the repeated g + exec(..) usage, but it clearly is not the same.

It's an adjustable (via lastIndex) left anchor.

getify commented 9 years ago

It's an adjustable (via lastIndex) left anchor.

I guess what I was asking is, if one is parsing a string (of presumably unknown contents -- otherwise why parse?), how does one know what to adjust lastIndex to, if the non-anchoring at the front of the pattern is not allowed to adjust it for you?

In what circumstances do you need sticky when you already know the indexes your matches are at? You can just set the index and use exec(..), and it'll match right there, right? What benefit would y provide in that case?

Sorry for being dense here. It's obvious my own misunderstandings are driving all the confusion in this thread.

neVERberleRfellerER commented 9 years ago

@allenwb Thank you for making everything clear.

First, NOTEs are non-normative

This note was easiest to reference and it was compatible with my understanding that I have got from pattern and notation of continuation and matcher (the last two are not even numbered and as a result difficult to reference). And the clarification provided by it was really useful when I started diging through it, so its purpose was fullfiled, I think :-)

@getify I have found this: http://wiki.ecmascript.org/doku.php?id=harmony:regexp_y_flag One use case with explanation is in Rationale. The page itself is pretty old, but the idea from Proposal is same as in ES6 draft referenced in this conversation, so the same use cases should apply here.

getify commented 9 years ago

also, what then is ES6's expected handling of:

var re = /^o./y,     // <---- notice the ^ here
    str = "oops nope";

re.lastIndex;        // 0
str.match( re );     // ["oo"]
re.lastIndex;        // 2

str.match( re );     // null
re.lastIndex;        // 0

re.lastIndex = 6;
str.match( re );     // ????
neVERberleRfellerER commented 9 years ago

@getify I think that the last result should be null, because input is not starting at position 6.

Also take a look at this example, which I believe is correct:

var str = 'oops noaa\noops nobb\noops nocc';
var re = /^oops .o..\n?/ym; // notice the 'm' and different pattern

re.lastIndex = 10; // new line is starting at position 10

re.exec(str); // ['oops nobb\n']
re.lastIndex; // 20

re.exec(str); // ['oops nocc']
re.lastIndex; // 29

re.exec(str); // null
re.lastIndex; // 0

re.exec(str); // ['oops noaa\n']
re.lastIndex; // 10

This does not work with /g and repeated exec. Update It does. So, the adjustable left anchor is really the only advantage of this?

getify commented 9 years ago

@neVERberleRfellerER

This does not work with /g and repeated exec.

It seems to work for me (chrome) with /gm:

exec + gm

I don't see how y provides a benefit there? Did I miss it?

neVERberleRfellerER commented 9 years ago

@getify My bad. I ahve missed /m when I have been testing this. I do not know. Maybe the efficiency in some cases is the only advantage.

allenwb commented 9 years ago

see NOTE at end of 21.2.2.6

@neVERberleRfellerER

Maybe the efficiency in some cases is the only advantage.

Probably. I wasn't the feature champion, I just had to figure out how to integrate it into the spec.

getify commented 9 years ago

Maybe the efficiency in some cases is the only advantage.

Probably.

Yikes, I hope not.

Let me go back to some of my earliest assertions about the benefits of y... do these hold water?

  1. y can (partially) be emulated with g + exec(..), but the use of the g flag changes other regex-aware utilities (like String#match(..)) in perhaps-unwanted ways (match(/../g) grabs all matches all at once, instead of one-at-a-time), whereas match(/../y) will give you the sticky like benefits (lastIndex) without changing the one-at-a-time behavior of match(..).
  2. y can be seen as to restrain a behavior that's not possible to restrain with g + exec(..), which is that g + exec(..) will automatically "look forward" in the input string from wherever lastIndex currently is, perhaps finding it past lastIndex. y says "no, no, you can only find this at exactly lastIndex, or not at all".

I think those are faithful explanations of the difference/benefit of y. Sanity check?

getify commented 9 years ago

@allenwb

see NOTE at end of 21.2.2.6

Yep, that's exactly the kind of clarity I needed on the ^ issue. That's put to rest now. Thanks!

getify commented 9 years ago

FWIW, I've completely rewritten (and expanded) my book section on "sticky mode" after all these explorations. If it helps at all, here how I went about it:

https://github.com/getify/You-Dont-Know-JS/blob/master/es6%20&%20beyond/ch2.md#sticky-flag

sebmck commented 9 years ago

Closing this now since I've basically solved this on the Babel side and it's now up to a polyfill to patch the RegExp constructor. Thanks everyone for the insightful discussion!