doctrine / lexer

Base library for a lexer that can be used in Top-Down, Recursive Descent Parsers.
https://www.doctrine-project.org/projects/lexer.html
MIT License

Add token index #12

Closed: instabledesign closed this issue 6 years ago

instabledesign commented 6 years ago

Hi, I recently worked on a new project, Xpression.

My need is to reset the lexer position to a given token index, but $token['position'] is the string position of that token in the input string.

My current workaround is to keep the lexer index in my parser code and reset it each time I need it.

So I think that if the token index were stored in the token, I could get it easily with $token['index'].

Thanks for reading.
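A minimal sketch of what the proposal could look like (hypothetical code, not the actual doctrine/lexer AbstractLexer): each token carries its sequential index in addition to its string position, so a parser can jump back to $tokens[$token['index']] directly.

```php
<?php
// Hypothetical sketch: store each token's sequential index alongside
// its value and string position, so a parser can reset to a token
// by index instead of tracking the index itself.
function tokenize(string $input): array
{
    preg_match_all('/(=|>|<)|([a-z]+)|(\d+)/i', $input, $m, PREG_OFFSET_CAPTURE);

    $tokens = [];
    foreach ($m[0] as $index => [$value, $position]) {
        $tokens[] = [
            'value'    => $value,
            'position' => $position, // offset in the input string (existing behavior)
            'index'    => $index,    // the addition proposed in this issue
        ];
    }

    return $tokens;
}

$tokens = tokenize('price>5');
// $tokens[1] holds the '>' token: position 5 in the string, index 1 in the stream.
```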

instabledesign commented 6 years ago

Ping @beberlei @stof, just to get an answer on whether to process this or close it.

stof commented 6 years ago

I'm not a maintainer on this project.

instabledesign commented 6 years ago

ok sorry

jwage commented 6 years ago

Thanks for the change. I think this makes sense and it should be BC. Can you add a unit test?

jwage commented 6 years ago

Thanks for making this change and adding the tests! I will merge a little later and I am going to follow this PR up with another to add Travis CI.

Majkl578 commented 6 years ago

Unfortunately this change is a BC break and needs to be reverted. :/ It has completely broken egeloen/ivory-serializer: https://gist.github.com/Majkl578/8c1500fd14884091e65e6af3ddef5c84

Thanks @goetas for spotting this!

jwage commented 6 years ago

@Majkl578 I reverted this here https://github.com/doctrine/lexer/pull/18

I kept the other changes from the PR and just reverted the changed functionality.

jwage commented 6 years ago

I looked at the code in https://github.com/egeloen/ivory-serializer/tree/master/src/Type and I don't immediately see what it was depending on that caused the break. I will look more later.

instabledesign commented 6 years ago

Hi, I think we can drop the

$this->tokens[$index] = array(

but we can keep

'index' => $index,

jwage commented 6 years ago

@instabledesign If you have time, can you look at https://github.com/egeloen/ivory-serializer and see why it broke after this change so that we can add tests to cover it?

instabledesign commented 6 years ago

Yes.

instabledesign commented 6 years ago

Investigation report:

Working solution: change the egeloen/ivory-serializer for (...) loop condition from

for ($i = 0; ($i < $count) || ($token === $nextToken); ++$i) {

to

for ($i = 0; ($i < $count) || ($token['value'] === $nextToken['value'] && $token['type'] === $nextToken['type'] && $token['position'] === $nextToken['position']); ++$i) {

I tried to fix AbstractLexer::$index so that it only increments when the match is not a capture of the previous one, but without success, and I don't think that would be a good solution.
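The root cause can be reproduced in isolation: PHP's === on arrays requires exactly the same key/value pairs, so adding any new key (here 'index') to the token array breaks a strict whole-token comparison even though the original fields are unchanged (a minimal illustration, not the serializer's actual code):

```php
<?php
// Before the change: two tokens built from the same match compare equal.
$old      = ['value' => '>', 'type' => 'operator', 'position' => 5];
$oldAgain = ['value' => '>', 'type' => 'operator', 'position' => 5];
var_dump($old === $oldAgain); // true

// After the change: the extra 'index' key makes strict comparison fail,
// even though value, type, and position are identical.
$new = ['value' => '>', 'type' => 'operator', 'position' => 5, 'index' => 1];
var_dump($old === $new); // false

// Comparing only the fields the caller cares about is robust to new keys,
// which is what the working solution above does.
$same = $old['value'] === $new['value']
    && $old['type'] === $new['type']
    && $old['position'] === $new['position'];
var_dump($same); // true
```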

instabledesign commented 6 years ago

I tried to fix the two problems above, but there is some logic that builds the fixtures with some private methods, so it is not easy to reproduce and to fix correctly what is going on! I will continue to work on it in my free time.

instabledesign commented 6 years ago

Tests fixed; I pinged him so it can be merged.

instabledesign commented 6 years ago

egeloen/ivory-serializer looks inactive now, with only one release (from Jan 2017).

@jwage do you plan to create a new version with this modification?

jwage commented 6 years ago

Did we figure out a way to make the change in this repo so it doesn't break existing implementations? (even if their regex is "wrong")

instabledesign commented 6 years ago

First I tried named groups, but named groups don't work with preg_split:

preg_split('/(?<FOO>=|>|<)|(?<BAR>[a-z]+)|(?<BAZ>\d+)/i', 'price>5', -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_OFFSET_CAPTURE);
/*
array(3) {
  [0]=>
  array(2) {
    [0]=>
    string(5) "price"
    [1]=>
    int(0)
  }
  [1]=>
  array(2) {
    [0]=>
    string(1) ">"
    [1]=>
    int(5)
  }
  [2]=>
  array(2) {
    [0]=>
    string(1) "5"
    [1]=>
    int(6)
  }
}
*/
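For contrast, preg_match_all does populate named captures, which is presumably why the named-group route fails specifically with preg_split's flat return format (a small illustration using the same pattern and input):

```php
<?php
// preg_match_all exposes named captures as keys of the result array;
// preg_split, as shown above, only returns flat [string, offset] pairs.
preg_match_all('/(?<FOO>=|>|<)|(?<BAR>[a-z]+)|(?<BAZ>\d+)/i', 'price>5', $m);

var_dump($m['BAR'][0]); // "price" (match 0 was captured by BAR)
var_dump($m['FOO'][1]); // ">"     (match 1 was captured by FOO)
var_dump($m['BAZ'][2]); // "5"     (match 2 was captured by BAZ)
```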

The second way is to deduplicate the matched elements by offset:

$matches = preg_split('/((=|>|<)|([a-z]+)|(\d+))/i', 'price>5', -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_OFFSET_CAPTURE);
$offset = null;
$matchesDeduplicate = array_filter($matches, function($item)use(&$offset){
    if (null === $offset) {
        $offset = $item[1];
        return true;
    }
    $filter = $offset !== $item[1];
    $offset = $item[1];

    return $filter;
});
/*
array(3) {
  [0]=>
  array(2) {
    [0]=>
    string(5) "price"
    [1]=>
    int(0)
  }
  [2]=>
  array(2) {
    [0]=>
    string(1) ">"
    [1]=>
    int(5)
  }
  [4]=>
  array(2) {
    [0]=>
    string(1) "5"
    [1]=>
    int(6)
  }
}
*/

With 300 tokens matched 10 times each (9000 tokens), we already have about 1 MB more memory consumption (comparing preg_split without deduplication vs. preg_split with deduplication).
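The extra memory comes from keeping every capture twice (once for the outer group, once for the inner one) in the intermediate array before filtering. A hedged sketch of how to measure this yourself, reusing the pattern from the comment above; the exact byte counts depend on the PHP version and input:

```php
<?php
// Rough comparison: split a large input with the duplicating pattern,
// then deduplicate by offset as in the proposal above.
$input = str_repeat('price>5 ', 1000);

$before  = memory_get_usage();
$matches = preg_split(
    '/((=|>|<)|([a-z]+)|(\d+))/i',
    $input,
    -1,
    PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_OFFSET_CAPTURE
);
$extraBytes = memory_get_usage() - $before; // cost of the duplicated captures

// Each token appears twice (outer + inner capture) at the same offset,
// so keeping only the first entry per offset halves the token entries.
$offset = null;
$deduplicated = array_filter($matches, function ($item) use (&$offset) {
    $keep   = $offset !== $item[1];
    $offset = $item[1];

    return $keep;
});

echo count($matches), ' raw entries, ', count($deduplicated),
    ' kept, ~', $extraBytes, " bytes for the raw array\n";
```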

instabledesign commented 6 years ago

What do you think about it, @jwage?

jwage commented 6 years ago

I don't think we can make this change without breaking BC or increasing memory usage as you noted.

instabledesign commented 11 months ago

Hi, can you consider applying this change in v2? It was originally reverted because it broke unmaintained libs. @greg0ire

greg0ire commented 11 months ago

If there is a breaking change, then it should go into v4 I'm afraid.

instabledesign commented 10 months ago

I'm not completely sure it was a BC break, because it only adds a new value to the token details. As I explained earlier, egeloen/ivory-serializer did not use the lexer properly, so my change altered its behavior. For me it is OK if this change goes into v3, or at least into v4.

greg0ire commented 10 months ago

If it's not a breaking change, then you should target v3.1.