PHPCSStandards / PHP_CodeSniffer

PHP_CodeSniffer tokenizes PHP files and detects violations of a defined set of coding standards.
BSD 3-Clause "New" or "Revised" License
786 stars 47 forks source link

Reminder: revisit `yield from` tokenization #529

Open jrfnl opened 6 days ago

jrfnl commented 6 days ago

Just a reminder to self to revisit the yield from tokenization and add more extensive tests for this.

Things to have a critical look at whether they are tokenized correctly/how they are tokenized:

Also needs a check on how PHP natively handles this.

jrfnl commented 4 days ago

Loosely related to: https://github.com/squizlabs/PHP_CodeSniffer/issues/3808

jrfnl commented 3 days ago

Okay, so I've had a look at this and as I suspected there are some issues which should be fixed. I'm currently undecided on how to fix these.

I'm fully aware that the issues found are edge cases and will rarely, if ever, be found in real life codebases. Having said that, that's all the more reason to fix these tokenizer issues as PHPCS kicking in and flagging things correctly is especially important for those edge cases. Issue https://github.com/squizlabs/PHP_CodeSniffer/issues/3808 is a case in point for this.

Tokenizer basic principles

Guiding principles:

Additionally:

Findings

For "plain" yield, everything is fine. For "plain" yield from everything is fine. Case-insensitivity of the keywords is also handled correctly.

Things get interesting once we add a comment/ignore annotation between the yield and the from keywords and/or have a tab or a new line character (and indentation whitespace) between the keywords.

Test set up

I've used the following code sample to examine the tokenization:

<?php
function generator()
{
    yield from gen2();

    yield /* comment */ from gen2();

    yield
    from
    gen2();

    yield // comment
    from gen2();

    yield
    /* comment */
    from
    gen2();
}

The above code sample is valid in PHP 8.3, but shows parse errors in PHP < 8.3 for those code samples with a comment between the yield and the from keywords. https://3v4l.org/2SI2Q#veol

PHP native tokenization

Taking the PHP native tokenization as a basis, there are four (five) different tokenizations to take into account.


Details about the differences in PHP native tokenization The PHP native tokenization is as follows: ```php yield from gen2(); # PHP 5.4: # Line 4: T_STRING ('yield') # Line 4: T_WHITESPACE (' ') # Line 4: T_STRING ('from') # # PHP 5.5 - 5.6: # Line 4: T_YIELD ('yield') # Line 4: T_WHITESPACE (' ') # Line 4: T_STRING ('from') # # PHP 7.0 - 8.2: # Line 4: T_YIELD_FROM ('yield from') # # PHP 8.3: # Line 4: T_YIELD_FROM ('yield from') yield /* comment */ from gen2(); # PHP 5.4: # Line 6: T_STRING ('yield') # Line 6: T_WHITESPACE (' ') # Line 6: T_COMMENT ('/* comment */') # Line 6: T_WHITESPACE (' ') # Line 6: T_STRING ('from') # # PHP 5.5 - 5.6: # Line 6: T_YIELD ('yield') # Line 6: T_WHITESPACE (' ') # Line 6: T_COMMENT ('/* comment */') # Line 6: T_WHITESPACE (' ') # Line 6: T_STRING ('from') # # PHP 7.0 - 8.2: # Line 6: T_YIELD ('yield') # Line 6: T_WHITESPACE (' ') # Line 6: T_COMMENT ('/* comment */') # Line 6: T_WHITESPACE (' ') # Line 6: T_STRING ('from') # # PHP 8.3: # Line 6: T_YIELD_FROM ('yield /* comment */ from') yield from gen2(); # PHP 5.4: # Line 8: T_STRING ('yield') # Line 8: T_WHITESPACE (' # ') # Line 9: T_STRING ('from') # # PHP 5.5 - 5.6: # Line 8: T_YIELD ('yield') # Line 8: T_WHITESPACE (' # ') # Line 9: T_STRING ('from') # # PHP 7.0 - 8.2: # Line 8: T_YIELD_FROM ('yield # from') # # PHP 8.3: # Line 8: T_YIELD_FROM ('yield # from') yield // comment from gen2(); # PHP 5.4: # Line 12: T_STRING ('yield') # Line 12: T_WHITESPACE (' ') # Line 12: T_COMMENT ('// comment # ') # Line 13: T_WHITESPACE (' ') # Line 13: T_STRING ('from') # # PHP 5.5 - 5.6: # Line 12: T_YIELD ('yield') # Line 12: T_WHITESPACE (' ') # Line 12: T_COMMENT ('// comment # ') # Line 13: T_WHITESPACE (' ') # Line 13: T_STRING ('from') # # PHP 7.0 - 8.2: # Line 12: T_YIELD ('yield') # Line 12: T_WHITESPACE (' ') # Line 12: T_COMMENT ('// comment # ') # Line 13: T_WHITESPACE (' ') # Line 13: T_STRING ('from') # # PHP 8.3: # Line 12: T_YIELD_FROM ('yield // comment # from') yield /* comment */ from gen2(); # PHP 5.4: # Line 15: T_STRING ('yield') # Line 15: T_WHITESPACE (' # ') # Line 16: T_COMMENT ('/* comment */') # Line 16: T_WHITESPACE (' # ') # Line 17: T_STRING ('from') # # PHP 5.5 - 5.6: # Line 15: T_YIELD ('yield') # Line 15: T_WHITESPACE (' # ') # Line 16: T_COMMENT ('/* comment */') # Line 16: T_WHITESPACE (' # ') # Line 17: T_STRING ('from') # # PHP 7.0 - 8.2: # Line 15: T_YIELD ('yield') # Line 15: T_WHITESPACE (' # ') # Line 16: T_COMMENT ('/* comment */') # Line 16: T_WHITESPACE (' # ') # Line 17: T_STRING ('from') # # PHP 8.3: # Line 15: T_YIELD_FROM ('yield # /* comment */ # from') ```

To summarize: it looks like PHP natively did not support a comment between the yield and from keywords in PHP 7.0 - 8.2 and does since PHP 8.3.

PHPCS tokenization

As things are, the PHPCS tokenization is consistent for PHP 5.4 - 8.2 and in line with a comment between the keywords being a parse error. For PHP 8.3, the PHPCS tokenization is different due to the PHP 8.3 native tokenization having changed.

The PHPCS tokenization is as follows:

    yield from gen2();

# PHP 5.4 - 8.2:
# 10 | L04 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 10]: yield from
#
# PHP 8.3:
# 10 | L04 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 10]: yield from

    yield /* comment */ from gen2();

# PHP 5.4 - 8.2:
# 19 | L06 | C  5 | CC 1 | ( 0) | T_YIELD                    | [  5]: yield
# 20 | L06 | C 10 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
# 21 | L06 | C 11 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
# 22 | L06 | C 24 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
# 23 | L06 | C 25 | CC 1 | ( 0) | T_STRING                   | [  4]: from
#
# PHP 8.3:
# 19 | L06 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 24]: yield /* comment */ from

    yield
    from
    gen2();

# PHP 5.4 - 8.2:
# 32 | L08 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
#
# 33 | L09 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  8]:     from
#
# PHP 8.3:
# 28 | L08 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
#
# 29 | L09 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  8]:     from

    yield // comment
    from gen2();

# PHP 5.4 - 8.2:
# 43 | L12 | C  5 | CC 1 | ( 0) | T_YIELD                    | [  5]: yield
# 44 | L12 | C 10 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
# 45 | L12 | C 11 | CC 1 | ( 0) | T_COMMENT                  | [ 10]: // comment
#
# 46 | L13 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
# 47 | L13 | C  2 | CC 1 | ( 0) | T_STRING                   | [  4]: from
#
# PHP 8.3:
# 39 | L12 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 16]: yield // comment
#
# 40 | L13 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]:   from

    yield
    /* comment */
    from
    gen2();

# PHP 5.4 - 8.2:
# 56 | L15 | C  5 | CC 1 | ( 0) | T_YIELD                    | [  5]: yield
# 57 | L15 | C 10 | CC 1 | ( 0) | T_WHITESPACE               | [  0]:
#
# 58 | L16 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  4]:
# 59 | L16 | C  5 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
# 60 | L16 | C 18 | CC 1 | ( 0) | T_WHITESPACE               | [  0]:
#
# 61 | L17 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  4]:
# 62 | L17 | C  5 | CC 1 | ( 0) | T_STRING                   | [  4]: from
#
# PHP 8.3:
# 49 | L15 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
#
# 50 | L16 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [ 17]:     /* comment */
#
# 51 | L17 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  8]:     from

To summarize:

Conclusions

As things currently are, there are the following issues in the PHPCS polyfill/re-tokenization:

Possible mitigations

How to consistently tokenize yield from with a comment and/or new line between the keywords, knowing this is a parse error in PHP < 8.3, is a little challenging.

There are three options I can see:

  1. Adhere to the PHP 8.3 tokenization - tokenize everything from yield to from as T_YIELD_FROM. This will make it more difficult for comment sniffs to act on potential comments in the token. It would also prohibit/disregard PHPCS annotations inside the yield from. Potentially the LanguageConstructSpacing sniff could be updated to flag/forbid comments in yield from.

    yield from gen2();
    # 10 | L04 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 10]: yield from
    
    yield /* comment */ from gen2();
    # 19 | L06 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 24]: yield /* comment */ from
    
    yield
    /* comment */
    from
    gen2();
    # 49 | L15 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
    #
    # 50 | L16 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [ 17]:     /* comment */
    #
    # 51 | L17 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  8]:     from
  2. Tokenize as a single token when there is only non-new line whitespace between the keywords, tokenize as multiple tokens in all other cases, but consistently tokenize the yield and from in those cases as T_YIELD_FROM.

    yield from gen2();
    # 10 | L04 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 10]: yield from
    
    yield /* comment */ from gen2();
    # 19 | L06 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
    # 20 | L06 | C 10 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
    # 21 | L06 | C 11 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
    # 22 | L06 | C 24 | CC 1 | ( 0) | T_WHITESPACE               | [  1]:
    # 23 | L06 | C 25 | CC 1 | ( 0) | T_YIELD_FROM               | [  4]: from
    
    yield
    /* comment */
    from
    gen2();
    # 56 | L15 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
    # 57 | L15 | C 10 | CC 1 | ( 0) | T_WHITESPACE               | [  0]:
    #
    # 58 | L16 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  4]:
    # 59 | L16 | C  5 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
    # 60 | L16 | C 18 | CC 1 | ( 0) | T_WHITESPACE               | [  0]:
    #
    # 61 | L17 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  4]:
    # 62 | L17 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  4]: from
  3. Hybrid - tokenize the yield and from keywords and any directly related whitespace as T_YIELD_FROM, but tokenize comments separately. This would be more in line with how other potential multi-line tokens, like T_COMMENT and T_CONSTANT_ENCAPSED_STRING are handled, but makes the whitespace handling around (single-line) yield from with a comment a bit weird/inconsistent as those would now potentially include whitespace.

    yield from gen2();
    # 10 | L04 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [ 10]: yield from
    
    yield /* comment */ from gen2();
    # 19 | L06 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  6]: yield
    # 20 | L06 | C 11 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
    # 21 | L06 | C 25 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]:  from
    
    yield
    /* comment */
    from
    gen2();
    # 49 | L15 | C  5 | CC 1 | ( 0) | T_YIELD_FROM               | [  5]: yield
    #
    # 50 | L16 | C  1 | CC 1 | ( 0) | T_WHITESPACE               | [  4]:
    # 51 | L16 | C  5 | CC 1 | ( 0) | T_COMMENT                  | [ 13]: /* comment */
    #
    # 52 | L17 | C  1 | CC 1 | ( 0) | T_YIELD_FROM               | [  8]:     from

Independently of the chosen solution, tab replacement should be enabled for the T_YIELD_FROM token, though whether this will only act on yield[tab]from or also on potential indentation whitespace will depend on the chosen solution.

Aside from the above, it will probably be a good idea to add a sniff to PHPCompatibility to check for comments between the yield and from keywords and flag that as unsupported prior to PHP 8.3. I've opened issue https://github.com/PHPCompatibility/PHPCompatibility/issues/1720 for this.

Opinions

I'd love to hear some opinions on the above and preferences for the solution directions (or even potential alternative solutions which I haven't thought of yet...)

I'm personally leaning towards solution 2️⃣ , but am still not 100% happy with this.

/cc @asispts @dingo-d @fredden @GaryJones @greg0ire @kukulich @michalbundyra @Ocramius @sirbrillig @stronk7 @weierophinney @wimg

jrfnl commented 3 days ago

Original polyfill discussion and PR tickets:

fredden commented 2 days ago

Another option not listed:

  1. Consistently tokenise yield and from as separate tokens (both T_YIELD_FROM), and tokenise anything in between "normally." This is in line with how a function declaration is tokenised: the function keyword (T_FUNCTION) doesn't include any white-space, the actual white space is tokenised as T_WHITESPACE and the name of the function (T_STRING) also doesn't include any white-space.
Ptr | Ln | Col  | Cond | ( #) | Token Type                 | [len]: Content
-------------------------------------------------------------------------
  0 | L1 | C  1 | CC 0 | ( 0) | T_OPEN_TAG                 | [  5]: <?php

  1 | L2 | C  1 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

  2 | L3 | C  1 | CC 0 | ( 0) | T_FUNCTION                 | [  8]: function
  3 | L3 | C  9 | CC 0 | ( 0) | T_WHITESPACE               | [  1]: ⸱
  4 | L3 | C 10 | CC 0 | ( 0) | T_STRING                   | [  5]: thing
  5 | L3 | C 15 | CC 0 | ( 0) | T_OPEN_PARENTHESIS         | [  1]: (
  6 | L3 | C 16 | CC 0 | ( 0) | T_CLOSE_PARENTHESIS        | [  1]: )
  7 | L3 | C 17 | CC 0 | ( 0) | T_SEMICOLON                | [  1]: ;
  8 | L3 | C 18 | CC 0 | ( 0) | T_WHITESPACE               | [  0]:

We can link the two T_YIELD_FROM tokens together with a property (similar to how we do scope/parenthesis open/close currently).

Proposed:

T_YIELD_FROM         yield
T_WHITESPACE         ⸱      
T_YIELD_FROM         from  
T_WHITESPACE         ⸱      
T_STRING             thing  
T_OPEN_PARENTHESIS   (      
T_CLOSE_PARENTHESIS  )      
T_SEMICOLON          ;      
jrfnl commented 2 days ago

@fredden Good input. Thanks for that. I'll have to think it over. Main reason why I didn't include that option in the original list is because if makes the gap between the PHP native tokenization and the PHPCS tokenization larger/makes it more inconsistent.

As for the comparison with function declarations: that's not a comparable syntax as PHP natively does not tokenize the function keyword and the function name as one token.

The only multi-character tokens which AFAICS come close are cast tokens which can contain arbitrary whitespace on the inside of the parentheses, though those still can't contain new lines or comments. See: https://3v4l.org/A6Sgj and https://3v4l.org/nE9H8