Open jrfnl opened 6 days ago
Loosely related to: https://github.com/squizlabs/PHP_CodeSniffer/issues/3808
Okay, so I've had a look at this and as I suspected there are some issues which should be fixed. I'm currently undecided on how to fix these.
I'm fully aware that the issues found are edge cases and will rarely, if ever, be found in real life codebases. Having said that, that's all the more reason to fix these tokenizer issues as PHPCS kicking in and flagging things correctly is especially important for those edge cases. Issue https://github.com/squizlabs/PHP_CodeSniffer/issues/3808 is a case in point for this.
Guiding principles:
Additionally:
content
should be replaced by the appropriate number of spaces (and the original content with tabs will be made available in a orig_content
key).For "plain" yield
, everything is fine.
For "plain" yield from
everything is fine.
Case-insensitivity of the keywords is also handled correctly.
Things get interesting once we add a comment/ignore annotation between the yield
and the from
keywords and/or have a tab or a new line character (and indentation whitespace) between the keywords.
I've used the following code sample to examine the tokenization:
<?php
function generator()
{
yield from gen2();
yield /* comment */ from gen2();
yield
from
gen2();
yield // comment
from gen2();
yield
/* comment */
from
gen2();
}
The above code sample is valid in PHP 8.3, but shows parse errors in PHP < 8.3 for those code samples with a comment between the yield
and the from
keywords.
https://3v4l.org/2SI2Q#veol
Taking the PHP native tokenization as a basis, there are four (five) different tokenizations to take into account.
yield
didn't exist as a keyword.yield
did exist, but yield from
did not.yield
as well as yield from
exist.yield from
keyword was changed (improved/fixed).To summarize: it looks like PHP natively did not support a comment between the yield
and from
keywords in PHP 7.0 - 8.2 and does since PHP 8.3.
As things are, the PHPCS tokenization is consistent for PHP 5.4 - 8.2 and in line with a comment between the keywords being a parse error. For PHP 8.3, the PHPCS tokenization is different due to the PHP 8.3 native tokenization having changed.
The PHPCS tokenization is as follows:
yield from gen2();
# PHP 5.4 - 8.2:
# 10 | L04 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 10]: yield from
#
# PHP 8.3:
# 10 | L04 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 10]: yield from
yield /* comment */ from gen2();
# PHP 5.4 - 8.2:
# 19 | L06 | C 5 | CC 1 | ( 0) | T_YIELD | [ 5]: yield
# 20 | L06 | C 10 | CC 1 | ( 0) | T_WHITESPACE | [ 1]:
# 21 | L06 | C 11 | CC 1 | ( 0) | T_COMMENT | [ 13]: /* comment */
# 22 | L06 | C 24 | CC 1 | ( 0) | T_WHITESPACE | [ 1]:
# 23 | L06 | C 25 | CC 1 | ( 0) | T_STRING | [ 4]: from
#
# PHP 8.3:
# 19 | L06 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 24]: yield /* comment */ from
yield
from
gen2();
# PHP 5.4 - 8.2:
# 32 | L08 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 5]: yield
#
# 33 | L09 | C 1 | CC 1 | ( 0) | T_YIELD_FROM | [ 8]: from
#
# PHP 8.3:
# 28 | L08 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 5]: yield
#
# 29 | L09 | C 1 | CC 1 | ( 0) | T_YIELD_FROM | [ 8]: from
yield // comment
from gen2();
# PHP 5.4 - 8.2:
# 43 | L12 | C 5 | CC 1 | ( 0) | T_YIELD | [ 5]: yield
# 44 | L12 | C 10 | CC 1 | ( 0) | T_WHITESPACE | [ 1]:
# 45 | L12 | C 11 | CC 1 | ( 0) | T_COMMENT | [ 10]: // comment
#
# 46 | L13 | C 1 | CC 1 | ( 0) | T_WHITESPACE | [ 1]:
# 47 | L13 | C 2 | CC 1 | ( 0) | T_STRING | [ 4]: from
#
# PHP 8.3:
# 39 | L12 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 16]: yield // comment
#
# 40 | L13 | C 1 | CC 1 | ( 0) | T_YIELD_FROM | [ 5]: from
yield
/* comment */
from
gen2();
# PHP 5.4 - 8.2:
# 56 | L15 | C 5 | CC 1 | ( 0) | T_YIELD | [ 5]: yield
# 57 | L15 | C 10 | CC 1 | ( 0) | T_WHITESPACE | [ 0]:
#
# 58 | L16 | C 1 | CC 1 | ( 0) | T_WHITESPACE | [ 4]:
# 59 | L16 | C 5 | CC 1 | ( 0) | T_COMMENT | [ 13]: /* comment */
# 60 | L16 | C 18 | CC 1 | ( 0) | T_WHITESPACE | [ 0]:
#
# 61 | L17 | C 1 | CC 1 | ( 0) | T_WHITESPACE | [ 4]:
# 62 | L17 | C 5 | CC 1 | ( 0) | T_STRING | [ 4]: from
#
# PHP 8.3:
# 49 | L15 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 5]: yield
#
# 50 | L16 | C 1 | CC 1 | ( 0) | T_YIELD_FROM | [ 17]: /* comment */
#
# 51 | L17 | C 1 | CC 1 | ( 0) | T_YIELD_FROM | [ 8]: from
To summarize:
yield
and from
in a yield from
with a comment between the keywords as T_YIELD
and T_STRING
respectively, and tokenizes the comment as as T_COMMENT
and treats any whitespace as T_WHITESPACE
.yield from
with a comment between the keywords as one T_YIELD_FROM
token (possibly split into multiple tokens when multi-line) and includes the comment in the T_YIELD_FROM
.As things currently are, there are the following issues in the PHPCS polyfill/re-tokenization:
yield
and from
keywords.
This means that sniffs which check for indentation/inline tabs will not flag tabs between the keywords or indentation tabs when the keywords are across multiple lines (though those sniffs may need updating too to listen to the T_YIELD_FROM
token).
Sniffs which will need tests/updates for this: Generic.WhiteSpace.DisallowTabIndent
, Generic.WhiteSpace.LanguageConstructSpacing
and (PHPCSExtra) Universal.WhiteSpace.DisallowInlineTabs
.yield
and from
keywords are not being handled consistently and that the keywords are not always tokenized consistently/correctly.How to consistently tokenize yield from
with a comment and/or new line between the keywords, knowing this is a parse error in PHP < 8.3, is a little challenging.
There are three options I can see:
Adhere to the PHP 8.3 tokenization - tokenize everything from yield
to from
as T_YIELD_FROM
.
This will make it more difficult for comment sniffs to act on potential comments in the token. It would also prohibit/disregard PHPCS annotations inside the yield from
.
Potentially the LanguageConstructSpacing
sniff could be updated to flag/forbid comments in yield from
.
yield from gen2();
# 10 | L04 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 10]: yield from
yield /* comment */ from gen2();
# 19 | L06 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 24]: yield /* comment */ from
yield
/* comment */
from
gen2();
# 49 | L15 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 5]: yield
#
# 50 | L16 | C 1 | CC 1 | ( 0) | T_YIELD_FROM | [ 17]: /* comment */
#
# 51 | L17 | C 1 | CC 1 | ( 0) | T_YIELD_FROM | [ 8]: from
Tokenize as a single token when there is only non-new line whitespace between the keywords, tokenize as multiple tokens in all other cases, but consistently tokenize the yield
and from
in those cases as T_YIELD_FROM
.
yield from gen2();
# 10 | L04 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 10]: yield from
yield /* comment */ from gen2();
# 19 | L06 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 5]: yield
# 20 | L06 | C 10 | CC 1 | ( 0) | T_WHITESPACE | [ 1]:
# 21 | L06 | C 11 | CC 1 | ( 0) | T_COMMENT | [ 13]: /* comment */
# 22 | L06 | C 24 | CC 1 | ( 0) | T_WHITESPACE | [ 1]:
# 23 | L06 | C 25 | CC 1 | ( 0) | T_YIELD_FROM | [ 4]: from
yield
/* comment */
from
gen2();
# 56 | L15 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 5]: yield
# 57 | L15 | C 10 | CC 1 | ( 0) | T_WHITESPACE | [ 0]:
#
# 58 | L16 | C 1 | CC 1 | ( 0) | T_WHITESPACE | [ 4]:
# 59 | L16 | C 5 | CC 1 | ( 0) | T_COMMENT | [ 13]: /* comment */
# 60 | L16 | C 18 | CC 1 | ( 0) | T_WHITESPACE | [ 0]:
#
# 61 | L17 | C 1 | CC 1 | ( 0) | T_WHITESPACE | [ 4]:
# 62 | L17 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 4]: from
Hybrid - tokenize the yield
and from
keywords and any directly related whitespace as T_YIELD_FROM
, but tokenize comments separately.
This would be more in line with how other potential multi-line tokens, like T_COMMENT
and T_CONSTANT_ENCAPSED_STRING
are handled, but makes the whitespace handling around (single-line) yield from
with a comment a bit weird/inconsistent as those would now potentially include whitespace.
yield from gen2();
# 10 | L04 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 10]: yield from
yield /* comment */ from gen2();
# 19 | L06 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 6]: yield
# 20 | L06 | C 11 | CC 1 | ( 0) | T_COMMENT | [ 13]: /* comment */
# 21 | L06 | C 25 | CC 1 | ( 0) | T_YIELD_FROM | [ 5]: from
yield
/* comment */
from
gen2();
# 49 | L15 | C 5 | CC 1 | ( 0) | T_YIELD_FROM | [ 5]: yield
#
# 50 | L16 | C 1 | CC 1 | ( 0) | T_WHITESPACE | [ 4]:
# 51 | L16 | C 5 | CC 1 | ( 0) | T_COMMENT | [ 13]: /* comment */
#
# 52 | L17 | C 1 | CC 1 | ( 0) | T_YIELD_FROM | [ 8]: from
Independently of the chosen solution, tab replacement should be enabled for the T_YIELD_FROM
token, though whether this will only act on yield[tab]from
or also on potential indentation whitespace will depend on the chosen solution.
Aside from the above, it will probably be a good idea to add a sniff to PHPCompatibility to check for comments between the yield
and from
keywords and flag that as unsupported prior to PHP 8.3. I've opened issue https://github.com/PHPCompatibility/PHPCompatibility/issues/1720 for this.
I'd love to hear some opinions on the above and preferences for the solution directions (or even potential alternative solutions which I haven't thought of yet...)
I'm personally leaning towards solution 2️⃣ , but am still not 100% happy with this.
/cc @asispts @dingo-d @fredden @GaryJones @greg0ire @kukulich @michalbundyra @Ocramius @sirbrillig @stronk7 @weierophinney @wimg
Original polyfill discussion and PR tickets:
Another option not listed:
yield
and from
as separate tokens (both T_YIELD_FROM
), and tokenise anything in between "normally." This is in line with how a function declaration is tokenised: the function
keyword (T_FUNCTION
) doesn't include any white-space, the actual white space is tokenised as T_WHITESPACE
and the name of the function (T_STRING
) also doesn't include any white-space.Ptr | Ln | Col | Cond | ( #) | Token Type | [len]: Content
-------------------------------------------------------------------------
0 | L1 | C 1 | CC 0 | ( 0) | T_OPEN_TAG | [ 5]: <?php
1 | L2 | C 1 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
2 | L3 | C 1 | CC 0 | ( 0) | T_FUNCTION | [ 8]: function
3 | L3 | C 9 | CC 0 | ( 0) | T_WHITESPACE | [ 1]: ⸱
4 | L3 | C 10 | CC 0 | ( 0) | T_STRING | [ 5]: thing
5 | L3 | C 15 | CC 0 | ( 0) | T_OPEN_PARENTHESIS | [ 1]: (
6 | L3 | C 16 | CC 0 | ( 0) | T_CLOSE_PARENTHESIS | [ 1]: )
7 | L3 | C 17 | CC 0 | ( 0) | T_SEMICOLON | [ 1]: ;
8 | L3 | C 18 | CC 0 | ( 0) | T_WHITESPACE | [ 0]:
We can link the two T_YIELD_FROM
tokens together with a property (similar to how we do scope/parenthesis open/close currently).
Proposed:
T_YIELD_FROM yield
T_WHITESPACE ⸱
T_YIELD_FROM from
T_WHITESPACE ⸱
T_STRING thing
T_OPEN_PARENTHESIS (
T_CLOSE_PARENTHESIS )
T_SEMICOLON ;
@fredden Good input. Thanks for that. I'll have to think it over. Main reason why I didn't include that option in the original list is because if makes the gap between the PHP native tokenization and the PHPCS tokenization larger/makes it more inconsistent.
As for the comparison with function declarations: that's not a comparable syntax as PHP natively does not tokenize the function
keyword and the function name as one token.
The only multi-character tokens which AFAICS come close are cast tokens which can contain arbitrary whitespace on the inside of the parentheses, though those still can't contain new lines or comments. See: https://3v4l.org/A6Sgj and https://3v4l.org/nE9H8
Just a reminder to self to revisit the
yield from
tokenization and add more extensive tests for this.Things to have a critical look at whether they are tokenized correctly/how they are tokenized:
yield
andfrom
with only whitespace between the words, but the words are on different linesyield
andfrom
with a comment between the wordsAlso needs a check on how PHP natively handles this.