PHPCSStandards / PHP_CodeSniffer

PHP_CodeSniffer tokenizes PHP files and detects violations of a defined set of coding standards.
BSD 3-Clause "New" or "Revised" License
880 stars 54 forks source link

4.0 | Proposal: improve consistency for PHP open tag tokenization #593

Open jrfnl opened 1 month ago

jrfnl commented 1 month ago

Current Situation

Inspired by #588

As things are, the tokenization of PHP open tag tokens is taken from PHP itself, without changes, but the tokenization in PHP itself is inconsistent in regards to how whitespace following the actual tag is tokenized.

Open tag / followed by <?php <? (with short_open_tags=on) <?=
New line new line included in T_OPEN_TAG T_OPEN_TAG + T_WHITESPACE T_OPEN_TAG_WITH_ECHO + T_WHITESPACE
Trailing spaces + new line one space included, followed by T_WHITESPACE T_OPEN_TAG + T_WHITESPACE T_OPEN_TAG_WITH_ECHO + T_WHITESPACE
No space ** T_OPEN_TAG T_OPEN_TAG_WITH_ECHO
One space space included in T_OPEN_TAG T_OPEN_TAG + T_WHITESPACE T_OPEN_TAG_WITH_ECHO + T_WHITESPACE
Multiple spaces one space included, followed by T_WHITESPACE T_OPEN_TAG + T_WHITESPACE T_OPEN_TAG_WITH_ECHO + T_WHITESPACE

** How the <?php tag without space after it is tokenized depends on whether short_open_tags=on. With short_open_tags=off: tokenized as T_INLINE_HTML. With short_open_tags=on: tokenized as T_OPEN_TAG <? + T_STRING for php + whatever came after.

:point_up: In my opinion, the above is correct and will not be changed as <?php without any space after it, should be considered a parse error anyway.

The Problem

The short of it, is that short open tags - <? - and short open echo tags - <?= - are tokenized consistently as just the plain tag without any whitespace inclusion.

But the token for long open tags - <?php - may include either a new line or a single space after the tag itself, but only that. Any additional whitespace after that is tokenized as a separate T_WHITESPACE token.

This inconsistency means that sniff writers need to remember this unintuitive peculiarity and have to take it into account whenever a sniff needs to deal with PHP open tags.

This also makes the code in these type of sniffs more complicated and increases the cognitive load.

I suspect there will be numerous sniffs out there (including in PHPCS itself) which don't take the difference in tokenization between the long and short open tags into account correctly. This will, in most cases, only be problematic for files using short open tags, as most sniffs default to expecting long open tags.

However, this also means that those sniffs will be buggy if they are used to scan files with short open tags (with short_open_tags=on).

Proposal

I'd like to propose to streamline the tokenization of PHP open tags to never include whitespace.

In practice, only the tokenization of long open tags <?php is affected by this change proposal, as those would no longer include a new line/single space.

I expect the impact of this change on existing code to be relatively low as there are not that many sniffs which specifically deal with PHP open tags. I also expect that the majority of the affected sniffs will be PHPCS itself (as external standards rarely have the need for extra sniffs for open tags as PHPCS itself has most of the things which could be examined covered via the existing sniffs).

Planning

As this would be a breaking change for sniffs which examine the T_OPEN_TAG token, this change would go into PHPCS 4.0.

The change going into PHPCS 4.0 also means we don't have to consider the impact of this change on CSS and JS tokenization and sniffs anymore.

Opinions

Please let me know if you have any concerns about this proposal or any suggestions for alternative approaches.

/cc @asispts @dingo-d @fredden @GaryJones @greg0ire @kukulich @michalbundyra @Ocramius @sirbrillig @stronk7 @weierophinney @wimg

jrfnl commented 1 month ago

Looks like the PHP close tag also may include whitespace. As there is only one flavour of PHP close tags, I'm not sure anything should change for close tags though.