Open rr- opened 10 years ago
Yes, I've already had some false positives with this rule. The analysis done is too simple.
This is where a real AST could help a lot.
I'd say that #32 is related to this.
https://github.com/nikic/PHP-Parser
I found decent AST, but it ignores whitespace. Apparently, there is an issue for this but it hasn't been worked on for two years.
@rr- yeah, @nikic's PHP-Parser is something I've looked at. I've not had any time to actually look at implementing it - it'll definitely fix a lot of our problems.
@rr- I'm going to try and spend a bit of time looking at PHP-Parser if I can. The introduction says:
There are other ways of processing source code. One that PHP supports natively is using the token stream generated by
token_get_all
. The token stream is much more low level than the AST and thus has different applications: It allows to also analyze the exact formatting of a file. On the other hand the token stream is much harder to deal with for more complex analysis. For example an AST abstracts away the fact that in PHP variables can be written as$foo
, but also as$$bar
,${'foobar'}
or even${!${''}=barfoo()}
. You don't have to worry about recognizing all the different syntaxes from a stream of tokens.
The bit I'm interested is in bold.
So token_get_all
(which is what we use) is great but so is using an AST. I've got a feeling we're going to have to mix and match here.
Another way to go may be to look at how Scrutinzer works.
Yes, using tokenizer would be cool to do checks like whitespace around operators, etc. AST should be used for more complex checks, like unused variables and code. Trying to figure them out from flat token list would be too complicated.
In each use case, I'd go with whatever makes the code most maintainable.
Note that token_get_all
ignores a lot of tokens, AFAIR including whitespace. The best option here would be to write a tokenizer yourself. An example tokenizer could look like this. It is based on regular expressions and longest match rule. As you see, it's not that complex. I tested it only on itself, so it is by no means complete.
If implemented carefully, it probably would decrease size of existing codebase twofold without breaking current functionality, which is a good thing.
PHPCheckstyle does add some missing tokens in to account for those which aren't included.
I've started a new branch which you're more than welcome to contribute to.
As an additional consideration, it should be added that the php-parser is much slower than token_get_all. I don't know how much slower off the top of my head, but at least 10 times. So including it is probably only worth the slowdown if you're going to do multiple complex inspections.
It's possible to switch from the syntax tree to the underlying tokens with a bit of extra code. By using a custom token offset lexer you can get the start and end offsets for a particular AST node in the token array. I'm using this approach to inspect the exact formatting of a node or do changes to code without breaking formatting. Sadly I don't have open-source examples for this right now.
I confirm that currently PHPCheckstyle use token_get_all() with some modifications to take into account spaces, tabs and carriage return and to count the line returns in the inlined HTML.
I've been very interested in PHP-Parser for a long time (thanks nikic, it's a nice job you've done) but no time to try to implement something.
We should also followx what will happend with PHP 7 and its AST.
Consider this example:
PHPCheckstyle warns me about it:
even though it isn't true. To make the warning go away, I need to refactor the code like this: