Open LakshyAAAgrawal opened 1 year ago
Noticed that the fourth token in the above stream is also empty ('') and of type ';', which is unexpected.
Hey! Thank you for pointing this out!
Note that code_tokenize always tries to construct the AST/CST (based on tree-sitter) before tokenization. Since tree-sitter is a best-effort parser, it might inject nodes to match the grammar, and these sometimes end up in the token stream.
If you parse a syntactically incorrect program, you can easily filter these fake nodes by removing all tokens that are marked as error nodes:
token.ast_node.has_error # Returns True for error nodes and False otherwise
Thanks a lot for your reply. Apart from your suggested check above, token.ast_node.has_error, is it okay to remove tokens that do not match any text?
[t for t in ctok.tokenize(code_text, lang='java', syntax_error='ignore') if t.text != '']
Hey! I think you should be fine for now, since no real token should match an empty string. However, if you want to be safe, I would still go with checking for error nodes.
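For anyone landing here later, the two checks discussed above can be combined into one filter. This is only a minimal sketch: `drop_injected_tokens` and `fake_token` are hypothetical helper names, and the stand-in tokens simply mimic the `text` and `ast_node.has_error` attributes mentioned in this thread so that the snippet runs without a parser installed.

```python
from types import SimpleNamespace


def drop_injected_tokens(tokens):
    """Keep tokens that match real source text and are not flagged as error nodes."""
    return [
        t for t in tokens
        if t.text != "" and not t.ast_node.has_error
    ]


def fake_token(text, has_error=False):
    # Hypothetical stand-in for a code_tokenize token (illustration only).
    return SimpleNamespace(text=text, ast_node=SimpleNamespace(has_error=has_error))


stream = [
    fake_token("int"),
    fake_token("x"),
    fake_token("", has_error=True),  # injected by the parser, matches no text
    fake_token("}"),
]

kept = [t.text for t in drop_injected_tokens(stream)]
print(kept)
```

With the real library, the same helper should be applicable to the result of the call quoted earlier, i.e. `drop_injected_tokens(ctok.tokenize(code_text, lang='java', syntax_error='ignore'))`, assuming the tokens expose `text` and `ast_node.has_error` as described above.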
Consider the following code:
The result of executing the above code is:
There is an additional token of type '}' at the end of the token stream, and it doesn't match any text. The token stream up to the last token is as expected: