cedricrupb / code_tokenize

Fast tokenization and structural analysis of any programming language
MIT License

Unexpected token with no matched text at the end of tokenstream #5

Open LakshyAAAgrawal opened 1 year ago

LakshyAAAgrawal commented 1 year ago

Consider the following code:

text = """private void unlockMap(Player player) {
        TowerData towerData = player.getTowerData();
        if (!towerData.getClass().equals(TowerData.class)) {
            CommandHandler.sendTranslatedMessage(player, "commands.generic.no_permissions");
        } else {
            if (towerData."""

import code_tokenize as ctok
tokens = ctok.tokenize(text, lang='java', syntax_error='ignore')

assert tokens[-1].type == '.', (tokens[-1].type, tokens[-1].text)

The result of executing the above code is:

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[23], line 11
      8 import code_tokenize as ctok
      9 tokens = ctok.tokenize(text, lang='java', syntax_error='ignore')
---> 11 assert tokens[-1].type == '.', (tokens[-1].type, tokens[-1].text)

AssertionError: ('}', '')

There is an additional token of type '}' at the end of the token stream, and it does not match any text. The token stream up to the last token is as expected:

[private,
 void,
 unlockMap,
 ,
 (,
 Player,
 player,
 ),
 {,
 TowerData,
 towerData,
 =,
 player,
 .,
 getTowerData,
 (,
 ),
 ;,
 if,
 (,
 !,
 towerData,
 .,
 getClass,
 (,
 ),
 .,
 equals,
 (,
 TowerData,
 .,
 class,
 ),
 ),
 {,
 CommandHandler,
 .,
 sendTranslatedMessage,
 (,
 player,
 ,,
 "commands.generic.no_permissions",
 ),
 ;,
 },
 else,
 {,
 if,
 (,
 towerData,
 .,
 ]

LakshyAAAgrawal commented 1 year ago

I also noticed that the fourth token in the above stream is empty ('') and of type ';', which is unexpected.
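
To spot such tokens without reading the whole stream, a small helper along these lines might be useful. It is only a sketch: `find_empty_tokens` is a hypothetical name, and it relies solely on the `.type` and `.text` attributes used in the report above. The stand-in objects replace a real `ctok.tokenize(...)` result so the snippet runs without the library installed.

```python
from types import SimpleNamespace


def find_empty_tokens(tokens):
    """Return (index, type) pairs for tokens whose text is empty.

    Hypothetical helper: it only assumes each token exposes `.type` and `.text`.
    """
    return [(i, tok.type) for i, tok in enumerate(tokens) if tok.text == ""]


# Stand-in tokens for illustration only; in practice, pass the list
# returned by ctok.tokenize(text, lang='java', syntax_error='ignore').
stream = [
    SimpleNamespace(type="name", text="unlockMap"),  # "name" is a made-up type label
    SimpleNamespace(type=";", text=""),              # empty token, as reported above
    SimpleNamespace(type="(", text="("),
]

empty = find_empty_tokens(stream)
print(empty)
```

Run against the real token list, this should report the position and type of each token that has no text.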

cedricrupb commented 1 year ago

Hey! Thank you for pointing this out!

Note that code_tokenize always tries to construct the AST/CST (based on tree-sitter) before tokenization. Since tree-sitter is a best-effort parser, it might inject nodes to match the grammar which sometimes end up in the token stream.

If you parse a syntactically incorrect program, you can easily filter these fake nodes by removing all tokens that are marked as error nodes:

 token.ast_node.has_error # Returns True for error nodes and False otherwise
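
A filter built on that check could look roughly like the sketch below. `drop_error_tokens` is a hypothetical helper name, and the code assumes that every token exposes `ast_node` and that `has_error` is a plain boolean attribute, as in the line above. The stand-in objects make the snippet runnable without code_tokenize.

```python
from types import SimpleNamespace


def drop_error_tokens(tokens):
    """Keep only tokens whose AST node is not flagged as an error node.

    Assumes each token has `.ast_node.has_error` (a boolean), as described above.
    """
    return [tok for tok in tokens if not tok.ast_node.has_error]


def make_token(text, has_error=False):
    # Stand-in for a real token; for illustration only.
    return SimpleNamespace(text=text, ast_node=SimpleNamespace(has_error=has_error))


# In practice: tokens = ctok.tokenize(text, lang='java', syntax_error='ignore')
tokens = [
    make_token("towerData"),
    make_token("."),
    make_token("", has_error=True),  # node injected by the parser
]

cleaned = drop_error_tokens(tokens)
print([tok.text for tok in cleaned])
```

With the stand-in data, only the two tokens that carry text remain. Whether every injected node is flagged this way depends on the tree-sitter grammar, so it is worth checking the result on your own inputs.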

LakshyAAAgrawal commented 1 year ago

Thanks a lot for your reply. Apart from your suggested check above, token.ast_node.has_error, is it okay to remove tokens that do not match any text?

[t for t in ctok.tokenize(code_text, lang='java', syntax_error='ignore') if t.text != '']

cedricrupb commented 1 year ago

Hey! I think you should be fine for now, since no real token should match an empty string. However, if you want to be safe, I would still check for error nodes.