string literals lost during tokenization for C/C++ code.

cedricrupb / code_tokenize

Fast tokenization and structural analysis of any programming language

MIT License

string literals lost during tokenization for C/C++ code. #4

Open 5c4lar opened 2 years ago

5c4lar commented 2 years ago

For the following code:

import code_tokenize as ctok
sample = """
#include <stdio.h>
int main() {
    printf("hello world");
}
"""
ctok.tokenize(sample, lang = "cpp")

Output (the contents of the string literal are missing; only the two quote characters remain):

[#include, <stdio.h>, , int, main, (, ), {, printf, (, ", ", ), ;, }]

String literals are tokenized correctly for Java and Python code, though. How should I fix this problem?

5c4lar commented 2 years ago

Use a custom visitor like this:

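class CLeafVisitor(ctok.tokenizer.LeafVisitor):
    # Hook for C/C++ string_literal nodes.
    def visit_string_literal(self, node):
        # Emit the whole literal as a single token ...
        self.node_handler(node)
        # ... and return False so its children are not visited separately.
        return False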

This seems to fix the problem.

cedricrupb commented 2 years ago

Thank you for this hint!

I will add more custom visitors for the supported languages in the next release. Until then, you can use custom visitors to parse your code. For example, you could use your C visitor as follows:

ctok.tokenize(sample, lang = "cpp", visitors=[CLeafVisitor])
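
For anyone wondering why such a visitor helps, here is a toy model of the traversal. It does not use code_tokenize or tree-sitter: the `Node`, `ToyLeafVisitor`, and `ToyCLeafVisitor` classes and the tree shape are made up for illustration. The key assumption is that the C/C++ string literal node has only its two quote characters as children, so a walk that collects leaves never sees the text in between.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Toy syntax tree node (hypothetical, not the tree-sitter API)."""
    type: str
    text: str
    children: List["Node"] = field(default_factory=list)


class ToyLeafVisitor:
    """Collects the text of leaf nodes only."""

    def __init__(self):
        self.tokens = []

    def walk(self, node):
        # Prefer a type-specific hook such as visit_string_literal.
        hook = getattr(self, "visit_" + node.type, self.visit)
        if hook(node):  # True means: keep descending into the children
            for child in node.children:
                self.walk(child)

    def visit(self, node):
        if not node.children:
            self.tokens.append(node.text)
            return False
        return True


class ToyCLeafVisitor(ToyLeafVisitor):
    def visit_string_literal(self, node):
        # Emit the whole literal as one token and do not descend further.
        self.tokens.append(node.text)
        return False


# Assumed tree shape: the literal's only children are its two quote marks,
# so the characters between them belong to no leaf node.
literal = Node("string_literal", '"hello world"',
               [Node("quote", '"'), Node("quote", '"')])
call = Node("call_expression", 'printf("hello world")', [
    Node("identifier", "printf"),
    Node("lparen", "("),
    literal,
    Node("rparen", ")"),
])

plain = ToyLeafVisitor()
plain.walk(call)
fixed = ToyCLeafVisitor()
fixed.walk(call)

print(plain.tokens)  # ['printf', '(', '"', '"', ')']
print(fixed.tokens)  # ['printf', '(', '"hello world"', ')']
```

Under these assumptions, the first list shows the same loss as in the report above, while the second keeps `"hello world"` as a single token.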