c2nes / javalang

Pure Python Java parser and tools
MIT License
737 stars 161 forks

Option to ignore errors on tokenization. #51

cassianomonteiro closed 6 years ago

cassianomonteiro commented 6 years ago

Option to ignore errors on tokenization. Useful when parsing snippets of code instead of complete blocks or files.

c2nes commented 6 years ago

Could you help me understand the use case for this a bit more? The test case given is a bit odd since it appears to be tokenizing a fragment of a block comment (which would normally be treated as a single "token"). There are also a few other cases in the tokenizer which would need updates to support ignoring errors (including strings and some number literals).

cassianomonteiro commented 6 years ago

I'm using this to tokenize snippets of code and patches... So usually I get fragments which are not exactly complete pieces of code, but rather one-line changes. That's why this test case is just a fragment of a comment, and not the complete thing.
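To make the idea concrete, here is a minimal, self-contained sketch of a tokenizer with an opt-in `ignore_errors` flag. It is a toy written for illustration, not javalang's actual implementation: the token patterns, the `LexerError` class, and the `(tokens, errors)` return shape are all assumptions made for this example. With the flag off, an unterminated block comment raises; with it on, the problem is recorded and the tokens read so far are kept.

```python
import re


class LexerError(Exception):
    """Raised when the input cannot be tokenized (hypothetical, for this sketch)."""


# Deliberately simplified token patterns; this is NOT the full Java grammar.
TOKEN_RE = re.compile(r"""
    (?P<string>"(?:\\.|[^"\\\n])*")
  | (?P<number>\d+)
  | (?P<identifier>[A-Za-z_$][A-Za-z0-9_$]*)
  | (?P<operator>&&|\|\||\+\+|--|[-+*/=<>!&|^%]=?|[~?:])
  | (?P<separator>[(){}\[\];,.@])
""", re.VERBOSE)


def tokenize(code, ignore_errors=False):
    """Return (tokens, errors); raise on the first error unless ignore_errors is set."""
    tokens, errors = [], []
    pos = 0
    while pos < len(code):
        if code[pos].isspace():
            pos += 1
            continue
        if code.startswith("//", pos):
            # Line comment: skip to end of line (or end of input).
            newline = code.find("\n", pos)
            pos = len(code) if newline == -1 else newline + 1
            continue
        if code.startswith("/*", pos):
            end = code.find("*/", pos + 2)
            if end == -1:
                error = LexerError("unterminated block comment at offset %d" % pos)
                if not ignore_errors:
                    raise error
                errors.append(error)
                break  # treat the rest of the fragment as comment text
            pos = end + 2
            continue
        match = TOKEN_RE.match(code, pos)
        if match is None:
            error = LexerError("unexpected character %r at offset %d" % (code[pos], pos))
            if not ignore_errors:
                raise error
            errors.append(error)
            pos += 1  # skip the offending character and keep going
            continue
        tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens, errors
```

Calling `tokenize("int x = 1; /* unfinished", ignore_errors=True)` keeps the five tokens before the comment and reports one error, while the same call without the flag raises `LexerError`. Collecting errors instead of silently dropping them lets the caller decide how much to trust a given fragment.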

Deathnerd commented 6 years ago

I had a use case for this a few weeks ago but the details have left me... +1 for this as I can definitely see the benefits of parsing incomplete snippets. @cassianomonteiro sounds like you're automating your code review with this?

cassianomonteiro commented 6 years ago

@Deathnerd I'm researching vulnerability detection in Android libraries. In my current project, I'm trying to detect known vulnerabilities in other versions of a library using machine learning. To calculate features for my prediction model, I'm using Java tokens from fix patches. That's why I'm trying to parse snippets of code.
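As a rough illustration of that workflow, the sketch below pulls the added lines out of a unified diff and counts the tokens on them. Everything here is an assumption for the sake of the example: the thread does not say which lines of a patch are used or how tokens become features, and the one-line regular expression stands in for a real Java tokenizer. The names `added_lines` and `token_counts` are hypothetical.

```python
import re
from collections import Counter

# Crude stand-in for a Java tokenizer: identifiers, integers, or any single symbol.
WORD_OR_SYMBOL = re.compile(r"[A-Za-z_$][A-Za-z0-9_$]*|\d+|\S")


def added_lines(patch):
    """Yield the content of lines a unified diff adds, skipping the '+++' file header."""
    for line in patch.splitlines():
        if line.startswith("+") and not line.startswith("+++"):
            yield line[1:]


def token_counts(patch):
    """Count token occurrences across all added lines (one possible feature vector)."""
    counts = Counter()
    for line in added_lines(patch):
        counts.update(WORD_OR_SYMBOL.findall(line))
    return counts


example_patch = """\
--- a/Foo.java
+++ b/Foo.java
@@ -1,3 +1,3 @@
-    if (len > max) {
+    if (len >= max) {
     return;
"""

features = token_counts(example_patch)
```

Because each added line is a fragment taken out of context, a strict tokenizer can fail on it (for example, a line from the middle of a block comment), which is where an option to ignore errors becomes useful.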

Deathnerd commented 6 years ago

@cassianomonteiro That sounds seriously awesome. I wish you luck in your endeavors!

cassianomonteiro commented 6 years ago

@c2nes Thanks for reviewing this! I'm a little busy these days, but I will try to work on it as soon as possible.

c2nes commented 6 years ago

Looks good to me. Thanks again @cassianomonteiro!