Will use existing tools to parse code into ASTs and extract properties of the trees as additional features. Will add details here as I look into 1) existing tools, 2) useful properties of ASTs, 3) how to use the tools with reasonable complexity.
By the end of this, we should be able to generate the additional features and explore them with regard to our classification problem.
Hoping this makes our models more robust to spurious word choices, such as the same function names appearing across all files in a library or in a crypto competition. Many competitions enforce an interface so that solutions can be tested easily, so e.g. encrypt/decrypt functions are giveaways. This is especially problematic when 10% of our positive examples have them.
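
As a rough illustration of the idea, here is a minimal sketch that assumes the files to classify are Python and uses the standard-library ast module. Neither the tool nor the feature set has been chosen yet, so the function name ast_features and the specific features below are hypothetical placeholders. The sketch counts node types and measures tree depth, then checks that two functions differing only in identifier names yield identical features.

```python
import ast
from collections import Counter


def ast_features(source: str) -> dict:
    """Return a few name-independent structural features of Python source.

    Hypothetical feature set, for illustration only.
    """
    tree = ast.parse(source)
    # Count every node by its type name; identifiers are never inspected.
    node_counts = Counter(type(node).__name__ for node in ast.walk(tree))

    def depth(node: ast.AST) -> int:
        # Depth of the subtree rooted at node (a leaf has depth 1).
        children = list(ast.iter_child_nodes(node))
        return 1 + max((depth(child) for child in children), default=0)

    bitwise = ("BitXor", "BitAnd", "BitOr", "LShift", "RShift")
    return {
        "n_nodes": sum(node_counts.values()),
        "max_depth": depth(tree),
        "n_functions": node_counts["FunctionDef"],
        "n_loops": node_counts["For"] + node_counts["While"],
        "n_bitwise_ops": sum(node_counts[name] for name in bitwise),
    }


# Same structure, different identifiers: one uses a giveaway name.
with_giveaway = "def encrypt(data, key):\n    return [b ^ key for b in data]\n"
renamed = "def scramble(xs, k):\n    return [x ^ k for x in xs]\n"

features_a = ast_features(with_giveaway)
features_b = ast_features(renamed)
print(features_a == features_b)
```

Because the features depend only on tree shape and node types, renaming encrypt to anything else leaves them unchanged, which is the robustness property we are after. For other languages a different parser would be needed; that choice is still open.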