DeNepo / corpus-analysis-notes

What does it mean to actually "analyze" code? #2

Open lpmi-13 opened 2 years ago

lpmi-13 commented 2 years ago

Is it more of an AST-type problem (meaning we need to stick with JS for JS files)?

Or is it more of a general text-processing problem (meaning we don't need to have everything in JS)?

I'm assuming it might be more of the latter, but I still can't find any evidence of anyone actually having done any of this, so it's really uncharted territory.

Some ideas:

colevandersWands commented 2 years ago

the list of keywords for any language is a very small fixed set, so those should be easy to analyze in a general text-processing approach.

this gets tricky to analyze with an AST. by the time you start doing enough regex/heuristic guesswork to filter out what's a keyword in a comment vs. in active code, you might as well have just parsed it. but you can parse the code and then regenerate it, removing all comments along the way; after that, a purely textual analysis is easier. you still might mistake code-like strings for code, though.
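
To make the comment-vs-active-code problem concrete, here is a minimal sketch, not anything from this repo. It assumes Python source as a stand-in for JS, only because Python's standard library ships a tokenizer; `SAMPLE`, `naive_keyword_counts`, and `token_keyword_counts` are hypothetical names made up for the illustration. The first function is the "general text-processing" count, the second only counts keywords that the tokenizer sees in live code.

```python
import io
import keyword
import re
import tokenize
from collections import Counter

# Hypothetical sample: keywords appear in a comment, in a string, and in live code.
SAMPLE = '''\
# if this fails, return early
def check(x):
    msg = "for while if"
    if x:
        return msg
    return None
'''


def naive_keyword_counts(source):
    """Count keyword-looking words anywhere in the text, comments and strings included."""
    words = re.findall(r"[A-Za-z_]+", source)
    return Counter(w for w in words if keyword.iskeyword(w))


def token_keyword_counts(source):
    """Count keywords only where the tokenizer reports them as NAME tokens.

    Comments arrive as COMMENT tokens and string contents as STRING tokens,
    so keyword-like words inside them are skipped.
    """
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    return Counter(
        tok.string
        for tok in tokens
        if tok.type == tokenize.NAME and keyword.iskeyword(tok.string)
    )


naive = naive_keyword_counts(SAMPLE)
exact = token_keyword_counts(SAMPLE)
print("naive:", dict(naive))
print("token-aware:", dict(exact))
```

On this sample the naive count sees `if` three times (comment, string, code) while the token-aware count sees it once, and `for`/`while` disappear entirely because they only occur inside the string. For JS files the same idea would need a JS tokenizer or parser in place of `tokenize`.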

colevandersWands commented 2 years ago

> Is it more of an AST-type problem (meaning we need to stick with JS for JS files)?
>
> Or is it more of a general text-processing problem (meaning we don't need to have everything in JS)?

I'd say by any means necessary. different methods will be better suited to different questions.

and there are JS AST parsers written in other languages, so even if we limit ourselves to analyzing JS programs, our analyses don't need to be written in JS. we can choose based on the libraries/tools available or our own preferences.

colevandersWands commented 2 years ago

parse out the functions using the AST parser and then count things a la NLTK (would there be any value in eventually creating something like NLTK, but for code?).

we won't know till we try. but I'd assume there is
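
One way to try the "parse out the functions, then count things" idea is sketched below. It is a rough illustration under stated assumptions: it analyzes Python source with the standard-library `ast` module as a stand-in for a JS AST parser, and `SAMPLE` and `node_counts_per_function` are hypothetical names. `collections.Counter` plays roughly the role a frequency distribution plays in NLTK, except that the "words" are AST node types.

```python
import ast
from collections import Counter

# Hypothetical sample corpus: two small functions.
SAMPLE = '''\
def add(a, b):
    return a + b

def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
'''


def node_counts_per_function(source):
    """Map each function name to a Counter of the AST node types inside it.

    Note: a nested function's nodes are also counted in its enclosing
    function, and functions sharing a name overwrite each other.
    """
    tree = ast.parse(source)
    result = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            result[node.name] = Counter(
                type(child).__name__ for child in ast.walk(node)
            )
    return result


counts = node_counts_per_function(SAMPLE)
for name, counter in counts.items():
    print(name, counter.most_common(3))
```

From here, questions like "how many `return` statements does a typical function have?" become ordinary counting over `counts`, which is the part that resembles corpus work on natural language.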