Open lpmi-13 opened 2 years ago
The list of keywords for any language is a very small fixed set, so those should be easy to analyze with a general text-processing approach.
This gets tricky to analyze with an AST. By the time you start doing enough regex/heuristic guesswork to filter out what's a keyword in a comment vs. in active code, you might as well have just parsed it. But you can parse the code and then regenerate it, removing all comments along the way; a purely textual analysis would be easier after that. You still might mistake code-like strings for actual code, though.
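To make the comment-vs-code problem concrete, here is a rough sketch of a keyword count that blanks out comments and string literals first. It is a tiny hand-written scanner, not a real parser: `JS_KEYWORDS` is an illustrative subset I picked, both function names are made up for this example, and it will get regex literals and `${...}` inside template literals wrong, which is exactly the kind of edge case a proper parser handles for you.

```python
import re
from collections import Counter

# Illustrative subset of JS keywords (an assumption, not the full list).
JS_KEYWORDS = {"const", "let", "var", "function", "return",
               "if", "else", "for", "while"}


def strip_comments_and_strings(src: str) -> str:
    """Drop // and /* */ comments and quoted strings using a small scanner."""
    out = []
    i, n = 0, len(src)
    while i < n:
        two = src[i:i + 2]
        ch = src[i]
        if two == "//":
            # Line comment: skip up to (not including) the newline.
            while i < n and src[i] != "\n":
                i += 1
        elif two == "/*":
            # Block comment: skip to the closing */ (or end of input).
            end = src.find("*/", i + 2)
            i = n if end == -1 else end + 2
            out.append(" ")
        elif ch in "'\"`":
            # String literal: skip to the matching quote, honoring backslashes.
            quote = ch
            i += 1
            while i < n and src[i] != quote:
                i += 2 if src[i] == "\\" else 1
            i += 1  # step past the closing quote
            out.append(" ")
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def keyword_counts(src: str) -> Counter:
    """Count keywords that appear in active code only."""
    words = re.findall(r"[A-Za-z_$][\w$]*", strip_comments_and_strings(src))
    return Counter(w for w in words if w in JS_KEYWORDS)
```

Calling `keyword_counts` on a snippet where `return` appears once in code and once in a comment should count it once. If the scanner keeps growing to cover more edge cases, that is probably the point where swapping in a real JS parser pays off.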
Is it more of an AST-type problem (meaning we need to stick with JS for JS files)...?
Or is it more of a general text-processing problem (meaning we don't need to have everything in JS)...
I'd say by any means necessary. Different methods will be better suited to different questions.
And there are JS AST parsers written in other languages, so even if we are limiting ourselves to analyzing JS programs, our analyses don't need to be written in JS. We can choose based on the libraries/tools available or our own preferences.
Parse out the functions using the AST parser and then count things a la NLTK (would there be any value in eventually creating something like NLTK, but for code?).
We won't know till we try, but I'd assume there is.
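As a rough illustration of "parse out the functions, then count things", here is a sketch that uses Python's built-in `ast` module on Python source as a stand-in, only because it ships with the standard library. For JS files the same shape should carry over to whichever JS parser we end up picking. The function name `node_counts_per_function` is made up for this example.

```python
import ast
from collections import Counter


def node_counts_per_function(source: str) -> dict:
    """Map each function name to a Counter of the AST node types in its body."""
    tree = ast.parse(source)
    counts = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Count every node under the function, excluding the def itself.
            counts[node.name] = Counter(
                type(child).__name__
                for child in ast.walk(node)
                if child is not node
            )
    return counts
```

Because this works on the tree rather than the text, a `return` inside a comment or a string never shows up as a `Return` node. Note that nested functions are also counted as part of their parent here, and two functions with the same name overwrite each other, so a real analysis would want a better key than the bare name.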
Is it more of an AST-type problem (meaning we need to stick with JS for JS files)...?
Or is it more of a general text-processing problem (meaning we don't need to have everything in JS)...
I'm assuming it might be more of the latter, but I still can't find any evidence of anyone actually having done any of this, so it's really uncharted territory.
Some ideas: