DJTB opened 7 years ago
Thank you!
I think having the ability to parse actual code and categorize each line is a very powerful idea. However, that would be too expensive and time-consuming for me to do. I guess one way to do it would be to translate each file into a language-specific abstract syntax tree using user-defined functions in BigQuery, and then emit categorized lines.
Or maybe there is an easier way?
I'm not sure about other languages, but for web-related tech you could first run everything through something like https://github.com/vitaly-t/decomment
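To make the idea concrete, here is a rough sketch of stripping comments before counting words. It is not how decomment works internally and not what the dataset pipeline does; it is a small hand-rolled scanner, and the names `stripComments`, `countWord`, and `sample` are made up for illustration. It assumes JavaScript-style `//` and `/* */` comments and deliberately ignores regex literals and `${}` nesting in template literals, so a real run would probably want a proper parser or a library instead.

```javascript
// Hypothetical helper: drops // and /* */ comments from JavaScript source.
// Simplification: regex literals and template-literal ${} nesting are not handled.
function stripComments(source) {
  let out = '';
  let state = 'code'; // 'code' | 'line' | 'block' | 'string'
  let quote = '';     // active string delimiter while state === 'string'

  for (let i = 0; i < source.length; i++) {
    const ch = source[i];
    const next = source[i + 1];

    if (state === 'code') {
      if (ch === '/' && next === '/') { state = 'line'; i++; }
      else if (ch === '/' && next === '*') { state = 'block'; i++; }
      else {
        if (ch === '"' || ch === "'" || ch === '`') { state = 'string'; quote = ch; }
        out += ch;
      }
    } else if (state === 'line') {
      // A line comment ends at the newline, which is kept.
      if (ch === '\n') { state = 'code'; out += ch; }
    } else if (state === 'block') {
      if (ch === '*' && next === '/') { state = 'code'; i++; }
      else if (ch === '\n') { out += ch; } // keep line numbers stable
    } else {
      // Inside a string literal: copy everything, honoring backslash escapes.
      out += ch;
      if (ch === '\\' && next !== undefined) { out += next; i++; }
      else if (ch === quote) { state = 'code'; }
    }
  }
  return out;
}

// Hypothetical helper: counts case-insensitive occurrences of one word-like token.
function countWord(source, word) {
  const tokens = source.toLowerCase().match(/[a-z_$][a-z0-9_$]*/g) || [];
  return tokens.filter((t) => t === word).length;
}

const sample = [
  '// the loop below sums the values',
  'const url = "http://example.com"; /* the end */',
  'let total = 0; // the total',
].join('\n');

console.log(countWord(sample, 'the'));                // 4
console.log(countWord(stripComments(sample), 'the')); // 0
```

Note that the `//` inside the string literal survives, which is the main reason a plain regex replace is not enough here.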
Hey hey, I love what you've done here!
It seems a bit ridiculous though that “the” is in the top 10 (edit: for JavaScript at least), when the occurrences are all(?) from comments. It would be great to see a dataset that doesn't include comments.