anvaka / common-words

visualization of common words in different programming languages
https://anvaka.github.io/common-words
MIT License
505 stars 27 forks

Strip comments #7

Open DJTB opened 7 years ago

DJTB commented 7 years ago

Hey hey, I love what you've done here!

It seems a bit ridiculous, though, that “the” is in the top 10 (edit: for JavaScript at least) when the occurrences are all(?) from comments. It would be great to see a dataset that doesn't include comments.

anvaka commented 7 years ago

Thank you!

I think having the ability to parse actual code and categorize each line is a very powerful idea. However, that would be too expensive and time-consuming for me to do. I guess one way to do it would be to translate each file into a language-specific abstract syntax tree using user-defined functions in BigQuery, and then emit categorized lines.
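To make "emit categorized lines" a bit more concrete, here is a rough sketch of the idea for a single language. It is not the BigQuery setup described above: it assumes Python source and uses the standard library `tokenize` module (a tokenizer, not a full AST) as a stand-in. The function name `categorize_lines` is made up for illustration.

```python
import io
import tokenize


def categorize_lines(source: str) -> dict[int, str]:
    """Label each non-blank line as 'code', 'comment', or 'mixed'.

    Hypothetical helper: works on Python source only, via the stdlib tokenizer.
    """
    # Tokens that carry no content of their own (layout only).
    layout = {
        tokenize.NL,
        tokenize.NEWLINE,
        tokenize.INDENT,
        tokenize.DEDENT,
        tokenize.ENDMARKER,
    }
    kinds: dict[int, set[str]] = {}
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in layout:
            continue
        kind = "comment" if tok.type == tokenize.COMMENT else "code"
        # A token (e.g. a triple-quoted string) may span several lines.
        for line_no in range(tok.start[0], tok.end[0] + 1):
            kinds.setdefault(line_no, set()).add(kind)
    return {
        line_no: "mixed" if len(found) > 1 else next(iter(found))
        for line_no, found in sorted(kinds.items())
    }


sample = "# the answer\nx = 1  # the value\ny = x + 1\n"
print(categorize_lines(sample))
```

Every other language would need its own tokenizer or parser, which is where the cost comes from.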

Or maybe there is an easier way?

DJTB commented 7 years ago

I'm not sure about other languages, but for web-related tech you could first run everything through something like https://github.com/vitaly-t/decomment
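For a sense of what that pre-processing step involves, here is a minimal sketch in Python. It is not decomment's API or implementation, just a small hand-written scanner under some loud assumptions: it only knows `//` and `/* */` comments and quoted strings, and it does not handle regex literals or `${...}` expressions inside template literals. The name `strip_js_comments` is hypothetical.

```python
def strip_js_comments(source: str) -> str:
    """Remove // and /* */ comments from JS-like source, leaving strings intact.

    Simplified sketch: regex literals and template-literal interpolation
    are NOT handled.
    """
    out = []
    i, n = 0, len(source)
    quote = None  # the opening quote character while inside a string literal
    while i < n:
        ch = source[i]
        nxt = source[i + 1] if i + 1 < n else ""
        if quote:
            out.append(ch)
            if ch == "\\" and nxt:
                # Keep the escaped character so an escaped quote does not end the string.
                out.append(nxt)
                i += 2
                continue
            if ch == quote:
                quote = None
            i += 1
        elif ch in "\"'`":
            quote = ch
            out.append(ch)
            i += 1
        elif ch == "/" and nxt == "/":
            # Line comment: skip up to (but not including) the newline.
            while i < n and source[i] != "\n":
                i += 1
        elif ch == "/" and nxt == "*":
            # Block comment: skip past the closing */ (or to end of input).
            end = source.find("*/", i + 2)
            i = n if end == -1 else end + 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


sample = 'var url = "http://x"; // the url\n/* the block */ var y = 1;\n'
print(strip_js_comments(sample))
```

Running the word count on the stripped output rather than the raw file should keep words like “the” from being counted when they only appear in comments.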