cgag / loc

Count lines of code quickly.
MIT License

Count for python is wildly inaccurate #111

Open olivren opened 5 years ago

olivren commented 5 years ago

I tried this tool for the very first time, to count the number of lines of code of a Python project. The numbers it reports are shockingly inaccurate. It reports a correct number of total lines and blank lines, but it over-counts the number of comments.

I investigated a bit, and I found a simple example that reports 6 lines of comments and 0 lines of code:

'''
This is a module docstring
'''
a = 1
b = 2
c = 3

So, loc correctly tries to match the docstring delimited by three single quotes, but ends up matching the whole file.

Additional notes about Python comments

In Python, '''hello''' and """hello""" are string literals, but they are considered a docstring comment only if they appear at the top level of the file, or in a class or function definition. A good heuristic to tell them apart is to count only the triple-quoted string literals that start at the beginning of a line (ignoring leading whitespace).
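To make the heuristic concrete, here is a minimal sketch in Python. It is not how loc, tokei, or scc work internally; `count_lines` and its helper are hypothetical names made up for illustration. It assumes a triple-quoted string counts as a comment only when the delimiter is the first thing on the line, and it deliberately ignores string prefixes such as r""" as well as triple quotes that appear inside `#` comments or ordinary strings.

```python
TRIPLE_QUOTES = ("'''", '"""')


def _first_triple_quote(text):
    """Return (position, delimiter) of the earliest triple quote, or None."""
    hits = [(text.find(q), q) for q in TRIPLE_QUOTES if q in text]
    return min(hits) if hits else None


def count_lines(source):
    """Classify each line as code, comment, or blank (rough heuristic)."""
    counts = {"code": 0, "comment": 0, "blank": 0}
    closing = None  # delimiter of the multi-line string we are inside, if any
    kind = None     # how lines of that open string are counted
    for line in source.splitlines():
        stripped = line.strip()
        if closing is not None:
            # Inside a multi-line string: every line inherits its kind,
            # including blank lines (a simplification).
            counts[kind] += 1
            if closing in stripped:
                closing = None
            continue
        if not stripped:
            counts["blank"] += 1
            continue
        if stripped.startswith("#"):
            counts["comment"] += 1
            continue
        opener = _first_triple_quote(stripped)
        if opener is None:
            counts["code"] += 1
            continue
        pos, delim = opener
        # The heuristic: a docstring starts the line, a plain string does not.
        kind = "comment" if pos == 0 else "code"
        counts[kind] += 1
        if delim not in stripped[pos + 3:]:
            closing = delim  # the string continues on the following lines
    return counts


docstring_file = "'''\nThis is a module docstring\n'''\na = 1\nb = 2\nc = 3\n"
print(count_lines(docstring_file))
# {'code': 3, 'comment': 3, 'blank': 0}
```

Under this sketch a string opened in the middle of a line, as in a = '''hello, is counted as code on every line it spans, and both quote styles are treated the same way.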

Here is another example where loc counts 2 lines of comment and 1 line of code:

a = '''hello
world
'''

And another one, where loc counts all 6 lines as code:

"""
This is a module docstring
"""
a = 1
b = 2
c = 3

For what it's worth, tokei is no better, as it ignores docstring comments entirely (which is a very poor choice in my opinion).

boyter commented 5 years ago

Not trying to hijack the conversation away from loc, but @olivren, did you try https://github.com/boyter/scc as a comparison? I believe it handles all these cases as you would expect.

I ask because I keep an eye on all of the counters and try to add any issues into its test suite to make it as accurate as possible.

olivren commented 5 years ago

@boyter I just tried with scc 2.2.0, and it does not handle docstrings at all. I opened an issue about that: https://github.com/boyter/scc/issues/62

olivren commented 5 years ago

Erratum: I previously said that Tokei ignores docstring comments (by which I meant that it counts them as code). This is in fact the default behavior, but Tokei has a configuration option that enables the correct behavior of counting all docstrings as comments (treat_doc_strings_as_comments = true in tokei.toml).