cgag / loc

Count lines of code quickly.
MIT License

discrepancy reported between loc and cloc on the exact same repo #15

Open ye opened 7 years ago

ye commented 7 years ago

I just cloned and cargo built loc from the repo. Here are the results I've tested on the exact same test code base.

As you can see there are quite significant discrepancies reported by the two programs. If loc were a re-implementation of cloc then I would expect the discrepancies to be small if there were any.

$ loc -V
count 0.1
$ loc codebase
---------------------------------------------------------------------------------
 Language              Files        Lines        Blank      Comment         Code
---------------------------------------------------------------------------------
 JavaScript             9080      1032352       131646       225231       675475
 JSON                   1139       133076          369            0       132707
 Markdown               1115       159295        46234            0       113061
 Python                  207        70457        12095         4561        53801
 C++                      64        26719         3483         3171        20065
 HTML                    211        21543         2607         1867        17069
 Sass                    112        18359         1665         1497        15197
 C/C++ Header             97        17423         2551         1711        13161
 XML                      21         9826          475           22         9329
 YAML                    247         6611          260           70         6281
 CSS                      41         7625         1157          529         5939
 Plain Text               75         1933          330            0         1603
 Makefile                 49         2624          438          738         1448
 SQL                       2         1325          238            0         1087
 Lua                       6         1209          225           36          948
 TypeScript                2         1038          141          104          793
 Less                      3          797           94           11          692
 Bourne Shell             21          840          142          123          575
 Autoconf                  4          799           74          263          462
 Lisp                      4          350           42           38          270
 ASP.NET                   6          265            0            0          265
 Handlebars                4          200           18            0          182
 C                         5          258           45           37          176
 CoffeeScript             11          112           23            9           80
 Ruby                      3           26            4            2           20
 Batch                     2           10            2            0            8
 Z Shell                   1           25            4           15            6
---------------------------------------------------------------------------------
 Total                 12532      1515097       204362       240035      1070700
---------------------------------------------------------------------------------

$ cloc codebase
   13950 text files.
    9763 unique files.                                          
    5923 files ignored.

https://github.com/AlDanial/cloc v 1.66  T=58.68 s (138.1 files/s, 20273.4 lines/s)
-----------------------------------------------------------------------------------
Language                         files          blank        comment           code
-----------------------------------------------------------------------------------
JavaScript                        6082         110844         185893         584265
JSON                              1048            331              0         123278
Python                             189          12106           8513          49919
C++                                 64           3483           3174          20062
HTML                               204           2599            165          18683
SASS                               111           1665           1078          15616
C/C++ Header                        97           2550           1711          13162
XML                                 19            242             11           7381
CSS                                 39           1157            528           5940
YAML                               156            241             66           5684
Bourne Shell                        24            474            454           2136
SQL                                  2            238              0           1087
TypeScript                           2            141            104            793
Lua                                  5            168             27            686
LESS                                 2             82             10            606
make                                25            178             40            575
m4                                   2             40              2            266
Lisp                                 3             42             38            264
Bourne Again Shell                  10             54             27            184
C                                    3             31             29            130
Smarty                               6             17             30             91
CoffeeScript                         5             16              8             65
Handlebars                           2              8              0             42
Windows Resource File                1              1              1             33
Ruby                                 2              2              2             12
DOS Batch                            2              2              0              8
zsh                                  1              4             14              7
-----------------------------------------------------------------------------------
SUM:                              8106         136716         201925         850975
-----------------------------------------------------------------------------------

PS: loc took around 2-3 seconds to finish; it would be nice to have the elapsed time reported in the output as well. cloc took almost a minute, so that's about a 20-30x improvement, not the 100x claimed.

cgag commented 7 years ago

Hmm, will look into these later, but at the moment I'm inclined to trust mine, since I believe I'm very slightly more accurate on C++, and JavaScript comments should be handled the same as C++. I'm hoping to put together a script soon to identify the files with the largest discrepancies for manual testing. Will get back to you when I do.
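A minimal sketch of what such a discrepancy script could look like. This is not cgag's script; the function name and the two input dicts (file path to code-line count, assumed to have been parsed from each tool's per-file output beforehand) are hypothetical.

```python
def largest_discrepancies(loc_counts, cloc_counts, top=10):
    """Rank files by the absolute difference in reported code lines.

    Both arguments are hypothetical dicts mapping file path -> code line
    count. A file missing from one tool is treated as a count of 0 there,
    so files that only one tool recognises also show up.
    """
    rows = []
    for path in set(loc_counts) | set(cloc_counts):
        a = loc_counts.get(path, 0)
        b = cloc_counts.get(path, 0)
        if a != b:
            rows.append((abs(a - b), path, a, b))
    # Largest gap first; ties are broken by path so the order is stable.
    rows.sort(key=lambda r: (-r[0], r[1]))
    return rows[:top]
```

Each returned tuple is (difference, path, loc count, cloc count), so the first few entries are the files most worth inspecting by hand.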

Re timing: was that cold cache on loc and warm on cloc? Can you try running loc twice? I just got 160x faster testing them both against a large code base (openbsd). If not, let me know.

ye commented 7 years ago

@cgag re: discrepancies: yes, I think a set of baseline comparison tests would be great, and it would help find bugs too, since cloc is battle-tested and proven for the most part.

re: speed improvement stats: I think you are right. Unfortunately I didn't time the first run, but subsequent runs were much faster (~0.25s to ~0.28s)! Now I wonder what got cached, and where the cached data is stored?

cgag commented 7 years ago

For the timing, the operating system caches files it accesses in memory by default, so the first time you read it, it has to get it from disk, but the second time you try to read a file, it should be read from memory, which is much much faster. The OS will use any free memory to cache files until an application needs it.
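The effect described above can be observed with a small timing sketch. The function name is made up for illustration, and the first read is only "cold" if the file is not already in the OS cache, so treat the numbers as indicative rather than a benchmark.

```python
import time


def time_two_reads(path):
    """Read `path` twice and return (first_seconds, second_seconds).

    If the file was not cached before the call, the first read comes from
    disk and the second is usually served from memory by the OS.
    """
    timings = []
    for _ in range(2):
        start = time.perf_counter()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks and discard the data; only the time matters.
            while f.read(1 << 20):
                pass
        timings.append(time.perf_counter() - start)
    return tuple(timings)
```

On a large file that has not been touched recently, the second value is typically much smaller than the first.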

ye commented 7 years ago

If so, I wonder why cloc takes about the same time (~1 min) on repeated runs?

cgag commented 7 years ago

That's because cloc is CPU bound. It counts more slowly than it reads off disk, so the CPU is the bottleneck, which caching files in memory doesn't help. For loc, reading off of disk is the slowest portion, so making it faster through caching provides a huge speed up.

Ngo-The-Trung commented 7 years ago

I ran this on the valgrind repository (checkout r16117).

cloc and loc reported different results. cloc:

    5146 text files.
    4437 unique files.
    7445 files ignored.

http://cloc.sourceforge.net v 1.60  T=11.85 s (243.7 files/s, 126250.3 lines/s)
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
C                              1185          97930         107742         635880
Expect                          921          24146           6947         451965
C/C++ Header                    324          13612          24090          54026
XML                             136           3870            733          21655
Assembly                         57           2573           3271           8428
make                             79            983            470           7989
C++                              28           1376           1377           7138
Teamcenter def                   13              0            213           4658
m4                                1            531              4           3913
Perl                             17            729            518           3290
Bourne Shell                    107            538            634           2160
XSLT                              6            189            125           1152
Bourne Again Shell                8             95            131            377
Haskell                           4            109             70            250
XSD                               1             17             10            211
Korn Shell                        1             31             24            150
CSS                               1             10              4             53
--------------------------------------------------------------------------------
SUM:                           2889         146739         146363        1203295
--------------------------------------------------------------------------------

loc:

--------------------------------------------------------------------------------
 Language             Files        Lines        Blank      Comment         Code
--------------------------------------------------------------------------------
 C                     1192       839672        97549       106976       635147
 C/C++ Header           324        91728        13613        24072        54043
 XML                    136        26258         3870          732        21656
 Makefile                83         9991         1071          505         8415
 Plain Text              49        10557         2593            0         7964
 Assembly                57        14272         2573         4524         7175
 C++                     28         9891         1377         1377         7137
 Autoconf                14         5751          800         1361         3590
 Perl                     4         2694          419          129         2146
 Bourne Shell             3          621           14            4          603
 Haskell                  4          429          109           70          250
 CSS                      1           67           10            4           53
--------------------------------------------------------------------------------
 Total                 1895      1011931       123998       139754       748179
--------------------------------------------------------------------------------

I've run a file-by-file diff (only on .c files) and here's the result. Lines prefixed with ">" are from loc, "<" are from cloc. Many counts are off by 1 or 2; more interestingly, there are certain files where loc reports 0 lines.

I can't say that cloc is 100% correct but on the few files I inspected loc indeed miscounted the number of lines.

Also an example of something loc would report wrongly: (C file)

Bool h_clo_partial_loads_ok  = True;   /* user visible */
/* Bool h_clo_lossage_check     = False; */ /* dev flag only */

loc would return 2. cloc returns 1. Edit: fix for this
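To make the expected count concrete, here is a simplified line classifier for C block comments. It is an illustrative sketch, not loc's or cloc's actual implementation, and it deliberately ignores string literals and other edge cases; the function name is hypothetical.

```python
def classify_c_line(line, in_block=False):
    """Classify one line of C as 'blank', 'comment', or 'code'.

    Returns (kind, in_block), where in_block tells whether a /* ... */
    comment is still open at the end of the line.
    """
    if not line.strip():
        return "blank", in_block
    saw_code = False
    i = 0
    while i < len(line):
        if in_block:
            end = line.find("*/", i)
            if end == -1:
                break  # the comment continues on the next line
            in_block = False
            i = end + 2
        elif line.startswith("/*", i):
            in_block = True
            i += 2
        elif line.startswith("//", i):
            break  # the rest of the line is a line comment
        else:
            # Whitespace between two comments must not count as code.
            if not line[i].isspace():
                saw_code = True
            i += 1
    return ("code" if saw_code else "comment"), in_block


snippet = [
    "Bool h_clo_partial_loads_ok  = True;   /* user visible */",
    "/* Bool h_clo_lossage_check     = False; */ /* dev flag only */",
]
kinds = [classify_c_line(line)[0] for line in snippet]
```

For the two lines above, `kinds` is `["code", "comment"]`, which gives 1 line of code, matching cloc's answer.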

Edit 2: On this file it fails because it's ISO-8859-1 encoded and the code assumes every file is UTF-8
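One way to avoid depending on UTF-8, sketched here under the assumption that only line and blank counts are needed: work on raw bytes, and fall back to ISO-8859-1 only when text is actually required. Both function names are hypothetical, and this is not how loc handles encodings.

```python
def count_lines_bytes(data):
    """Count (total, blank) lines in raw bytes without decoding them."""
    lines = data.splitlines()
    blank = sum(1 for ln in lines if not ln.strip())
    return len(lines), blank


def decode_with_fallback(data):
    """Decode as UTF-8, falling back to ISO-8859-1 on invalid input."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # ISO-8859-1 maps every byte to a code point, so this cannot fail,
        # although the result may be wrong for other 8-bit encodings.
        return data.decode("iso-8859-1")
```

Comment markers such as `/*` and `//` are plain ASCII, so byte-level counting stays correct for ASCII-compatible encodings like UTF-8 and ISO-8859-1.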

cgag commented 7 years ago

Good catch on the whitespace, I guess I should have bothered with it. I looked at the PR and it looks correct, but chars.nth(pos) is probably less than ideal since nth is O(n), and we should be safe to just index into trimmed due to the is_char_boundary checks at the top of the loop.

Thanks for the valgrind diff. I'll test out the whitespace change and start digging into any differences. In my experience cloc is off by one fairly often, but any larger differences should be interesting.

Ngo-The-Trung commented 7 years ago

Are you going to add support for other forms of encoding, or is the tool going to be limited to utf8 files only?

pvdb commented 7 years ago

ISO-8859-1 support offered by pvdb/gloc :wink:

schmilblick commented 6 years ago

Chiming in on this: I have a repo where both cloc and loc badly miscount the number of lines of Python code (60k for cloc, 8.8k for loc, 93k in reality (!)). Let me know if you want some data regarding this, @cgag!

Edit: tokei seems to get the numbers 100% correct