AlDanial / cloc

cloc counts blank lines, comment lines, and physical lines of source code in many programming languages.
GNU General Public License v2.0

Exclude linguist-vendored and linguist-generated in .gitattributes when using --no-autogen #722

Closed BrianL-STCU closed 1 year ago

BrianL-STCU commented 1 year ago

GitHub supports specifying vendor-created code and generated code in a repository using Linguist extensions in the .gitattributes file, which it uses to optimally fold files in diff views. It would be nice if these were excluded with the --no-autogen option, since this is already being maintained in repos.

Example

src/Project/OpenAPIs/* linguist-generated=true
src/Project/Models/*.cs linguist-generated=true
**/packages/** linguist-vendored
**/lib/** linguist-vendored
"**/Service References/**" linguist-generated=true
"**/Web References/**" linguist-generated=true
AlDanial commented 1 year ago

Can you recommend a repo to clone that has such entries?

BrianL-STCU commented 1 year ago

I've got a couple at brianary/webcoder or brianary/scripts, but there are a lot of others.

BrianL-STCU commented 1 year ago

It looks like linguist-generated=true can be just linguist-generated now.
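
For illustration, entries like the ones in the example could be read along these lines. This is a minimal Python sketch, not cloc's actual implementation (cloc itself is written in Perl). The function name `parse_gitattributes` is hypothetical, and the sketch assumes that a bare attribute and `=true` mean the same thing, while `=false` or a leading `-` switches the attribute off.

```python
import shlex

# Attributes of interest; everything else in .gitattributes is ignored here.
LINGUIST_ATTRS = ("linguist-generated", "linguist-vendored")


def parse_gitattributes(text):
    """Return (pattern, attribute) pairs for generated or vendored entries."""
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        # shlex.split keeps a double-quoted pattern with spaces in one piece.
        pattern, *attrs = shlex.split(line)
        for attr in attrs:
            name, _, value = attr.partition("=")
            # Assumption: bare "linguist-generated" equals "=true".
            if name in LINGUIST_ATTRS and value in ("", "true"):
                entries.append((pattern, name))
    return entries


sample = '''
src/Project/Models/*.cs linguist-generated=true
**/lib/** linguist-vendored
"**/Web References/**" linguist-generated
'''
for pattern, attr in parse_gitattributes(sample):
    print(attr, pattern)
```

The quoted pattern comes back without its surrounding quotes, so later matching code only ever sees the bare glob.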

AlDanial commented 1 year ago

Take a shot at the update I just pushed; it has only been tested on Linux so far.

brianary commented 1 year ago
I pulled down the repo and ran `./cloc`...

```text
     789 text files.
     700 unique files.
      96 files ignored.

github.com/AlDanial/cloc v 1.97  T=0.19 s (3692.9 files/s, 367610.3 lines/s)
----------------------------------------------------------------------------------------
Language                              files          blank        comment           code
----------------------------------------------------------------------------------------
Perl                                      8           2327           5109          26465
YAML                                    362             12            364           8770
Markdown                                  3            305             40           2860
TableGen                                  1            241            128           1124
ANTLR Grammar                             2            200             59           1012
R                                         3             95            312            698
C/C++ Header                              1            191            780            617
C++                                      11            132            183            603
Forth                                     2             17             84            529
TypeScript                                4             53             39            416
Logtalk                                   1             59             57            368
C                                         8            111             72            359
Windows Message File                      2             89              9            348
TeX                                       2             36             64            265
CMake                                     1             36             40            261
Racket                                    1             32            159            247
make                                      4             85            159            247
SVG                                       1             19              4            242
Glade                                     1              0             22            232
DIET                                      1             10              4            230
Windows Resource File                     1             42             45            218
Assembly                                  4             40            142            205
Linker Script                             1              3             60            197
CSV                                       1              0              0            158
ReScript                                  1             31             43            157
Juniper Junos                             1              0              8            129
Zig                                       1              2             10            128
Idris                                     2             38             90            117
ECPP                                      1             26             34            116
Prolog                                    2             43              8            114
Text                                     17             14              0            113
Ruby                                      1             11             30            111
Hoon                                      1              0             10            110
Imba                                      1             71             30            108
Dockerfile                                3             18             13            106
P4                                        1             28             33            102
Thrift                                    1             57            134             97
Bourne Shell                              5             14             10             96
Bourne Again Shell                        1             11             19             92
Xtend                                     1             17             52             91
BizTalk Orchestration                     1              1              3             90
Lean                                      1             36             20             90
Odin                                      1             32             56             90
kvlang                                    1             13              2             86
Smalltalk                                 2             19              5             85
Vuejs Component                           1             10              2             85
Java                                      5             13             28             81
Circom                                    1             34             26             80
Scheme                                    1             10             18             78
Constraint Grammar                        1             12             11             77
WGSL                                      1              5              8             76
Cairo                                     1             17              9             75
MXML                                      1             23              5             74
MATLAB                                    3              3             11             68
Oracle PL/SQL                             1              0             15             67
Haml                                      1              5             16             66
Pony                                      1             23             43             66
Visual Basic                              2             44             55             66
Swift                                     1             23             13             65
Fish Shell                                1             14             47             62
NetLogo                                   1             17             14             62
RAML                                      1              5              3             62
Verilog-SystemVerilog                     1              4             20             62
SCSS                                      2             16              8             59
Clean                                     1             10             30             58
Qt Linguist                               1              0              4             57
SaltStack                                 1              6              1             55
Containerfile                             1              5              2             53
tspeg                                     2             26             31             53
Pest                                      1             16              9             51
Meson                                     1             13              9             48
JSON                                      3              0              0             46
Fennel                                    1              6              3             44
JCL                                       1              0             18             44
HCL                                       1             14             36             43
Nim                                       1              5             13             43
Nix                                       1             15             15             43
OpenSCAD                                  1             18              3             42
Go                                        3             14             41             40
HolyC                                     1              4             14             40
Metal                                     1             13             10             40
ASP.NET                                   2             16             21             39
Raku                                      1             19             12             39
SQL                                       3             24             36             39
Agda                                      1             10              3             38
Ring                                      1             11             11             38
Web Services Description                  1              4              0             36
COBOL                                     3              5              8             35
Haskell                                   4             23             26             35
RobotFramework                            1              9              5             35
X++                                       1              8             16             35
AsciiDoc                                  1             17             27             34
EJS                                       1              0             11             34
Godot Scene                               1              4              8             34
Puppet                                    4              2              8             34
IPL                                       1              6             15             33
PO File                                   1              9             18             33
GLSL                                      1             10             14             32
WebAssembly                               1              8             20             32
Mustache                                  2              5              7             31
Specman e                                 2              4             12             31
Squirrel                                  1              6              4             31
Python                                    7             16             54             30
Apex Class                                1              3              6             28
C# Designer                               1              8             22             28
Cake Build Script                         1              6              6             28
Cucumber                                  1              3              2             28
Drools                                    1              7             16             28
Freemarker Template                       1              0              2             27
Bazel                                     1              7              1             26
PHP                                       2             11             13             26
Umka                                      1              7              5             26
LFE                                       1             15             21             25
Objective-C                               1             11             11             25
Scala                                     1              8              8             25
Visual Studio Solution                    1              0              1             25
Brainfuck                                 1              1              3             24
Fortran 90                                6              1             18             24
Haxe                                      1             26             99             24
Lisp                                      1              5             26             24
C#                                        4              9              7             23
peg.js                                    1             18              9             23
Blade                                     1             10              5             22
GraphQL                                   2              3              6             22
JSON5                                     1              0              4             22
Mathematica                               2             24             17             22
PEG                                       1             24              9             22
Stata                                     1              7              7             22
TOML                                      1              8              4             22
Gleam                                     1              6             41             21
Jupyter Notebook                          1              0            126             21
Smarty                                    1              1              1             21
Godot Resource                            1              2              8             20
BrightScript                              1              0              3             19
Igor Pro                                  1              4              6             19
PL/M                                      1              1              5             19
Solidity                                  1              0              2             19
TTCN                                      1             11             16             19
XSLT                                      2              0              4             19
peggy                                     1             25              7             19
Jai                                       1              4              7             18
Pascal                                    4              4             15             18
Windows Module Definition                 1              1              1             18
Gradle                                    1              0              2             17
Mojo                                      1              6              4             17
Razor                                     2              6              7             17
TEAL                                      1             16             37             17
Futhark                                   1              7             35             16
Logos                                     2              6              3             16
Carbon                                    1             11              6             15
DenizenScript                             1              0              6             15
Gencat NLS                                1              1              4             15
JavaScript                                5              3              0             15
Lem                                       1             11             24             15
Pig Latin                                 1             19             40             15
SWIG                                      1              4              4             15
TNSDL                                     1              5              3             15
Embedded Crystal                          1              4              4             14
F#                                        1              3              6             14
Finite State Language                     1              7              3             14
IDL                                       2             25              7             14
Derw                                      1              2              5             13
SugarSS                                   1              5              4             13
Velocity Template Language                1              0             20             13
Starlark                                  1              3              4             11
Nunjucks                                  1              0              6             10
Slim                                      1              0              3             10
reStructuredText                          1              6              4             10
Godot Shaders                             1              3              3              9
Kotlin                                    1              0              3              9
Mako                                      1              3              8              9
Properties                                1              0             15              9
Svelte                                    1              2              2              9
Vala                                      1              0              5              9
Visual Studio Module                      1              3              5              9
XML                                       3              0              5              9
F# Script                                 1              1              2              8
FXML                                      1              2              3              8
SparForte                                 1              6              8              8
WXML                                      1              3              2              8
C# Generated                              1              2             16              7
Elixir                                    1              3             10              7
Fortran 77                                2              1              8              7
INI                                       1              2              3              7
Lua                                       3              9             33              7
Chapel                                    1              7             35              6
VB for Applications                       1              4              2              6
HTML EEx                                  1              1              4              5
Julia                                     2              4             15              5
PL/I                                      1              0              7              5
PlantUML                                  1              2              5              5
APL                                       1              3              6              4
Arduino Sketch                            1              1              5              4
ReasonML                                  1              2              8              4
Rmd                                       1             10             19              4
WXSS                                      1              0              0              4
Elm                                       2              0              5              3
Flatbuffers                               1              1              2              3
Groovy                                    1              0              3              3
LLVM IR                                   1              2              6              3
Literate Idris                            1              2              2              3
NAnt script                               1              1              0              3
OCaml                                     1              0              5              3
ProGuard                                  1              7             14              3
Tcl/Tk                                    1              1              2              3
dhall                                     1              6             17              3
ColdFusion                                1              1              2              2
DOS Batch                                 1              1              2              2
Focus                                     1              1              2              1
MUMPS                                     1              0              2              1
XQuery                                    1              0              1              1
xBase                                     1              0              9              1
----------------------------------------------------------------------------------------
SUM:                                    700           5804          10594          53284
----------------------------------------------------------------------------------------
```

I added this .gitattributes file:

./cloc linguist-generated=true
./Unix/cloc linguist-vendored
Then I ran `./cloc` again, which seemed to add one ignored file, but didn't exclude any lines of Perl code...

```text
     790 text files.
     700 unique files.
      97 files ignored.

github.com/AlDanial/cloc v 1.97  T=0.20 s (3578.0 files/s, 356176.6 lines/s)
----------------------------------------------------------------------------------------
Language                              files          blank        comment           code
----------------------------------------------------------------------------------------
Perl                                      8           2327           5109          26465
YAML                                    362             12            364           8770
Markdown                                  3            305             40           2860
TableGen                                  1            241            128           1124
ANTLR Grammar                             2            200             59           1012
R                                         3             95            312            698
C/C++ Header                              1            191            780            617
C++                                      11            132            183            603
Forth                                     2             17             84            529
TypeScript                                4             53             39            416
Logtalk                                   1             59             57            368
C                                         8            111             72            359
Windows Message File                      2             89              9            348
TeX                                       2             36             64            265
CMake                                     1             36             40            261
Racket                                    1             32            159            247
make                                      4             85            159            247
SVG                                       1             19              4            242
Glade                                     1              0             22            232
DIET                                      1             10              4            230
Windows Resource File                     1             42             45            218
Assembly                                  4             40            142            205
Linker Script                             1              3             60            197
CSV                                       1              0              0            158
ReScript                                  1             31             43            157
Juniper Junos                             1              0              8            129
Zig                                       1              2             10            128
Idris                                     2             38             90            117
ECPP                                      1             26             34            116
Prolog                                    2             43              8            114
Text                                     17             14              0            113
Ruby                                      1             11             30            111
Hoon                                      1              0             10            110
Imba                                      1             71             30            108
Dockerfile                                3             18             13            106
P4                                        1             28             33            102
Thrift                                    1             57            134             97
Bourne Shell                              5             14             10             96
Bourne Again Shell                        1             11             19             92
Xtend                                     1             17             52             91
BizTalk Orchestration                     1              1              3             90
Lean                                      1             36             20             90
Odin                                      1             32             56             90
kvlang                                    1             13              2             86
Smalltalk                                 2             19              5             85
Vuejs Component                           1             10              2             85
Java                                      5             13             28             81
Circom                                    1             34             26             80
Scheme                                    1             10             18             78
Constraint Grammar                        1             12             11             77
WGSL                                      1              5              8             76
Cairo                                     1             17              9             75
MXML                                      1             23              5             74
MATLAB                                    3              3             11             68
Oracle PL/SQL                             1              0             15             67
Haml                                      1              5             16             66
Pony                                      1             23             43             66
Visual Basic                              2             44             55             66
Swift                                     1             23             13             65
Fish Shell                                1             14             47             62
NetLogo                                   1             17             14             62
RAML                                      1              5              3             62
Verilog-SystemVerilog                     1              4             20             62
SCSS                                      2             16              8             59
Clean                                     1             10             30             58
Qt Linguist                               1              0              4             57
SaltStack                                 1              6              1             55
Containerfile                             1              5              2             53
tspeg                                     2             26             31             53
Pest                                      1             16              9             51
Meson                                     1             13              9             48
JSON                                      3              0              0             46
Fennel                                    1              6              3             44
JCL                                       1              0             18             44
HCL                                       1             14             36             43
Nim                                       1              5             13             43
Nix                                       1             15             15             43
OpenSCAD                                  1             18              3             42
Go                                        3             14             41             40
HolyC                                     1              4             14             40
Metal                                     1             13             10             40
ASP.NET                                   2             16             21             39
Raku                                      1             19             12             39
SQL                                       3             24             36             39
Agda                                      1             10              3             38
Ring                                      1             11             11             38
Web Services Description                  1              4              0             36
COBOL                                     3              5              8             35
Haskell                                   4             23             26             35
RobotFramework                            1              9              5             35
X++                                       1              8             16             35
AsciiDoc                                  1             17             27             34
EJS                                       1              0             11             34
Godot Scene                               1              4              8             34
Puppet                                    4              2              8             34
IPL                                       1              6             15             33
PO File                                   1              9             18             33
GLSL                                      1             10             14             32
WebAssembly                               1              8             20             32
Mustache                                  2              5              7             31
Specman e                                 2              4             12             31
Squirrel                                  1              6              4             31
Python                                    7             16             54             30
Apex Class                                1              3              6             28
C# Designer                               1              8             22             28
Cake Build Script                         1              6              6             28
Cucumber                                  1              3              2             28
Drools                                    1              7             16             28
Freemarker Template                       1              0              2             27
Bazel                                     1              7              1             26
PHP                                       2             11             13             26
Umka                                      1              7              5             26
LFE                                       1             15             21             25
Objective-C                               1             11             11             25
Scala                                     1              8              8             25
Visual Studio Solution                    1              0              1             25
Brainfuck                                 1              1              3             24
Fortran 90                                6              1             18             24
Haxe                                      1             26             99             24
Lisp                                      1              5             26             24
C#                                        4              9              7             23
peg.js                                    1             18              9             23
Blade                                     1             10              5             22
GraphQL                                   2              3              6             22
JSON5                                     1              0              4             22
Mathematica                               2             24             17             22
PEG                                       1             24              9             22
Stata                                     1              7              7             22
TOML                                      1              8              4             22
Gleam                                     1              6             41             21
Jupyter Notebook                          1              0            126             21
Smarty                                    1              1              1             21
Godot Resource                            1              2              8             20
BrightScript                              1              0              3             19
Igor Pro                                  1              4              6             19
PL/M                                      1              1              5             19
Solidity                                  1              0              2             19
TTCN                                      1             11             16             19
XSLT                                      2              0              4             19
peggy                                     1             25              7             19
Jai                                       1              4              7             18
Pascal                                    4              4             15             18
Windows Module Definition                 1              1              1             18
Gradle                                    1              0              2             17
Mojo                                      1              6              4             17
Razor                                     2              6              7             17
TEAL                                      1             16             37             17
Futhark                                   1              7             35             16
Logos                                     2              6              3             16
Carbon                                    1             11              6             15
DenizenScript                             1              0              6             15
Gencat NLS                                1              1              4             15
JavaScript                                5              3              0             15
Lem                                       1             11             24             15
Pig Latin                                 1             19             40             15
SWIG                                      1              4              4             15
TNSDL                                     1              5              3             15
Embedded Crystal                          1              4              4             14
F#                                        1              3              6             14
Finite State Language                     1              7              3             14
IDL                                       2             25              7             14
Derw                                      1              2              5             13
SugarSS                                   1              5              4             13
Velocity Template Language                1              0             20             13
Starlark                                  1              3              4             11
Nunjucks                                  1              0              6             10
Slim                                      1              0              3             10
reStructuredText                          1              6              4             10
Godot Shaders                             1              3              3              9
Kotlin                                    1              0              3              9
Mako                                      1              3              8              9
Properties                                1              0             15              9
Svelte                                    1              2              2              9
Vala                                      1              0              5              9
Visual Studio Module                      1              3              5              9
XML                                       3              0              5              9
F# Script                                 1              1              2              8
FXML                                      1              2              3              8
SparForte                                 1              6              8              8
WXML                                      1              3              2              8
C# Generated                              1              2             16              7
Elixir                                    1              3             10              7
Fortran 77                                2              1              8              7
INI                                       1              2              3              7
Lua                                       3              9             33              7
Chapel                                    1              7             35              6
VB for Applications                       1              4              2              6
HTML EEx                                  1              1              4              5
Julia                                     2              4             15              5
PL/I                                      1              0              7              5
PlantUML                                  1              2              5              5
APL                                       1              3              6              4
Arduino Sketch                            1              1              5              4
ReasonML                                  1              2              8              4
Rmd                                       1             10             19              4
WXSS                                      1              0              0              4
Elm                                       2              0              5              3
Flatbuffers                               1              1              2              3
Groovy                                    1              0              3              3
LLVM IR                                   1              2              6              3
Literate Idris                            1              2              2              3
NAnt script                               1              1              0              3
OCaml                                     1              0              5              3
ProGuard                                  1              7             14              3
Tcl/Tk                                    1              1              2              3
dhall                                     1              6             17              3
ColdFusion                                1              1              2              2
DOS Batch                                 1              1              2              2
Focus                                     1              1              2              1
MUMPS                                     1              0              2              1
XQuery                                    1              0              1              1
xBase                                     1              0              9              1
----------------------------------------------------------------------------------------
SUM:                                    700           5804          10594          53284
----------------------------------------------------------------------------------------
```
AlDanial commented 1 year ago

The failure was caused by the spurious ./ in the path defined by the pattern. Once I expanded it to a canonical path I got the expected behavior.
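
To make the `./` point concrete: a pattern such as `./cloc` will not compare equal to a file recorded as `cloc` unless both are reduced to one canonical spelling first. The snippet below is a Python sketch of that idea only, not the Perl code in cloc. The helper name `canonical` is hypothetical, and converting backslashes is an extra assumption aimed at Windows-style file paths (inside a .gitattributes pattern a backslash is an escape character, so this would only be safe for file paths).

```python
import posixpath


def canonical(path):
    """Reduce a relative path to one spelling: forward slashes, no leading './'."""
    # Assumption: backslashes are directory separators, as in Windows file paths.
    return posixpath.normpath(path.replace("\\", "/"))


print(canonical("./cloc"))          # cloc
print(canonical("./Unix/cloc"))     # Unix/cloc
print(canonical(".\\Unix\\cloc"))   # Unix/cloc
```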

However...

I took a closer look at the Text::Glob module I vendored in for this issue and saw that it is rudimentary: it won't handle ** recursive matches and is also confused by quoted paths. I'll need to come up with my own way to deal with these, meaning it will take a while.
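
As a rough picture of what handling `**` involves, a pattern can be translated into a regular expression in which `*` and `?` stop at `/` while `**` may cross directories. This is a simplified Python sketch under stated assumptions, not the approach cloc ended up using: the names `glob_to_regex` and `matches` are hypothetical, patterns are assumed to be already unquoted, and git features such as character classes, escapes, and negation are left out.

```python
import re


def glob_to_regex(pattern):
    """Translate a simplified gitattributes-style glob into a compiled regex."""
    out = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**/", i):
            out.append("(?:.*/)?")   # zero or more leading directories
            i += 3
        elif pattern.startswith("**", i):
            out.append(".*")         # anything, including '/'
            i += 2
        elif pattern[i] == "*":
            out.append("[^/]*")      # anything within one path segment
            i += 1
        elif pattern[i] == "?":
            out.append("[^/]")       # one character, never '/'
            i += 1
        else:
            out.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(out))


def matches(pattern, path):
    """True if a '/'-separated relative path matches the pattern."""
    if "/" not in pattern:
        # A pattern without a slash is compared against the file name only.
        path = path.rsplit("/", 1)[-1]
    return glob_to_regex(pattern).fullmatch(path) is not None


print(matches("**/lib/**", "web/lib/jquery/jquery.js"))                         # True
print(matches("**/Service References/**", "app/Service References/Api/Ref.cs"))  # True
print(matches("src/Project/Models/*.cs", "src/Project/Models/sub/User.cs"))      # False
```

Stripping the quotes before matching keeps the space in `Service References` an ordinary literal character.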

AlDanial commented 1 year ago

ea192f1 is my next attempt at this; please give it a try.

brianary commented 1 year ago

Unfortunately, I still don't see a difference when adding that .gitattributes.

AlDanial commented 1 year ago

Are you using --git --no-autogen? Without these switches the first few lines of output I see are

github.com/AlDanial/cloc v 1.97  T=2.33 s (335.9 files/s, 169801.5 lines/s)
----------------------------------------------------------------------------------------
Language                              files          blank        comment           code
----------------------------------------------------------------------------------------
Perl                                     36          23797          49078         265666
Text                                     47           2330              0          17361
YAML                                    370             12            372           9075
Markdown                                  4            305             40           2862

However with --git --no-autogen the two Perl files are omitted:

github.com/AlDanial/cloc v 1.97  T=1.74 s (446.6 files/s, 208591.7 lines/s)
----------------------------------------------------------------------------------------
Language                              files          blank        comment           code
----------------------------------------------------------------------------------------
Perl                                     34          21624          44084         241342
Text                                     47           2330              0          17361
YAML                                    370             12            372           9075
Markdown                                  4            305             40           2862
brianary commented 1 year ago

I did have the wrong options, but retrying on Windows with the right ones, both still start with:

github.com/AlDanial/cloc v 1.97  T=1.68 s (420.2 files/s, 41535.8 lines/s)
----------------------------------------------------------------------------------------
Language                              files          blank        comment           code
----------------------------------------------------------------------------------------
Perl                                      8           2326           5113          26552
YAML                                    365             12            367           8830
Markdown                                  3            305             40           2860
TableGen                                  1            241            128           1124

Something seems way off for us to be getting such different results in either case.

AlDanial commented 1 year ago

Couple of things: I had a bunch of extra files in my dir that inflated my count. Also, the glitch had to do with file path separators; I wasn't treating \ and / consistently in this code branch. The latest push should fix it. Here are my results on Windows, using extra filters to narrow it down to just the 8 or 10 Perl files:

C:>perl cloc --git --no-autogen --by-file --include-lang=Perl .
     834 text files.
     721 unique files.
     839 files ignored.

github.com/AlDanial/cloc v 1.97  T=2.00 s (5.0 files/s, 26986.6 lines/s)
-----------------------------------------------------------------------------------
File                                            blank        comment           code
-----------------------------------------------------------------------------------
.\cloc-1.96.pl                                   1485           3656          12103
.\cloc-1.92.pl                                   1481           3628          11998
.\cloc.ok                                        1484           3635          11950
.\Unix\t\00_C.t                                     3              3           1282
.\Unix\t\01_opts.t                                 86             28            697
.\Unix\t\02_git.t                                  15              1            134
.\tests\inputs\issues\380\wrapper.pl               44             72             71
.\sqlite_formatter                                  5             15             42
.\tests\inputs\issues\420\mixed_case_ext.Pl         5              1             14
.\tests\inputs\diff\B\extra_file.pl                 0              0              2
-----------------------------------------------------------------------------------
SUM:                                             4608          11039          38293
-----------------------------------------------------------------------------------

If I don't include --no-autogen it won't apply the rules in .gitattributes and I'll see two extra Perl files:

C:>perl cloc --git --by-file --include-lang=Perl .
     834 text files.
     721 unique files.
     837 files ignored.

github.com/AlDanial/cloc v 1.97  T=2.02 s (5.9 files/s, 42343.3 lines/s)
-----------------------------------------------------------------------------------
File                                            blank        comment           code
-----------------------------------------------------------------------------------
.\cloc                                           1488           3666          12290
.\cloc-1.96.pl                                   1485           3656          12103
.\Unix\cloc                                       685           1328          12041
.\cloc-1.92.pl                                   1481           3628          11998
.\cloc.ok                                        1484           3635          11950
.\Unix\t\00_C.t                                     3              3           1282
.\Unix\t\01_opts.t                                 86             28            697
.\Unix\t\02_git.t                                  15              1            134
.\tests\inputs\issues\380\wrapper.pl               44             72             71
.\sqlite_formatter                                  5             15             42
.\tests\inputs\issues\420\mixed_case_ext.Pl         5              1             14
.\tests\inputs\diff\B\extra_file.pl                 0              0              2
-----------------------------------------------------------------------------------
SUM:                                             6781          16033          62624
-----------------------------------------------------------------------------------

Your file lists may be smaller, but if you do the two runs, one should produce two fewer files than the other.
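
As a cross-check, the two SUM lines above are consistent with exactly those two files being dropped. The short Python sketch below only redoes the arithmetic with numbers copied from the two `--by-file` listings; the variable names are illustrative.

```python
# (blank, comment, code) for the two files that appear only without --no-autogen.
excluded = {
    "cloc": (1488, 3666, 12290),
    "Unix/cloc": (685, 1328, 12041),
}
sum_git_only = (6781, 16033, 62624)    # SUM with --git
sum_no_autogen = (4608, 11039, 38293)  # SUM with --git --no-autogen

# Add up each column of the excluded files, then subtract from the larger run.
removed = tuple(sum(column) for column in zip(*excluded.values()))
remaining = tuple(total - gone for total, gone in zip(sum_git_only, removed))

print(removed)    # (2173, 4994, 24331)
print(remaining)  # (4608, 11039, 38293)
```

The remainder equals the `--no-autogen` SUM line in all three columns.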

BrianL-STCU commented 1 year ago

I think that has done it! :)

AlDanial commented 1 year ago

Glad to hear it, thanks for your patience with testing.