CLOC does not support using shell-like globs to exclude files.
CLOC supports excluding files or directories with a regular expression. However, it accepts only one expression for files and a separate one for directories. Some of our filters (like .git*) exclude both directories (.git/) and files (.gitignore), and regular expressions are complicated: blindly squashing a potentially large list of them into a single expression with a chain of OR operators could lead to all sorts of errors. In the end I decided it was not worth the risk.
CLOC supports specifying files to ignore with an external file, but only for literal paths relative to the working directory.
All in all, I decided to go for a direct approach: the script itself compiles a list of files to exclude and passes it to CLOC through the external exclusion list file.
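As a rough illustration of the exclusion list method, here is a minimal sketch. The helper names (write_exclude_list, cloc_command, run_cloc) and the file name cloc_exclude.txt are hypothetical and not taken from this PR; the only CLOC feature assumed is its --exclude-list-file option. CLOC is run from the repository root so that the relative paths in the list resolve.

```python
import subprocess
import tempfile
from pathlib import Path


def write_exclude_list(ignored_paths, list_file):
    """Write one literal path per line to the exclusion list file."""
    text = "".join(f"{path}\n" for path in ignored_paths)
    Path(list_file).write_text(text, encoding="utf-8")


def cloc_command(list_file, target="."):
    """Build the cloc argument vector using --exclude-list-file."""
    return ["cloc", f"--exclude-list-file={list_file}", target]


def run_cloc(repo_root, ignored_paths):
    """Run cloc from repo_root so relative paths in the list resolve."""
    with tempfile.TemporaryDirectory() as tmp:
        list_file = Path(tmp) / "cloc_exclude.txt"  # hypothetical file name
        write_exclude_list(ignored_paths, list_file)
        result = subprocess.run(
            cloc_command(list_file),
            cwd=repo_root,
            capture_output=True,
            text=True,
            check=True,
        )
    return result.stdout
```

Because the list lives in a file, its length is not constrained by the limit on command-line argument size.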
Path forward
I could have chosen one of two ways:
Use the pre-existing globs in ignored_files and GNU utilities to create the list of files.
Turn ignored_files into a list of regular expressions and use the re module to find the files.
The first approach is faster (the search runs inside a series of C programs), but it is less portable (there is no direct replacement on Windows, although it's possible to get gnu-utils there using various methods) and less powerful than regular expressions, so I chose to go with the latter.
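To make the second option concrete, here is one possible sketch of matching names against a list of compiled regular expressions. It uses fnmatch.translate from the standard library to turn existing globs into regexes, which is an assumption made for illustration: this PR may well write the expressions by hand. The names compile_globs and is_ignored are hypothetical.

```python
import fnmatch
import re


def compile_globs(globs):
    """Translate shell-style globs into compiled regular expressions."""
    return [re.compile(fnmatch.translate(glob)) for glob in globs]


def is_ignored(name, patterns):
    """Return True if any pattern matches the whole name."""
    return any(pattern.match(name) for pattern in patterns)


patterns = compile_globs([".git*", "*.md"])
for name in [".git", ".gitignore", "README.md", "main.py"]:
    print(name, is_ignored(name, patterns))
```

A single entry such as .git* covers both the .git directory and the .gitignore file here, which is the behaviour the one-expression-per-kind CLOC options made awkward. Note that the translated * also matches path separators, so hand-written expressions give finer control.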
Problems
In the _find_ignored_files function we have three list comprehensions and a double for loop within a big for loop over all children of the repository root. This has quite a large performance impact.
Since we now have regular expressions, we can't build a string from the elements and pass it to wc directly, so we have to find all files within the repository with git ls-files and then remove the ignored ones afterwards with a loop over the list of files. Also, for a repository with a lot of files, the list of ignored files could easily outgrow the shell's limit on argument list size.
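The post-filtering step, together with one possible way around the argument size limit, can be sketched as follows. This is not the code in this PR: tracked_files, drop_ignored, chunked and wc_commands are hypothetical names, and the batch size of 1000 is an arbitrary assumption. The idea is to split the remaining files into batches so that no single wc invocation receives an oversized argument list.

```python
import re
import subprocess


def tracked_files(repo_root):
    """List tracked files via `git ls-files -z` (NUL-separated output)."""
    out = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_root,
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return [path for path in out.split("\0") if path]


def drop_ignored(files, patterns):
    """Keep only the files that no ignore pattern matches."""
    return [f for f in files if not any(p.search(f) for p in patterns)]


def chunked(items, size):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def wc_commands(files, batch_size=1000):
    """Build one `wc -l` argument vector per batch of files."""
    for batch in chunked(files, batch_size):
        yield ["wc", "-l", "--", *batch]
```

Each command can then be run with subprocess.run from the repository root and the per-batch totals summed. Batching trades a few extra process launches for staying under the limit.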
Conclusions
In practice I observed the decrease in performance to be negligible (especially with cloc), at least with our dataset (active repositories of WEEE Open), and the increase in functionality to be worth it. However, the performance regressions described above are real and worth discussing.
If merged, should close #9.