Open DarwinJS opened 4 years ago
Sure, that measurement can be taken and recorded--but I don't think cloc needs to do that as a new feature. Instead, use cloc to collect the names of the source files, then run a trivial add-up-the-file-sizes script. For example:
Step 1:
```
cloc --by-file --csv --out counts.csv directory
```
Step 2:
```
count_bytes counts.csv
```
where `count_bytes` is something like
```perl
#!/usr/bin/env perl
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    my $file = (split(','))[1];   # second CSV field is the filename
    next unless $file;
    next if $file eq "filename";  # skip the CSV header row
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s "$file";         # -s gives the file size in bytes
}
print "$bytes total bytes\n";
```
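For reference, the by-file CSV that the script parses puts the filename in the second column, which is why the script takes index 1 of the split. A run produces rows roughly like this (contents illustrative):

```
language,filename,blank,comment,code
Perl,directory/foo.pl,12,30,145
```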
A drawback to this method is that it won't work on archive (`.tar`, `.zip`, etc.) files; you'll need to expand these out first.
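For example, a tarball could be expanded into a scratch directory before Step 1 (paths here are illustrative):

```
mkdir scratch && tar -xf code.tar -C scratch
cloc --by-file --csv --out counts.csv scratch
```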
The solution can easily be adapted to count bytes in files after comments are removed.
Step 1:
```
cloc --strip-comments No_Comments --original-dir --by-file --csv --out counts.csv directory
```
Step 2:
```
count_bytes_no_comments counts.csv
```
where `count_bytes_no_comments` is
```perl
#!/usr/bin/env perl
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    my $file = (split(','))[1];   # second CSV field is the filename
    next unless $file;
    next if $file eq "filename";  # skip the CSV header row
    $file .= ".No_Comments";      # point at the stripped copy cloc wrote
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s "$file";
}
print "$bytes total bytes\n";
```
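Note that `--strip-comments` with `--original-dir` leaves the `*.No_Comments` copies next to the originals, so you may want to clean them up once counted, for example:

```
find directory -name '*.No_Comments' -delete
```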
There are several benefits to having it integrated.
I was also thinking of an implementation detail that might make this super-efficient. If cloc is already creating storage (like a variable) that contains the code with comments stripped, a size could be taken at that point and an overhead value added to produce a "file size" estimate. Maybe `PerFileOverheadBytes` could be a built-in default variable, overridable by users with a parameter, so they could tune it to their liking.
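A minimal sketch of that idea (`$per_file_overhead_bytes` and the stand-in data are hypothetical, not cloc internals):

```perl
#!/usr/bin/env perl
use warnings;
use strict;
# Hypothetical illustration, not cloc's actual code: if the
# comment-stripped source is already in memory, sizing it is a
# single length() call plus a tunable per-file overhead.
my $per_file_overhead_bytes = 512;   # made-up default; imagine it user-overridable
my @stripped_lines = ('my $x = 1;', 'print $x;');   # stand-in for stripped code
my $stripped_text  = join("\n", @stripped_lines) . "\n";
my $size_estimate  = length($stripped_text) + $per_file_overhead_bytes;
print "estimated size: $size_estimate bytes\n";
```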
I was also thinking of building a CI plugin around this similar to these: https://gitlab.com/guided-explorations/ci-cd-plugin-extensions.
This seems to be available, basically, via the `--categorized` arg. Perhaps not in-line in the report, or quite as "easy" to get at, but still available if a more generic script-oriented (and/or Unix-philosophy-aligned) approach isn't sufficient.
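For example:

```
cloc --categorized=sizes.txt directory
```

Per the cloc documentation, `sizes.txt` then lists each categorized file's size in bytes alongside its detected language and name, which a small script could sum.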
The original requester asked for "by file volume in MB," which is already given a couple of ways for a given file on Unix:

```
$ ls -l ./langs_includesec_audited.txt | awk '{ print $5 }'
221
```

(in bytes), or

```
$ du -sh ./langs_includesec_audited.txt
4.0K    ./langs_includesec_audited.txt
```

(human-readable, rounded to the nearest unit).
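To total a whole tree rather than a single file, GNU du can report the summed apparent size in bytes directly (assuming GNU coreutils):

```
$ du -sb ./directory
```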
For anything but a massive-scale use case, running cloc and then running a script that totals the sizes of all the files afterward would work fine.
IMHO (similar to #798) this issue seems to be asking for something that other Unix tools already do. My vote is for @AlDanial to spend his valuable time adding some other awesome cloc feature or fixing known bugs!
Some commercial security scanning tools now charge by file volume in MB. I am wondering if that measurement could be taken and reported. I am thinking just raw file size, not any attempt to estimate actual code versus comments (unless it is super easy).
Maybe a further version could rewrite files without comments and get an estimate of true code MB, but I'd be happy with just a raw number for a "minimum viable feature" release :)