AlDanial / cloc

cloc counts blank lines, comment lines, and physical lines of source code in many programming languages.
GNU General Public License v2.0

Count File Size (MB) #504

Open · DarwinJS opened this issue 4 years ago

DarwinJS commented 4 years ago

Some commercial security scanning tools now charge by file volume in MB. I am wondering if that measurement could be taken and reported. I am thinking just raw file size - not any attempt to estimate actual code versus comments (unless it is super easy).

Maybe a further version could rewrite files without comments and get an estimate of true code MB - but I'd be happy with just a raw number for a "minimum viable feature" release :)

AlDanial commented 4 years ago

Sure, that measurement can be taken and recorded, but I don't think cloc needs to do that as a new feature. Instead, use cloc to collect the names of the source files, then run a trivial add-up-the-file-sizes script. For example:

Step 1: cloc --by-file --csv --out counts.csv directory
Step 2: count_bytes counts.csv

where count_bytes is something like

#!/usr/bin/env perl
# count_bytes: sum the on-disk sizes of files listed in a
# cloc --by-file --csv report (filename is the second CSV column).
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    chomp;
    my $file = (split(','))[1];   # language,filename,blank,comment,code
    next unless $file;
    next if $file eq "filename";  # skip the CSV header row
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s $file;           # -s gives the file size in bytes
}
print "$bytes total bytes\n";

A drawback to this method is that it won't work on archive (.tar, .zip, etc) files; you'll need to expand these out first.
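For those without Perl handy, the same step 2 can be sketched in plain POSIX shell. This is a self-contained demo: the files and counts.csv below are hand-written stand-ins for real cloc output, and, like the Perl version, the naive parse assumes filenames contain no commas.

```shell
# Self-contained sketch: total the byte sizes of files named in a cloc
# --by-file --csv report. The demo files and counts.csv are stand-ins
# for real cloc output; filenames with commas would break this parse.
mkdir -p demo
printf 'hello\n'  > demo/a.txt   # 6 bytes
printf 'worlds\n' > demo/b.txt   # 7 bytes
cat > counts.csv <<'EOF'
language,filename,blank,comment,code
Text,demo/a.txt,0,0,1
Text,demo/b.txt,0,0,1
EOF
total=0
while IFS=, read -r lang file rest; do
    [ "$file" = "filename" ] && continue   # skip the CSV header row
    [ -f "$file" ] || continue             # skip missing files
    total=$((total + $(wc -c < "$file")))
done < counts.csv
echo "$total total bytes"                  # prints: 13 total bytes
```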

The solution can easily be adapted to count bytes in files after comments are removed:

Step 1: cloc --strip-comments No_Comments --original-dir --by-file --csv --out counts.csv directory
Step 2: count_bytes_no_comments counts.csv

where count_bytes_no_comments is

#!/usr/bin/env perl
# count_bytes_no_comments: sum the sizes of the comment-stripped copies
# that cloc --strip-comments No_Comments --original-dir wrote next to
# each original file.
use warnings;
use strict;
my $bytes = 0;
while (<>) {
    chomp;
    my $file = (split(','))[1];   # language,filename,blank,comment,code
    next unless $file;
    next if $file eq "filename";  # skip the CSV header row
    $file .= ".No_Comments";      # cloc's suffix for stripped copies
    if (!-e $file) {
        print "can't read $file, skipping\n";
        next;
    }
    $bytes += -s $file;
}
print "$bytes total bytes\n";
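The comment-stripped variant translates to shell the same way. Again a self-contained sketch: the .No_Comments copy is written by hand here to stand in for what cloc --strip-comments would leave on disk.

```shell
# Self-contained sketch of step 2 for the comment-stripped case. The
# src/a.py.No_Comments file is written by hand to stand in for what
# cloc --strip-comments No_Comments --original-dir would produce.
mkdir -p src
printf '# a comment\nprint(1)\n' > src/a.py
printf 'print(1)\n' > src/a.py.No_Comments   # 9 bytes
cat > counts.csv <<'EOF'
language,filename,blank,comment,code
Python,src/a.py,0,1,1
EOF
total=0
while IFS=, read -r lang file rest; do
    [ "$file" = "filename" ] && continue     # skip the CSV header row
    stripped="$file.No_Comments"             # cloc's suffix for stripped copies
    [ -f "$stripped" ] || continue
    total=$((total + $(wc -c < "$stripped")))
done < counts.csv
echo "$total total bytes without comments"   # prints: 9 total bytes without comments
```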

DarwinJS commented 4 years ago

There are several benefits to having it integrated:

I was also thinking of an implementation detail that might make this very efficient: if cloc already holds the comment-stripped code in memory (for example in a variable), its size could be taken at that point, with a per-file overhead value added to produce a "file size" estimate. Something like PerFileOverheadBytes could be a built-in default, overridable by a parameter so users can tune it to their liking.
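As a rough sketch of the estimate being proposed (everything here, including the PER_FILE_OVERHEAD_BYTES name, is hypothetical; no such variable exists in cloc today): take the byte length of the in-memory stripped text and add a tunable per-file overhead.

```shell
# Hypothetical sketch only: estimate an on-disk "file size" from in-memory
# comment-stripped text plus a tunable per-file overhead. Neither the
# variable name nor the mechanism exists in cloc today.
PER_FILE_OVERHEAD_BYTES=${PER_FILE_OVERHEAD_BYTES:-0}
stripped='int main(void){return 0;}'         # stand-in for stripped source
est=$(( $(printf '%s' "$stripped" | wc -c) + PER_FILE_OVERHEAD_BYTES ))
echo "estimated $est bytes"                  # prints: estimated 25 bytes
```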

I was also thinking of building a CI plugin around this similar to these: https://gitlab.com/guided-explorations/ci-cd-plugin-extensions.

cdeszaq commented 5 months ago

This seems to be available, basically, via the --categorized arg. It's perhaps not in-line in the report, nor quite as "easy" to get at, but it is available if the more generic script-oriented (and/or Unix-philosophy-aligned) approach isn't sufficient.
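For reference, recent cloc releases document --categorized=FILE as saving file sizes in bytes along with each file's language and name. Assuming the byte count is the first whitespace-delimited column (check your version's output), totaling it is a one-line awk job; the sizes.txt below is a hand-written stand-in for that report.

```shell
# Hand-written stand-in for a cloc --categorized=sizes.txt report; the
# assumed layout (size, language, name) may differ between cloc versions.
cat > sizes.txt <<'EOF'
221 Text ./langs_includesec_audited.txt
1024 Perl ./count_bytes
EOF
awk '{ bytes += $1 } END { print bytes, "total bytes" }' sizes.txt
# prints: 1245 total bytes
```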

includesec-erik commented 4 months ago

The original requester asked for sizes "by file volume in MB", which Unix already provides a couple of ways for a given file:

$ ls -l ./langs_includesec_audited.txt | awk '{ print $5 }'
221    # size in bytes

or

$ du -sh ./langs_includesec_audited.txt
4.0K    ./langs_includesec_audited.txt    # human-readable, rounded up to block size

For anything but a massive-scale use case, running cloc and then running a script that sums the sizes of the reported files would work fine.

IMHO (similar to #798) this issue seems to be asking for something that other Unix tools already do. My vote is for @AlDanial to spend his valuable time adding some other awesome cloc feature or fixing known bugs!