wc word count incorrect

deadPix3l commented 5 years ago

Currently darkbox.wc counts whitespace characters ('\n', '\t' and ' ') to determine word breaks. This produces a word count that is much higher than the one GNU wc reports.

Figure out how GNU wc determines word boundaries.

vesche commented 5 years ago

Messed with this for a bit, found some wc.c source code here.

The important bit is this:

/* Return true if C is a valid word constituent */
static int
isword (unsigned char c)
{
    return isalpha (c);
}

/* Increase character and, if necessary, line counters */
#define COUNT(c)       \
    ccount++;        \
    if ((c) == '\n') \
        lcount++;

/* Get next word from the input stream. Return 0 on end
of file or error condition. Return 1 otherwise. */
int
getword (FILE *fp)
{
    int c;
    int word = 0;

    if (feof (fp))
        return 0;

    while ((c = getc (fp)) != EOF)
        {
        if (isword (c))
            {
            wcount++;
            break;
            }
        COUNT (c);
        }

    for (; c != EOF; c = getc (fp))
        {
        COUNT (c);
        if (!isword (c))
            break;
        }

    return c != EOF;
}

The problem with your current code:

if curr_char in b'\n\t ': file_metrics['words'] += 1

is that it increments every time it sees '\n', '\t' or ' ', so if there are several of these in a row it keeps incrementing and the word count comes out too high.
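To make the overcount concrete, here's a minimal sketch comparing the two approaches on a small byte string. The function names are made up for illustration and are not part of darkbox.

```python
WHITESPACE = b"\n\t "

def count_whitespace_chars(data: bytes) -> int:
    """Mimic the current approach: add one 'word' per whitespace byte."""
    return sum(1 for byte in data if byte in WHITESPACE)

def count_word_starts(data: bytes) -> int:
    """Count each transition from whitespace (or start of input) into a word."""
    words = 0
    in_word = False
    for byte in data:
        if byte in WHITESPACE:
            in_word = False
        elif not in_word:
            # First non-whitespace byte after a gap starts a new word.
            in_word = True
            words += 1
    return words

# Three words separated by runs of whitespace.
sample = b"one  two\n\n\tthree\n"
print(count_whitespace_chars(sample))
print(count_word_starts(sample))
```

The first function counts every separator byte, so doubled spaces and blank lines each add to the total. The second only counts when a word begins.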

That's why the above C is kinda gross. It translates roughly to this in Python (using your whitespace check in place of isalpha):

curr_char = f.read(1)
while curr_char != b'':

    while curr_char != b'':
        if curr_char not in b'\n\t ':
            file_metrics['words'] += 1
            break

        file_metrics['bytes'] += 1
        if curr_char == b'\n':
            file_metrics['lines'] += 1

        curr_char = f.read(1)

    while curr_char != b'':
        file_metrics['bytes'] += 1
        if curr_char == b'\n':
            file_metrics['lines'] += 1

        if curr_char in b'\n\t ':
            break

        curr_char = f.read(1)

    curr_char = f.read(1)

The above code works perfectly on files that aren't binary. It's closer for binary files, but not exact. I'm not sure why, but maybe this helps.
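One possible cause of the binary mismatch, offered as a guess rather than a confirmed diagnosis: C's isspace() in the "C" locale also accepts '\v', '\f' and '\r', so a wc built on it would split words on those bytes as well. Here's a sketch that makes the whitespace set a parameter so both behaviours can be compared. The name count_stream and its parameters are hypothetical, not darkbox code.

```python
import io

# Bytes that C's isspace() accepts in the "C" locale.
C_ISSPACE = b" \t\n\v\f\r"

def count_stream(stream, whitespace=C_ISSPACE, chunk_size=65536):
    """Count lines, words and bytes from a binary stream in one pass."""
    metrics = {"lines": 0, "words": 0, "bytes": 0}
    in_word = False  # carried across chunks so words spanning chunks count once
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        metrics["bytes"] += len(chunk)
        metrics["lines"] += chunk.count(b"\n")
        for byte in chunk:
            if byte in whitespace:
                in_word = False
            elif not in_word:
                in_word = True
                metrics["words"] += 1
    return metrics

# '\r' and '\f' only separate words when the wider set is used.
data = b"a\rb\x0cc d\n"
print(count_stream(io.BytesIO(data)))
print(count_stream(io.BytesIO(data), whitespace=b"\n\t "))
```

If the counts for a binary file line up with GNU wc only under the wider set, that would support the guess. Locale-dependent behaviour in GNU wc could still account for any remaining difference.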

deadPix3l commented 5 years ago

I have looked at that before. Have you executed your sample code? It seems to me that isalpha doesn't accept digits or symbols. Therefore "word2word" is technically 2 words, and the counter is incremented on the first alpha character of each one (the "w").
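A quick sketch of that difference, assuming ASCII-only input as in the "C" locale. count_alpha_runs is an illustrative name, not something from darkbox or the quoted wc.c.

```python
def count_alpha_runs(data: bytes) -> int:
    """Count maximal runs of ASCII letters, like the quoted isword()/getword()."""
    words = 0
    in_word = False
    for byte in data:
        # bytes.isalpha() is True only for ASCII letters.
        if bytes([byte]).isalpha():
            if not in_word:
                in_word = True
                words += 1
        else:
            in_word = False
    return words

def count_whitespace_delimited(data: bytes) -> int:
    """Count words the whitespace-delimited way; split() drops empty fields."""
    return len(data.split())

text = b"word2word"
print(count_alpha_runs(text))
print(count_whitespace_delimited(text))
```

The letter-run version breaks the word at the digit, while the whitespace-delimited version treats the whole token as one word.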

I had something like this running but never committed it because it was dirty, and while better, still inaccurate. But I will give this a shot again.

I'm not really concerned with binary files, because word and line counts on binary data don't make much logical sense or provide any value.

AbnormalSec / darkbox

wc word count incorrect #22