Open deadPix3l opened 5 years ago
Messed with this for a bit, found some wc.c source code here.
The important bit is this:
/* Return true if C is a valid word constituent */
static int
isword (unsigned char c)
{
  return isalpha (c);
}

/* Increase character and, if necessary, line counters */
#define COUNT(c)      \
  ccount++;           \
  if ((c) == '\n')    \
    lcount++;

/* Get next word from the input stream. Return 0 on end
   of file or error condition. Return 1 otherwise. */
int
getword (FILE *fp)
{
  int c;
  int word = 0;

  if (feof (fp))
    return 0;

  while ((c = getc (fp)) != EOF)
    {
      if (isword (c))
        {
          wcount++;
          break;
        }
      COUNT (c);
    }

  for (; c != EOF; c = getc (fp))
    {
      COUNT (c);
      if (!isword (c))
        break;
    }

  return c != EOF;
}
The problem with your current code:
if curr_char in b'\n\t ': file_metrics['words'] += 1
is that it increments every time it sees one of ('\n', '\t', ' '), so a run of several of them in a row keeps incrementing and the word count comes out too high.
That's why the above C is kinda gross. It translates roughly to this in Python:
curr_char = f.read(1)
while curr_char != b'':
    while curr_char != b'':
        if curr_char not in b'\n\t ':
            file_metrics['words'] += 1
            break
        file_metrics['bytes'] += 1
        if curr_char == b'\n':
            file_metrics['lines'] += 1
        curr_char = f.read(1)
    while curr_char != b'':
        file_metrics['bytes'] += 1
        if curr_char == b'\n':
            file_metrics['lines'] += 1
        if curr_char in b'\n\t ':
            break
        curr_char = f.read(1)
    curr_char = f.read(1)
The above code works perfectly on files that aren't binary. For binary files it gets closer to the expected counts, but it's still not exact. I'm not sure why, but maybe this helps.
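To make the overcounting concrete, here is a minimal sketch comparing the two approaches on a small sample. The function names `naive_word_count` and `run_word_count` are made up for illustration and are not part of darkbox; the whitespace set is the same three bytes used in the snippets above.

```python
def naive_word_count(data: bytes) -> int:
    """Count every whitespace byte as a word break (the approach that overcounts)."""
    return sum(1 for byte in data if byte in b"\n\t ")


def run_word_count(data: bytes) -> int:
    """Count maximal runs of non-whitespace bytes, one word per run."""
    words = 0
    in_word = False
    for byte in data:
        if byte in b"\n\t ":
            in_word = False
        elif not in_word:
            # First non-whitespace byte after whitespace starts a new word.
            in_word = True
            words += 1
    return words


# Three words, but six whitespace bytes because of the repeated separators.
sample = b"one  two\n\n\tthree\n"
print(naive_word_count(sample), run_word_count(sample))
```

The only difference is the `in_word` flag: a word is counted when a run starts, not each time a separator is seen.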
I have looked at that before. Have you executed your sample code? It seems to me that isalpha doesn't match digits or symbols. Therefore "word2word" is technically 2 words, and the counter is incremented on the first alpha character of each run (the "w").
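A quick way to check the "word2word" point is to port the isalpha rule literally. This is only a sketch: `count_alpha_words` is a hypothetical name, and it assumes the C code runs in the C locale, where isalpha accepts only ASCII letters (which is also what Python's `bytes.isalpha` does).

```python
def count_alpha_words(data: bytes) -> int:
    """Count runs of ASCII letters, mirroring the quoted isword()/getword() logic."""
    words = 0
    in_word = False
    for i in range(len(data)):
        # Slice instead of index so we get a 1-byte bytes object with .isalpha().
        is_alpha = data[i:i + 1].isalpha()
        if is_alpha and not in_word:
            words += 1  # counter goes up on the first letter of each run
        in_word = is_alpha
    return words


for text in (b"word2word", b"hello world", b"123 456"):
    # Compare against whitespace-delimited counting via bytes.split().
    print(text, count_alpha_words(text), len(text.split()))
```

With this rule the digit in "word2word" splits it into two words, and a line made only of digits has no words at all, while whitespace-delimited counting treats them as one and two words respectively.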
I had something like this running but never committed it because it was dirty, and while better, still inaccurate. But I will give this a shot again.
I'm not really concerned with binary files because word and line count on binary doesn't make much logical sense or provide any value.
Currently darkbox.wc counts whitespace chars ('\n', '\t' and ' ') to determine word breaks. This results in a word count which is much higher than what GNU wc reports.
Figure out how GNU wc determines word boundaries.
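As a starting point, POSIX describes a word for wc as a non-empty string of characters delimited by white space. Below is a rough sketch of a streaming counter built on that definition. It is not taken from GNU's source: `count_stream`, the `WHITESPACE` set (the bytes isspace accepts in the C locale), and the returned keys (chosen to mirror `file_metrics`) are all assumptions, and results may still differ from GNU wc on binary input or in multibyte locales.

```python
import io

# Assumption: C-locale isspace() set: space, \t, \n, \r, vertical tab, form feed.
WHITESPACE = frozenset(b" \t\n\r\x0b\x0c")


def count_stream(fp, chunk_size=65536):
    """Count lines, words and bytes from a binary file object, chunk by chunk."""
    lines = words = nbytes = 0
    in_word = False  # carried across chunks so a split word is counted once
    while True:
        chunk = fp.read(chunk_size)
        if not chunk:
            break
        nbytes += len(chunk)
        lines += chunk.count(b"\n")
        for byte in chunk:
            if byte in WHITESPACE:
                in_word = False
            elif not in_word:
                in_word = True
                words += 1
    return {"lines": lines, "words": words, "bytes": nbytes}


# A tiny chunk size forces words to straddle chunk boundaries.
print(count_stream(io.BytesIO(b"hello world\n"), chunk_size=3))
```

Reading in chunks avoids one `read(1)` call per byte, and keeping `in_word` outside the loop is what makes the count independent of where chunk boundaries fall.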