ashvardanian / StringZilla

Up to 10x faster strings for C, C++, Python, Rust, and Swift, leveraging NEON, AVX2, AVX-512, and SWAR to accelerate search, sort, edit distances, alignment scores, etc 🦖
https://ashvardanian.com/posts/stringzilla/
Apache License 2.0

Standard-compliant `wc` implementation #97

ashvardanian commented 8 months ago

The 4c738ea446cb8f9041077ac6364557527e8fc427 commit introduces a prototype of a StringZilla-based command-line toolkit, including a wc utility replacement. The original prototype suggests a 3x performance-improvement opportunity, but it currently can't handle multiple inputs or flags. Those should be easy to add in cli/wc.py, purely in the Python layer. We are aiming to match the following specification:

    $ wc --help
    Usage: wc [OPTION]... [FILE]...
      or:  wc [OPTION]... --files0-from=F
    Print newline, word, and byte counts for each FILE, and a total line if
    more than one FILE is specified.  A word is a non-zero-length sequence of
    characters delimited by white space.

    With no FILE, or when FILE is -, read standard input.

    The options below may be used to select which counts are printed, always in
    the following order: newline, word, character, byte, maximum line length.
      -c, --bytes            print the byte counts
      -m, --chars            print the character counts
      -l, --lines            print the newline counts
          --files0-from=F    read input from the files specified by
                               NUL-terminated names in file F;
                               If F is - then read names from standard input
      -L, --max-line-length  print the maximum display width
      -w, --words            print the word counts
          --help     display this help and exit
          --version  output version information and exit

    GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
    Report any translation bugs to <https://translationproject.org/team/>
    Full documentation <https://www.gnu.org/software/coreutils/wc>
    or available locally via: info '(coreutils) wc invocation'
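
The flag surface above is small enough for the Python layer to own entirely. A minimal argparse sketch of what that could look like in cli/wc.py; the layout and version string are illustrative, not the repository's actual code:

    import argparse

    def make_parser() -> argparse.ArgumentParser:
        # `add_help=False` lets us register `--help` without the `-h` alias,
        # matching the GNU option list above.
        parser = argparse.ArgumentParser(prog="wc", add_help=False)
        parser.add_argument("files", nargs="*", default=["-"])
        parser.add_argument("-c", "--bytes", action="store_true")
        parser.add_argument("-m", "--chars", action="store_true")
        parser.add_argument("-l", "--lines", action="store_true")
        parser.add_argument("-L", "--max-line-length", action="store_true")
        parser.add_argument("-w", "--words", action="store_true")
        parser.add_argument("--files0-from", metavar="F")
        parser.add_argument("--help", action="help")
        parser.add_argument("--version", action="version", version="wc (StringZilla)")
        return parser
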
ashvardanian commented 8 months ago

I've merged some intermediate patches by @ghazariann, but some parts have to be reimplemented, like this one:

    if args.max_line_length:
        # Materializes a full `str` copy of the mapped file, then a list
        # of all its lines - two extra passes and copies over the input.
        max_line_length = max(len(line) for line in str(mapped_bytes).split("\n"))
        counts["max_line_length"] = max_line_length

Converting the mapped buffer to str is expensive, and splitting the result is even more so.
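
A minimal sketch of a replacement that stays at the bytes level, assuming mapped_bytes supports the bytes-like find (as mmap objects do). Note that GNU wc -L measures display width (tab stops, wide characters), which this byte-length scan does not attempt:

    def max_line_length(mapped_bytes) -> int:
        # Scan newline offsets in place; never materializes a `str`
        # copy of the file or a list of its lines.
        longest, start = 0, 0
        while True:
            end = mapped_bytes.find(b"\n", start)
            if end == -1:  # the last line may lack a trailing newline
                return max(longest, len(mapped_bytes) - start)
            longest = max(longest, end - start)
            start = end + 1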

MarkReedZ commented 5 months ago

We're missing tests and don't handle locale.

Some thoughts on tests (a test sketch follows the list below):

Stdin

Redirection - note that --files0-from needs to pull a NUL-delimited list of file names.

    find . -name '*.[ch]' -print0 | wc -L --files0-from=-
    cat xxx | wc -l

Word Count

We currently count only spaces, so add tests for adjacent whitespace and for non-space whitespace characters.

Line Count

If a file ends in a non-newline character, its trailing partial line is not counted.

Max Line Length

Tab stops are set every 8 columns. Display widths of wide characters are taken into account. Non-printable characters are given zero width.

Locale

-m, --chars: print only the character counts, as per the current locale (UTF-8 and UTF-16 support needed). Encoding errors are not counted. See locale.getencoding() / locale.setlocale().

-w, --words: uses locale-specific whitespace.

https://www.gnu.org/software/coreutils/manual/html_node/wc-invocation.html
https://www.mkssoftware.com/docs/man1/wc.1.asp
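
A sketch of how such tests could compare against the system wc (hypothetical test module; assumes GNU coreutils' wc is on PATH, and `count_words` stands in for whatever entry point the toolkit exposes):

    import subprocess

    # Hypothetical import: the word-counting function under test.
    from cli.wc import count_words

    CASES = [
        b"hello   world\n",         # adjacent whitespace between words
        b"tab\tand\x0bvertical\n",  # non-space whitespace delimiters
        b"no trailing newline",     # partial last line: excluded by `wc -l`
    ]

    def gnu_wc(data: bytes, flag: str) -> int:
        # Ground truth from the system `wc`.
        done = subprocess.run(["wc", flag], input=data, capture_output=True, check=True)
        return int(done.stdout.split()[0])

    def test_word_counts_match_gnu():
        for data in CASES:
            assert count_words(data) == gnu_wc(data, "-w")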

ashvardanian commented 5 months ago

Can we detect those locale-based settings in the Python implementation of wc, without changing the core C implementation or the Python binding?

MarkReedZ commented 5 months ago

For counting characters we can call locale.getencoding() in Python; a naive approach would then be len(bytes.decode('utf-8')), which would not be performant. Ultimately we'd want to scan for Unicode lead bytes (& 0x80) and consume each one as a whole character, since a character can be 2-4 bytes long. If the library does not have a way to find bytes with the high bit set (& 0x80), we'd have to add it.
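
For reference, the character count falls out of the lead-byte structure directly: in valid UTF-8, every character contributes exactly one byte whose top two bits are not 10. A scalar sketch of that invariant (a vectorized version would belong in the C layer):

    def utf8_char_count(data: bytes) -> int:
        # Continuation bytes match 0b10xxxxxx; every other byte
        # starts a new character.
        return sum((b & 0xC0) != 0x80 for b in data)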

For counting words, I believe we want a find_charset function that we can use with the whitespace character set.
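
A scalar reference for the word count; the hypothetical find_charset would eventually replace the per-byte loop with a vectorized scan:

    import string

    WHITESPACE = frozenset(string.whitespace.encode())

    def word_count(data: bytes) -> int:
        # A word starts at every whitespace-to-non-whitespace transition.
        words, in_word = 0, False
        for b in data:
            if b in WHITESPACE:
                in_word = False
            elif not in_word:
                words, in_word = words + 1, True
        return words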

ashvardanian commented 5 months ago

For the first part, we can temporarily compensate by performing several passes over the data - one for each multi-byte rune width.
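
A sketch of that multi-pass idea: each pass counts one lead-byte class, which could map onto a single vectorized range-count over the buffer (the scalar sums here are stand-ins for such a primitive, and valid UTF-8 is assumed):

    def utf8_char_count_multipass(data: bytes) -> int:
        # A 2-byte rune hides 1 continuation byte, a 3-byte rune 2,
        # a 4-byte rune 3; subtract them all from the byte total.
        n2 = sum(0xC0 <= b <= 0xDF for b in data)  # pass 1: 2-byte leads
        n3 = sum(0xE0 <= b <= 0xEF for b in data)  # pass 2: 3-byte leads
        n4 = sum(0xF0 <= b <= 0xF7 for b in data)  # pass 3: 4-byte leads
        return len(data) - n2 - 2 * n3 - 3 * n4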

MarkReedZ commented 5 months ago

UTF-8 looks like this - once you see the leftmost bit set, you can count the leading one-bits to get the character's size in bytes. Text in languages like Chinese consists almost entirely of multi-byte characters. I speak Chinese, so I optimized this in mrjson. I'll set up tests next.

    110xxxxx 10xxxxxx
    1110xxxx 10xxxxxx 10xxxxxx
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
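
A scalar sketch of that decode step, assuming a valid UTF-8 lead byte (a production version would also validate the continuation bytes):

    def utf8_rune_width(lead: int) -> int:
        # The width equals the count of leading one-bits in the lead byte;
        # ASCII (leftmost bit clear) is a single byte.
        if lead < 0x80:
            return 1
        if lead >= 0xF0:
            return 4  # 11110xxx
        if lead >= 0xE0:
            return 3  # 1110xxxx
        return 2      # 110xxxxx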