expr-fi / fastlwc

SIMD-enhanced word counter
MIT License
242 stars 10 forks source link

Beating wc with C

Background

As you're here, you've probably seen those posts. Arguably, this one might be a bit more boring, as there's no brief introduction to a weird language here; it's all just the C you already know (or don't), and the algorithm used is just as simple as the task at hand (more on that below). Still, intrigued by the entries mentioned above, the goal here is to find out exactly how fast is fast when it comes to wc.

A word on words, or the differences between actual wc implementations

A word, as far as wc is concerned, is a non-zero-length string of characters, delimited by white space (POSIX.1-2017). As some implementations have noted, no distinction is made between printable and non-printable characters, yet for example GNU wc ignores non-printable (non-whitespace) characters when counting words. Historical implementations regarded only spaces, tabs and newlines as whitespace – a straightforward approach, which can still be seen in use today. POSIX.1, however, recommends the use of an equivalent of isspace() from the C standard library.

All non-historical implementations support multibyte characters, including multibyte whitespace characters (when using a multibyte locale). However, some of them only do so when compelled with the -m option, which is used to print input lengths in characters instead of bytes. Other implementations recognise multibyte whitespace characters when counting words regardless of that option. This not only alters the behaviour, but has also considerable performance implications when wc is not given any options, i.e. the default use case, as highlighted below. (This disparity has been mistakenly attributed to different reasons in previous discussions.)

$ export LC_CTYPE=en_US.UTF-8
$ time_real bin/bsd-wc sample.txt
real 4.70
$ time_real /usr/bin/wc sample.txt
real 8.92
$ time_real bin/bsd-wc -lmw sample.txt
real 20.53
$ time_real /usr/bin/wc -lmw sample.txt
real 8.93

Further, the behaviour of iswspace() from the C standard library might surprise some, as it does not report non-breaking spaces as spaces, unlike many similar functions in other languages, or what intuition might perhaps suggest. GNU wc chooses to treat those characters as word separators, anyway.

Finally, rounding up this non-exhaustive list are the varying approaches to invalid multibyte sequences (when they are decoded at all, as touched earlier). Namely, some implementations count all undecodable bytes as non-whitespace characters, for both character and word counting, while others ignore them likewise.

All that being said, this application “averts” these issues by being far simpler; it doesn't support the -m option (in fact, it doesn't support any options at all, because it's not a replacement for your wc!) or multibyte characters in general; in other words, its target is to behave just like Apple's wc when given no options. For a fairer performance comparison, the non-multibyte POSIX locale is used for the tests at the end. As you can see below, that, too, has a significant effect on performance.

$ LC_CTYPE=POSIX time_real /usr/bin/wc sample.txt
real 3.96
$ LC_CTYPE=en_US.UTF-8 time_real /usr/bin/wc sample.txt
real 8.92

About the algorithm

As the saying goes about some software patents reducing basically to “doing X, with a computer”, what's going on here is little more than “doing X, with SIMD”. Word counting in wc is typically done by looping over the input characters, one by one, keeping state of the whitespace-ness of the previous character to know when to increase the counter. This process can be expressed as a couple of binary operations, by first mapping the input as whitespace (0) and non-whitespace (1). Shifting that binary value by one digit and inverting it produces a mask that can be used to find word boundaries, which computers are great at counting.

Just a sample (multiple   spaces).
1111010111111011111111100011111111             =[1]
0111101011111101111111110001111111 SHIFT [1]   =[2]
1000010100000010000000001110000000 NOT [2]     =[3]
1000010100000010000000000010000000 [1] AND [3]

About the files

fastlwc.c (single-threaded)
fastlwc-mt.c uses multiple threads with seekable inputs (you can control the number of threads used by setting the environment variable OMP_NUM_THREADS; don't expect to make any real gains with SMT/Hyper-threading)
bsd-wc.c is the reference implementation we test against (it's basically wc from opensource.apple.com)

Enough. Is it any faster?

Well, as always, it depends. If your results match mine, you might see something like this:

$ ls -hs big256.txt
1.6G big256.txt
$ export LC_CTYPE=POSIX
$ time -p /bin/cat big256.txt >/dev/null
real 0.17
user 0.00
sys 0.17
$ time -p /usr/bin/wc big256.txt
  32884992  280497920 1661098496 big256.txt
real 6.55
user 6.29
sys 0.26
$ time -p bin/bsd-wc big256.txt
 32884992 280497920 1661098496 big256.txt
real 7.79
user 7.41
sys 0.36
$ time -p bin/fastlwc big256.txt
 32884992 280497920 1661098496 big256.txt
real 0.22
user 0.06
sys 0.16
$ time -p bin/fastlwc-mt big256.txt
 32884992 280497920 1661098496 big256.txt
real 0.07
user 0.10
sys 0.18

For additional remarks, see the utf-8 directory.