On the performance of radix sort.

It seems to me that haskell version of radix sort in examples/sort/radix/ can be easily optimised to outperform HVM significantly. With just -fllvm and these three bangs added

- data Arr = Null | Leaf  Word64 | Node  Arr  Arr deriving Show
- data Map = Free | Used         | Both  Map  Map deriving Show
+ data Arr = Null | Leaf !Word64 | Node  Arr  Arr deriving Show
+ data Map = Free | Used         | Both !Map !Map deriving Show

on my Ryzen 3500U (4c8t) I get

GHC-OLD-1 | sort/radix | 20 | 5.794s
GHC-OLD-1 | sort/radix | 21 | 11.456s
GHC-OLD-1 | sort/radix | 22 | 22.594s
GHC-OLD-1 | sort/radix | 23 | 45.181s
GHC-OLD-1 | sort/radix | 24 | 117.001s

GHC-NEW-1 | sort/radix | 20 | 0.713s
GHC-NEW-1 | sort/radix | 21 | 1.451s
GHC-NEW-1 | sort/radix | 22 | 2.916s
GHC-NEW-1 | sort/radix | 23 | 5.835s
GHC-NEW-1 | sort/radix | 24 | 11.682s

HVM-1 | sort/radix | 20 | 6.007s
HVM-1 | sort/radix | 21 | 11.180s
HVM-1 | sort/radix | 22 | 21.369s
HVM-1 | sort/radix | 23 | 40.562s
HVM-1 | sort/radix | 24 | 75.365s

HVM-2 | sort/radix | 20 | 3.945s
HVM-2 | sort/radix | 21 | 6.953s
HVM-2 | sort/radix | 22 | 13.370s
HVM-2 | sort/radix | 23 | 24.608s
HVM-2 | sort/radix | 24 | 42.997s

HVM-4 | sort/radix | 20 | 2.886s
HVM-4 | sort/radix | 21 | 4.953s
HVM-4 | sort/radix | 22 | 9.374s
HVM-4 | sort/radix | 23 | 16.769s
HVM-4 | sort/radix | 24 | 27.646s

HVM-8 | sort/radix | 20 | 2.377s
HVM-8 | sort/radix | 21 | 4.121s
HVM-8 | sort/radix | 22 | 7.444s
HVM-8 | sort/radix | 23 | 14.349s
HVM-8 | sort/radix | 24 | 23.227s

I.e. single-threaded haskell version is 2-3 times faster than hvm with 8 threads. That makes me question the value of the original benchmark.

Now I really wonder what would it take to optimise hvm version to beat haskell again. I find this question really important, as I obviously like the idea of good performance on non-optimised code, but the ability to get high performance after a reasonable amount of manual optimisations is, in my opinion, much more important.

Edit: HVM now has strict arguments. See this comment for updated benchmarks.

Wow! GHC getting a 10x speedup with a few small changes was unexpected to me. But that's not too surprising either. The assembly generated by GHC will sometimes be vastly superior than HVM's, simply because it has received a few orders of magnitude more investment and development - and I hope I've been clear about that in my communication. For example, if you write a tail-recursive function, GHC will generate a tight loop, while HVM will do tons of unnecessary allocations, making it much slower. In this case, there are several allocations on the base case that could be skipped and inlined thanks to the ! annotations, resulting in the speedup you see. In particular, both Leaf n and Both _ _ could be represented by a single word, but HVM doesn't have unboxing yet, so it is allocating >2x more memory than your unboxed GHC version.

Note that's not a limitation of the tech, but a reflect of the project's maturity. GHC was released 30 years ago and has been optimized and improved by major universities and companies over 3 decades. Meanwhile, HVM was mostly developed by a single developer over a few months of his spare time. What I want to show with this project though is that the underlying tech is really solid and, with proper investment, has a lot of potential to become a powerful tool for humanity. Once we compile to LLVM and implement inlining, unboxing, proper loops and branching, TCO, CSE, strictness/atomics analyzer, and so on, HVM will be so much faster. But, for now, that's what we got.

I hope people don't get discouraged of the tech if it doesn't always outperform GHC today - I mean, given the astronomical investment diff, it's a miracle it does so it in so many cases - but I can respect if that's the ultimate take here.

GHC getting a 10x speedup with a few small changes was unexpected to me.

Sadly, it is quite a common thing, especially with strictness-related changes. I am by no means happy about that, but so is the life.

In this case, there are several allocations on the base case that could be skipped and inlined thanks to the ! annotations, resulting in the speedup you see. ..., but HVM doesn't have unboxing yet, so it is allocating >2x more memory than your unboxed GHC version.

I should note that it is not unboxing that made most of the speedup for ghc here, but elimination of a space leak. To wit, instead of making fields strict, I could also make merge use strict application

- merge (Both a b) (Both c d) = Both (merge a c) (merge b d)
+ merge (Both a b) (Both c d) = (Both $! merge a c) $! merge b d

and it would result in following timings

GHC_STRICT_APPLY-1 | sort/radix | 20 | 0.826s
GHC_STRICT_APPLY-1 | sort/radix | 21 | 1.728s
GHC_STRICT_APPLY-1 | sort/radix | 22 | 3.443s
GHC_STRICT_APPLY-1 | sort/radix | 23 | 6.917s
GHC_STRICT_APPLY-1 | sort/radix | 24 | 13.959s

GHC_STRICT_FIELDS-1 | sort/radix | 20 | 0.703s
GHC_STRICT_FIELDS-1 | sort/radix | 21 | 1.444s
GHC_STRICT_FIELDS-1 | sort/radix | 22 | 2.876s
GHC_STRICT_FIELDS-1 | sort/radix | 23 | 5.903s
GHC_STRICT_FIELDS-1 | sort/radix | 24 | 11.660s

So while unboxing Word64 gives a notable speedup, what is really important here is avoiding space leak. As far as I understand, HVM as of today has no analogues to seq, which is super-important for performance in thunk-based lazy evaluated runtimes. In the original post I was mostly asking whether seq of some analogue is appliable to HVM.

..., so it is allocating >2x more memory than your unboxed GHC version.

I don't know how to get allocation stats with HVM, but the difference of max heap residence is indeed huge here: 1030 MB for strict-apply and 991 MB strict-fields vs. 10816 MB for HVM.

By the way, it also clearly shows that the impact of unboxing here is quite low. The real difference here makes strictness - the original, leaky program used to require about 17 GB of memory.

In particular, both Leaf n and Both _ _ could be represented by a single word, ...

I think I am lost here... I can imagine packing 60-bit integer and tag into a single word, but I don't think I understand how you're going to pack two pointers and tag into a word. FYI, even with Word64 unboxed, Leaf and Both use 2 and 3 words of heap resp. with ghc. No magic here.

Note that's not a limitation of the tech, but a reflect of the project's maturity...

I understand that. I was merely making a point that claims that HVM can sort linked lists 9 times faster than GHC are not exactly true as of today.

I hope people don't get discouraged of the tech if it doesn't always outperform GHC today...

To be honest, as of now it seems to me more like 'HVM quite rarely outperforms GHC today'. HVM clearly wins on code requiring optimal reduction, but on a dozen of others, much more realistic, I dare to say, benchmarks I've tried HVM seems to seriously slower than properly written haskell, often even asymptotically so (likely because of #60; I would probably post some code samples there soon).

Now, to be clear, I'm by no means trying to say that HVM is useless or that you should abandon it. I understand that the project is in early stages of development, has no/little funding, etc. I just find claims like

Despite being relatively new, it already beats mature compilers in many cases, ...

..., allowing it to sort large lists 9x faster than GHC!

quite misleading with today's HVM.

I wrote two identical programs, benchmarked them and reported that HVM performed significantly better, and backed up my claims with replicable code. You then took that code, modified it by adding relevant optimizations that are only implemented on GHC (for now), and pointed your changes sped it up considerably. Well, sure, I see how the fact that small changes cause GHC to outperform HVM is relevant, but I wasn't aware of that before I published the results, so it seems dishonest to call me misleading. In fact, I was careful enough to claim that HVM performs better given identical inputs, and that is still true. I also never claimed that HVM outperforms GHC in all cases. Quite the opposite; across every text I write, I'm overly careful with my wording to make it clear that the improvements are limited in scope, and to only promise what I know we can deliver. That sentence you quoted, in particular, should definitely have been worded better, and I've just improved it.

What I do claim, though - and I stand by that claim - is that HVM has the potential to outperform GHC in a significant number of cases (but not all) once sufficiently mature. Issue #60 is indeed a serious theoretical issue that could be problematic, as is the fact HVM doesn't cover all λ-terms. These issues may mean it will never be able to fully replace GHC, and that's fine, since this isn't the goal at all. The goal is to provide a new tool capable of empowering programmers, and let them do things that weren't possible before. I'm confident HVM is set to accomplish that goal. This current thread, on the other hands, is just pointing the obvious - i.e., that GHC outperforms HVM in some cases today. You don't even need to go that far: just write a simple tail-recursive loop, and GHC will probably run it 10x faster than HVM. I never claimed otherwise and, if you see any other part of the vast written documentation that sounds misleading in that sense (or if you think I could add a remark somewhere), please let me know and I will fix it ASAP.

I think I am lost here... I can imagine packing 60-bit integer and tag into a single word, but I don't think I understand how you're going to pack two pointers and tag into a word.

I meant that Both _ _ can be represented by a single word when both fields are a base constructor; isn't that how GHC unboxes it in that case? If no, I guess I'm mistaken and I apologize; that was just a guess.

In the original post I was mostly asking whether seq of some analogue is appliable to HVM.

Yes, we have plans to add seq, although it would need to be linear to make sense in the context of HVM.

I don't know how to get allocation stats with HVM, but the difference of max heap residence is indeed huge here: 1030 MB for strict-apply and 991 MB strict-fields vs. 10816 MB for HVM.

HVM still have a very naive allocator that reserves ~3/4 of the available system memory when it starts up, so it'd use that amount of memory regardless of what you ran. As I've been saying, HVM is still very immature tech, that does a lot of things sub-optimally, and has a vast range of potential optimizations to be applied.

To be honest, as of now it seems to me more like 'HVM quite rarely outperforms GHC today'. HVM clearly wins on code requiring optimal reduction, but on a dozen of others, much more realistic, I dare to say, benchmarks I've tried HVM seems to seriously slower than properly written haskell, often even asymptotically so

Meanwhile, my experience is the opposite. I've implemented real-world programs in HVM and managed to achieve extremely good performance. One of the most notable examples is Kind, which is a complete proof checker that outperforms all the alternatives (Coq, Idris, Agda and Lean) by a significant margin. The benchmarks are already strong, and they're using an old, single-thread, non-compiled version of HVM; this should improve by about 10x soon. So, again, this is an entire proof checker, a complete real-world application, significantly outperforming the alternatives, which were written in Ocaml and Haskell. I can't see how one can see these overwhelming results and still claim with a straight face that HVM has no real-world application.

You then took that code, modified it by adding relevant optimizations that are only implemented on GHC (for now), and pointed your changes sped it up considerably.

And then I came here, asking for whether it is possible to optimise HVM version further. Because in my opinion, a high-performance runtime is that runtime which allows archiving good performance with enough knowledges and reasonable efforts. I definitely lack knowledges about HVM, so I asked for advice. But looks like HVM as of today just isn't a high-performance runtime according to my definition.

In fact, I was careful enough to claim that HVM performs better given identical inputs, and that is still true.

No doubt, I'm by no means saying you made up the results; they are perfectly reproducible. However, I feel like the fact that the program is suffering from severe space leak that is trivial to fix on GHC and impossible to fix on today's HVM is really important here, and deserves at least a small note. That is, I don't mean that these are the benchmark results are misleading, rather I think that the benchmark itself hardly illustrates everything about performance of HVM and GHC.

What I do claim, though - and I stand by that claim - is that HVM has the potential to outperform GHC in a significant number of cases (but not all) once sufficiently mature.

I am certainly not enough competent to judge potential of HVM, I can only hope you're right here.

This current thread, on the other hands, is just pointing the obvious - i.e., that GHC outperforms HVM in some cases today. You don't even need to go that far: just write a simple tail-recursive loop, and GHC will probably run it 10x faster than HVM.

I'm perfectly aware of that (through it seems more like 100x than 10x to me). The thing is that radix sort is mentioned in README as the case when HVM should outperform GHC, but it looks like it is correct only for concrete, inefficient (esp. in terms of memory used), implementation. The fact that GHC outperforms HVM in some (I would rather say 'most', once again) cases is indeed obvious, what I find unobvious here is what are some at least somehow real-world cases does HVM outperform GHC, if by cases I understand algorithms/tasks, not concrete (suboptimal) code fragments?

I meant that Both can be represented by a single word when both fields are a base constructor; isn't that how GHC unboxes it in that case? If no, I guess I'm mistaken and I apologize; that was just a guess.

Ah, you mean something like replacing Both Free Free with a new nullary constructor Both_Free_Free? No, GHC isn't doing anything like that, for better or worse.

Yes, we have plans to add seq, although it would need to be linear to make sense in the context of HVM.

That is good to know. I hardly can imagine a performant lazy evaluated runtime without optional strictness. Looking forward to rerunning this benchmark with strict applications once HVM would have those.

HVM still have a very naive allocator that reserves ~3/4 of the available system memory when it starts up, so it'd use that amount of memory regardless of what you ran.

I'm pretty sure that this is the real amount of physical memory used, as it clearly changes (through non-linear) with the size of the tree.

n	maximal residence, according to `time -v`
20	3041 MB
21	4802 MB
22	8320 MB
23	11666 MB
24	11902 MB

In case of GHC it also matches GHC's own rts stats perfectly, through it reserves terabytes of virtual memory on startup.

The benchmarks are already strong, and they're using an old, single-thread, non-compiled version of HVM; this should improve by about 10x soon. So, again, this is an entire proof checker, a complete real-world application, significantly outperforming the alternatives, which were written in Ocaml and Haskell. I can't see how one can see these overwhelming results and still claim with a straight face that HVM has no real-world application.

Sadly I cannot comment much on type checkers benchmarks, as I am completely unaware of implementations of typechecker for dependent types, except that it looks quite impressive indeed. What I find very illustrative, however, is the list-fold benchmark there. Firstly, I cannot reproduce the results with master.

(Fold Nil         c n) = n
(Fold (Cons x xs) c n) = (c x (Fold xs c n))

(Range 0 xs) = xs
(Range n xs) = let m = (- n 1); (Range m (Cons m xs))

(Main n) =
  let size = (* 1000000 n);
  let list = (Range size Nil);
  (Fold list @a@b(+ a b) 0)

import Data.Word
import System.Environment

data List a = Nil | Cons a (List a)

fold :: List a -> (a -> r -> r) -> r -> r
fold Nil         c n = n
fold (Cons x xs) c n = c x (fold xs c n)
-- Otherwise ghc would optimise `fold xs (+)` to `sum xs`, which feels like cheating.
{-# NOINLINE fold #-}

range :: Word64 -> List Word64 -> List Word64
range 0 xs = xs
range n xs = let m = n - 1 in range m (Cons m xs)

main :: IO ()
main = do
  n <- read . head <$> getArgs
  let size = 1000000 * n
  let list = range size Nil
  print $ fold list (+) 0

with HVM 0713836 and GHC 9.2.4 I'm getting following results

n	HVM	GHC	HVM / GHC
1	2.230 s ± 0.063 s	216.1 ms ± 16.1 ms	10.32
2	3.771 s ± 0.098 s	467.8 ms ± 26.9 ms	8.06
4	6.985 s ± 0.165 s	903.2 ms ± 41.5 ms	7.73
8	14.117 s ± 0.978 s	1.869 s ± 0.034 s	7.55
16	28.737 s ± 1.239 s	3.824 s ± 0.048 s	7.52

Which is quite far from 'similar performance on sequential recursive algorithms' if you ask me. I was able to get comparable results by manually optimising fold xs (+) to sum xs

(Sum Nil         a) = a
(Sum (Cons x xs) a) = (Sum xs (+ x a))

(Range 0 xs) = xs
(Range n xs) = let m = (- n 1); (Range m (Cons m xs))

(Main n) =
  let size = (* 1000000 n);
  let list = (Range size Nil);
  (Sum list 0)

import Data.Word
import System.Environment

data List a = Nil | Cons a (List a)

sun :: List Word64 -> Word64 -> Word64
sun Nil a = a
sun (Cons x xs) a = sun xs (x + a)

range :: Word64 -> List Word64 -> List Word64
range 0 xs = xs
range n xs = let m = n - 1 in range m (Cons m xs)

main :: IO ()
main = do
  n <- read . head <$> getArgs
  let size = 1000000 * n
  let list = range size Nil
  print $ sun list 0

n	HVM	GHC	HVM / GHC
1	872.4 ms ± 57.6 ms	132.2 ms ± 10.1 ms	6.60
2	1.041 s ± 0.046 s	313.7 ms ± 16.5 ms	3.32
4	1.248 s ± 0.056 s	610.0 ms ± 15.1 ms	2.05
8	1.731 s ± 0.050 s	1.151 s ± 0.019 s	1.50
16	2.807 s ± 0.101 s	2.464 s ± 0.032 s	1.14
32	4.649 s ± 0.378 s	5.243 s ± 0.044 s	0.88

which on its own looks quite good for HVM: it is even outperforming GHC on 32M. My point is, however, that both programs are utterly silly: they are materialising whole list before starting summation! Why would you do that in lazy language? And once we rewrite the program to generate the list lazily, on-demand, results are quite different.

(Range 0) = Nil
(Range n) = let m = (- n 1); (Cons m (Range m))

(Sum Nil         a) = a
(Sum (Cons x xs) a) = (Sum xs (+ x a))

(Main n) =
  let size = (* 1000000 n);
  let list = (Range size);
  (Sum list 0)

import Data.Word
import System.Environment

data List a = Nil | Cons a (List a)

sun :: List Word64 -> Word64 -> Word64
sun Nil a = a
sun (Cons x xs) a = sun xs (x + a)

range :: Word64 -> List Word64
range 0 = Nil
range n = let m = n - 1 in Cons m (range m)

main :: IO ()
main = do
  n <- read . head <$> getArgs
  let size = 1000000 * n
  let list = range size
  print $ sun list 0

n	HVM	GHC	HVM / GHC
1	890.3 ms ± 72.6 ms	23.9 ms ± 3.2 ms	37.21
2	30.0 ms ± 12.2 ms	1.011 s ± 0.061 s	33.68
4	1.271 s ± 0.050 s	47.3 ms ± 12.0 ms	26.86
8	1.849 s ± 0.226 s	74.5 ms ± 13.4 ms	24.83
16	2.897 s ± 0.062 s	129.2 ms ± 15.4 ms	22.43
32	4.910 s ± 0.200 s	246.6 ms ± 11.8 ms	19.91
64	8.134 s ± 0.754 s	476.6 ms ± 23.1 ms	17.07
128	13.652 s ± 1.212 s	936.2 ms ± 21.2 ms	14.58

Oops...

This is why I'm saying I haven't really seen any properly written programs that HVM is executing faster or at least comparable to GHC. It is definitely possible to construct infinitely similar programs HVM would outperform GHC on; the thing is that, in my opinion, it doesn't matter that much how fast the runtime is on executing deliberately suboptimal programs if it is bad at executing near-optimal ones, which is, as far as I can see, exactly the case for HVM.

Anyway, I think I got the answer I was looking for opening this issue -- seq for HVM is possible, just not implemented yet. I'm closing this, then.

I'm pretty sure that this is the real amount of physical memory used, as it clearly changes (through non-linear) with the size of the tree.

Yes I misunderstood what you was measuring. Point takes.

But looks like HVM as of today just isn't a high-performance runtime according to my definition.

Yes, absolutely. HVM is meant to let us have a sneak peek on what would be possible if this tech was mature enough, and it stands no change of outperforming GHC in all cases, as it doesn't have a fraction of the micro optimizations.

I didn't have the time to investigate your fold implementations and will do that soon (some of these results look like a bug), but one thing that I've noticed, and I hope you take that in consideration, is that most of these optimizations seem to fall in the micro-benchmark trap. As in, seems like GHC is taking radix sort, some of the fold examples, and compiling to tight ASM, in a way that isn't representative of real-world programs, where data comes from the outside, immutable structures and memory allocation dominate, pattern-matching amounts for most computational cost, and garbage-collection becomes an issue.

what I find unobvious here is what are some at least somehow real-world cases does HVM outperform GHC

Ironically, I do feel like GHC has a big edge on these micro-benchmarks, while HVM would perform much better in real world applications, where the parallelization opportunities are vast, the GC-freedom is relevant, and you can't magically turn small algorithms into tight C-like loops or pre-compute a lot of things at compile time. And this can be seen when you acknowledge that Kind performs much better than Agda, a proof checker written in Haskell. Both projects are large enough, and complex enough, that no amount of unboxing, strictness annotations and similar micro optimization wizardry will change their results significantly. Kind is the first real-world application to be developed in HVM, and I believe we'll see that more and more as new applications are developed.

All in all, I get your take and respect it. You're absolutely right that HVM isn't in all cases comparable with GHC today, and I hope it is clear that I never claimed that, but I do believe it already has the edge in important cases, and I believe this will only improve as HVM has a lot of room to grow, while GHC is more or less stable. I'll keep working to push this tech forwards and will keep improving and reporting the results, but I hope you take the arguments I'm giving from the lens of someone who is actually willing to concede, in the same way I'll concede if the ultimate conclusion is that HVM doesn't have a lot of practical value and will always be just a experiment.

Hello again. HVM now has strict arguments. I've also added a bunch of optimizations; see the commit above. You can now emulate strict constructors on HVM by writing an auxiliary function, as follows:

// Strict Constructors
(BOTH !a !b) = (Both a b)
(LEAF !a) = (Leaf a)

These changes nearly doubled the performance of Radix Sort on HVM. On my machine (Apple M1 Max), sorting a list with 2^25 elements on GHC, with your proposed optimizations, takes 12.5 seconds. On HVM, with the new changes, it now takes 5.5 seconds (down from 10 seconds before these optimizations). So, while your changes do cause GHC outperform HVM in a single thread, HVM is faster with 4 or more threads. I hope we'll be able to close that gap as we add more optimizations.

HigherOrderCO / HVM

On the performance of radix sort. #172