Your `repeat` is probably equivalent to `make-list` (except I would expect the stdlib version to be faster :P).
The reason I stayed away from it is the caveat mentioned by `make-list`: the values are all exactly the same, so fanning out a mutable vector (for example) gives values that all mutate the same underlying data. The more I think about it, though, using `build-list` with `(thunk v)` does the same thing with more indirection, so I think using `make-list` or `repeat` is probably faster for that reason. (Have you tried defining `repeat` in terms of `make-list`? I wonder if that speeds things up.)
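For concreteness, a minimal sketch of that suggestion, assuming `repeat` takes a count and a value and returns a list of that many copies (the actual signature may differ):

```racket
(require racket/list)

;; hypothetical: repeat expressed directly in terms of make-list,
;; assuming a (repeat n v) -> list signature
(define (repeat n v)
  (make-list n v))

;; (repeat 3 'a)  ; => '(a a a)
```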
I would also recommend keeping the original clause with `[(_ ((~datum fanout) n:number))` for that compile-time nice-ness when the value is a literal, but that's just my opinion.
I am mildly surprised that using `(apply values (apply append (repeat n args)))` is faster than `repeat-values`, since they are equivalent. Probably the latter isn't being inlined…
> Your `repeat` is probably equivalent to `make-list` (except I would expect the stdlib version to be faster :P).
Didn't know about `make-list`, and I too would expect it to be faster. Oddly enough, it appears to be slightly slower than using `repeat`. I looked at the implementation and it's this:
```racket
(define (make-list n x)
  (unless (exact-nonnegative-integer? n)
    (raise-argument-error 'make-list "exact-nonnegative-integer?" 0 n x))
  (let loop ([n n] [r '()])
    (if (zero? n) r (loop (sub1 n) (cons x r)))))
```
Similar, except that it is tail recursive (should be faster?) and does some error handling. I suppose the net of that ends up being negative? The difference was slight, FTR, something like 240ms vs 225ms on average (using `(~> (v ...) (-< + count) / round)` on the times from multiple runs :) ). I'll try a tail-recursive version of `repeat` and see if that improves things even more!
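For reference, a tail-recursive `repeat` along the lines of the `make-list` implementation above might look something like this (a sketch, not the actual code in this repo):

```racket
;; hypothetical tail-recursive repeat, modeled on make-list above
;; (minus the argument check)
(define (repeat n v)
  (let loop ([n n] [acc '()])
    (if (zero? n)
        acc
        (loop (sub1 n) (cons v acc)))))
```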
> The reason I stayed away from it is the caveat mentioned by `make-list`: the values are all exactly the same, so fanning out a mutable vector (for example) gives values that all mutate the same underlying data. The more I think about it, though, using `build-list` with `(thunk v)` does the same thing with more indirection, so I think using `make-list` or `repeat` is probably faster for that reason. (Have you tried defining `repeat` in terms of `make-list`? I wonder if that speeds things up.)
Good call re: mutable inputs. My gut feeling is to discourage mutability in flows, so the easy thing here would be to say that inputs are expected to be immutable, or that they should at least be converted to immutable versions early in the flow. Do you think there might be cases where we would want to do more to support mutable inputs? Mutating interfaces like `vector-set!` don't seem to produce any output, fwiw. In any case, this at least warrants a cautionary note in the docs for now.
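To make the aliasing caveat concrete, here is a small illustration using `make-list` directly (the same applies to any approach that reuses a single value):

```racket
(require racket/list)

(define v (vector 1 2 3))         ; a mutable vector
(define copies (make-list 3 v))   ; three references to the *same* vector
(vector-set! (first copies) 0 'x)
(second copies)                   ; => '#(x 2 3) -- the other "copies" changed too
```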
> I would also recommend keeping the original clause with `[(_ ((~datum fanout) n:number))` for that compile-time nice-ness when the value is a literal, but that's just my opinion.
I'd gladly keep it if we can prove that it helps. In the current benchmark, it's actually slower than using `repeat`. I'll try it with larger N and see if that makes a difference. Any other cases to try where we would expect it to perform better?
> I am mildly surprised that using `(apply values (apply append (repeat n args)))` is faster than `repeat-values`, since they are equivalent. Probably the latter isn't being inlined…
Ah, nice, didn't realize `repeat-values` was in `private/util.rkt`. It averages something like 270ms on that benchmark, so you must be right re: inlining. I wonder why it doesn't do that.
FYI, latest results: the tail-recursive version performs the same as the current version, and with N=10 and N=100, the compile-time version still performs consistently slower (e.g. 400ms vs 600ms).
At the moment, `fanout` is rewritten to `-<` in that syntactic version, which maybe isn't the right way that dependency should happen, since `fanout` is simpler. So another thing to try here could be to rewrite `-<` in terms of `fanout`, and `fanout` in Racket; maybe in that case the compile-time version would perform better, since it would be more of an "apples to apples" comparison.
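For reference, here's a hedged sketch of what "`fanout` in Racket" could look like, generalizing the `(apply values (apply append (repeat n args)))` shape from earlier in the thread (the name `fanout-n` is made up for illustration, not the actual Qi internals):

```racket
(require racket/list)

;; hypothetical: a runtime fanout that duplicates its entire argument
;; list n times and returns the result as multiple values
(define ((fanout-n n) . args)
  (apply values (apply append (make-list n args))))

;; ((fanout-n 2) 1 2)  ; => (values 1 2 1 2)
```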
Btw, this discussion is relevant for the upcoming compiler work, since as part of that effort, it will be necessary to work out what the "Core Qi" language is, and all other forms will need to be rewritten to Core Qi so that the Qi compiler can focus on optimizing a small set of core forms.
Yeah, maybe rewriting `tee` in terms of `fanout` isn't the best thing. This version:
```racket
;; a splicing syntax class capturing the argument flows and how many there are
(define-splicing-syntax-class arguments
  #:attributes (args number)
  (pattern (~seq arg:expr ...)
           #:with args #'(arg ...)
           #:attr number (datum->syntax #'args (length (syntax->list #'args)))))

;; the clause rewriting -< / tee as a fanout followed by a relay
[(_ ((~or (~datum -<) (~datum tee)) args:arguments))
 #'(flow (~> (fanout args.number) (== . args)))]
```
is 4x slower than the current `-<` implementation (as reported by `$ racket profile/forms.rkt -f tee`).
And also, it doesn't handle the zero-args case: `(~> () (-< 1 0))` should be 1, 0. But with the fanout + relay based implementation, `(~> () (fanout 2) (== 1 0))` is an error, because the relay expects two inputs but receives none, since `(~> () (fanout 2) ...)` produces no values. It may be possible to modify the `relay` implementation to handle this case, but that could be tricky (e.g. wrt still providing useful error messages when insufficient inputs are provided), and given that it's slower anyway, it doesn't seem worth pursuing.
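Summarizing the zero-input behavior described above in code (assuming `(require qi)`; the expected results are as stated in this thread):

```racket
(~> () (-< 1 0))             ; => (values 1 0), with the current -< implementation
(~> () (fanout 2) (== 1 0))  ; error: (fanout 2) on zero inputs produces no
                             ; values, but the relay (==) expects two
```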
So, as it stands, the `fanout` based on `repeat` seems to be the best candidate. Let me know if you have any thoughts on any of this @benknoble, otherwise I'll plan to merge the version based on `repeat` soon.
Ah, the implementation of `-<` may shed some light on some perf problems, assuming your measurements don't also include compile-time?
Anyway, we can always swap out the implementation later, as long as this one works :)
Ok, I added a slightly more efficient compile-time implementation of `fanout` for a literally indicated N. It does this:

```racket
(flow (fanout 2))
;; expands to
(λ args (apply values (append args args)))
```

instead of

```racket
(flow (fanout 2))
;; expands to
(λ args (apply values (apply append (make-list 2 args))))
```

... and appears to be slightly faster (maybe 20ms out of ~250ms total). Hard to tell, since the variance in the result across runs exceeds what appears to be the difference between the mean performance of the two alternatives, if that makes sense.
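For anyone curious, here is a rough sketch (not the actual Qi source) of how a clause matching a literal N could generate that specialized expansion at compile time (the macro name and structure are illustrative only):

```racket
(require (for-syntax racket/base racket/list syntax/parse))

;; hypothetical standalone macro demonstrating the literal-N expansion
(define-syntax (fanout-literal stx)
  (syntax-parse stx
    [(_ n:number)
     ;; splice N copies of the argument list into a single append,
     ;; so no intermediate list of copies is built at runtime
     (with-syntax ([(args* ...) (make-list (syntax-e #'n) #'args)])
       #'(λ args (apply values (append args* ...))))]))

;; ((fanout-literal 2) 1 2)  ; => (values 1 2 1 2)
```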
> assuming your measurements don't also include compile-time?
Tbh I'm not entirely sure what the benchmarks are measuring. My assumption has been that the module is compiled (expanded + compiled) once, prior to running the benchmarks (thousands of times), and so the result of the benchmark should reflect any compile-time benefits, since the running code would already have been transformed. But now that I'm looking, I don't see any bytecode files in the `profile/` folder containing the module, so maybe it isn't being compiled at all?? It would have been expanded at least, prior to running, so maybe whether it's compiled or not is moot...
I also changed everything from `repeat` to `make-list`, since I realized that something like `(fanout n)` where n = -1 would cause it to go into an infinite loop. It turns out that the `exact-nonnegative-integer?` check in `make-list` is advisable to have here after all, so I've removed `repeat` from the code in favor of `make-list`.
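A quick illustration of why that check matters, using stock Racket `make-list`:

```racket
(require racket/list)

(make-list 2 'x)   ; => '(x x)
(make-list -1 'x)  ; raises a contract error (exact-nonnegative-integer? expected),
                   ; whereas a repeat that recurs until n reaches 0 would loop forever
```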
Interesting series of commits! Yeah, I would expect `raco make` or `raco setup` prior to running the benchmark to help exclude compile time, but you're right that I would also expect one up-front expand/compile prior to many runs of the benchmark.

Good call on `make-list` 👀
Running `raco make` results in the benchmarks taking longer. Take a look at this:
```
[samsara 13:38:43 1714] qi $ racket profile/forms.rkt -f fanout
fanout: 257 ms
[samsara 13:38:53 1715] qi $ raco make profile/forms.rkt
[samsara 15:22:51 1716] qi $ racket profile/forms.rkt -f fanout
fanout: 469 ms
[samsara 15:22:55 1717] qi $ cd profile/
[samsara 15:23:42 1718] profile $ ls
builtin.rkt      compiled        forms.rkt  util.rkt
competitive.rkt  forms-base.rkt  qi.rkt
[samsara 15:23:43 1719] profile $ ls compiled/
forms-base_rkt.dep  forms_rkt.dep  util_rkt.dep
forms-base_rkt.zo   forms_rkt.zo   util_rkt.zo
[samsara 15:23:46 1720] profile $ rm -rf compiled/
[samsara 15:23:58 1721] profile $ ls
builtin.rkt      forms-base.rkt  qi.rkt
competitive.rkt  forms.rkt       util.rkt
[samsara 15:24:03 1722] profile $ cd ..
[samsara 15:24:05 1723] qi $ racket profile/forms.rkt -f fanout
fanout: 255 ms
[samsara 15:24:17 1724] qi $
```
What do we make of it? 😕
Actually, never mind...
Running it as `make profile-forms` shows:
```
$ make profile-forms
...
fanout: 195 ms
...
```
But running it individually shows:
```
$ racket profile/forms.rkt -f fanout
fanout: 470 ms
```
Without compilation (i.e. after `rm -rf compiled/`), there is still a discrepancy between the two times, but it isn't as stark:
```
$ make profile-forms
...
fanout: 192 ms
...

$ racket profile/forms.rkt -f fanout
fanout: 282 ms
```
I did notice that the compiled version runs much faster in real time (e.g. < 1 sec for the compiled version vs several seconds for uncompiled). If the timing result has anything to do with this, you'd expect the result to be the opposite of the above (i.e. compiled should be faster), but obviously it's not as simple as that and there's something more subtle going on. Anyway, not anything we need to figure out right now 🙂
Alright, I'm going to merge this. Thanks for your input @benknoble!
Sorry for the delayed response—nice to see this land.
I can safely say that RE: profiling, I am out of my depth :)
Summary of Changes
This implements the fix for #32.

I tried six variations on the implementation (many suggested by @benknoble in #32) and profiled each using the command you can also get by running `make profile-selected-forms`.

Using `repeat` instead of `build-list` seemed to be the fastest -- faster even than the original syntactic version for numbers! I'm not sure why this might be, and I also haven't tried it on large N, where maybe the results would be different. Just putting it up for now for any initial feedback.

All of the variations are in the flow.rkt file and are currently commented out, except the one that's performing the fastest at the moment. For reference, this is the benchmark that is run with the above command. Just like all the other benchmarks currently in that file, it is rudimentary, just a "smoke" benchmark (as in "smoke test") to get some rough sense of performance (ideas for improvement welcome!).