drym-org / qi

An embeddable flow-oriented language.
59 stars 12 forks source link

Restore form performance #89

Closed countvajhula closed 1 year ago

countvajhula commented 1 year ago

Summary of Changes

Restore performance of Qi forms to pre-stratification levels.

See Return to baseline performance for more information.

Public Domain Dedication

(Why: The freely released, copyright-free work in this repository represents an investment in a better way of doing things called attribution-based economics. Attribution-based economics is based on the simple idea that we gain more by giving more, not by holding on to things that, truly, we could only create because we, in our turn, received from others. As it turns out, an economic system based on attribution -- where those who give more are more empowered -- is significantly more efficient than capitalism while also being stable and fair (unlike capitalism, on both counts), giving it transformative power to elevate the human condition and address the problems that face us today along with a host of others that have been intractable since the beginning. You can help make this a reality by releasing your work in the same way -- freely into the public domain in the simple hope of providing value. Learn more about attribution-based economics at drym.org, tell your friends, do your part.)

countvajhula commented 1 year ago

Notes from today's meeting: A Loophole in Qi Space. Esp @benknoble you may be interested in this "loophole" that Michael came up with as an alternative to some of the options we've talked about before ("restorative/peephole optimizations" vs inflating the core language).

benknoble commented 1 year ago

I attempted to do some benchmarking of none?'s implementation, and found that (a) for arities of less than 100k, the timing of any implementation I used was sub-millisecond; (b) in most cases, using for/and over not was faster than composing not with the for/or version of any?; and (c) the overhead of threads was too great to allow "racing" both versions to get the best of both worlds.

I tried with different distributions of the value #f and random numbers in [0,1) (called the "frequency") and with different orders of thread creation. I was actually quite surprised that for/or didn't win out significantly for lower frequencies of #f and for/and in higher frequencies. The for/or does clearly do far better than for/and at frequency 0, but only "dramatically" so for n = 100k. At higher orders of magnitude, they appear maybe roughly the same?

The bars show the average of 5 runs, with error bars the size of the standard deviation.

Note that the y-axis is on a logarithmic (base-10) transform; otherwise, you wouldn't be able to see all sizes on the same graph.

bench

countvajhula commented 1 year ago

👁️ 🍭 That's great data! We can aim to discuss it at this week's Qi meetup (happening on Thursday just FYI). Those charts seriously look amazing, I'm curious how you generated those. It could be useful to add the recipe to the Developer's Guide.

benknoble commented 1 year ago

Hand-typed from the original; forgive typos 😅

#lang racket

(require plot file/convertible math/statistics pict)

(define (write-pict-to-svg p f)
  (with-output-to-file f
    (thunk
      (write-bytes (convert p 'svg-bytes)))))

(define (not-o-any . args)
  (not (for/or ([arg (in-list args)])
         arg)))

(define (for/and-over-not . args)
  (for/and ([arg (in-list args)])
    (not arg)))

(define (race-threads1 . args)
  (define me (current-thread))
  (define t1 (thread (thunk (thread-send me (apply not-o-any args)))))
  (define t2 (thread (thunk (thread-send me (apply for/and-over-not args)))))
  (begin0 (thread-receive)
    (kill-thread t1)
    (kill-thread t2)))

(define (race-threads2 . args)
  (define me (current-thread))
  (define t2 (thread (thunk (thread-send me (apply for/and-over-not args)))))
  (define t1 (thread (thunk (thread-send me (apply not-o-any args)))))
  (begin0 (thread-receive)
    (kill-thread t1)
    (kill-thread t2)))

(define (generate n freq)
  (build-list n (λ (_i)
                  (define n (random))
                  (and (>= n freq) n))))

(define ns (list #e1e5 #e1e6 #e1e7))
(define freqs (list 0.0 0.3 0.7 1.0))
(define fs (list not-o-any for/and-over-not race-threads1 race-threads2))

(struct experiment [f n freq stats] #:transparent)
(struct stats [min mean max stddev] #:transparent)

(define experiments
  (map (match-lambda [(list n freq f) (experiment f n freq #f)])
       (cartesian-product ns freqs fs)))

(define n-trials 5)

(define/match (run-experiment _e)
  [((experiment f n freq #f))
   (define times
     (for/list ([_i n-trials])
       (define-values (_res _cpu real _garbage) (time-apply f (generate n freq)))
       real))
   (experiment f n freq (make-stats times))])

(define (make-stats times)
  (define μ (mean times))
  (define σ (stddev/mean μ times))
  (stats (apply min times) μ (apply max times) σ))

(define (experiments->renderers es skip x-min)
  (define (x i) (+ 0.5 x-min (* i skip)))
  (define label (object-name (experiment-f (first es))))
  (define es-by-n (sort es < #:key experiment-n))
  (define histogram
    (discrete-histogram
      (for/list ([e (in-list es-by-n)])
        (match-define (experiment _ n _ (stats _ mean _ _ )) e)
        (vector (format "n = ~a" n) mean))
      #:skip skip #:x-min x-min #:label (~a label) #:color (add1 x-min)))
  (define annotations
    (list
      (error-bars
        (for/list ([(e i) (in-indexed (in-list es-by-n))])
          (match-define (stats _ mean _ stddev) (experiment-stats e))
          (vector (x i) mean stddev)))
      #;
      (for/list ([(e i) (in-indexed (in-list es-by-n))])
        (match-define (stats min _ max _) (experiment-stats e))
        (list (point-label (vector (x i) min) "min")
              (point-label (vector (x i) max) "max")))))
  (list histogram annotations))

(module+ main
  (random-seed 0)
  (displayln "running experiments")
  (define results (time (map run-experiment experiments)))
  (displayln "plotting")
  (define results-by-freq (group-by experiment-freq results))
  (define results-by-freq-by-f (map (curry group-by experiment-f) results-by-freq))
  (define results-by-freq-by-f-sorted-by-freq
    (sort results-by-freq-by-f < #:key (compose1 experiment-freq caar)))

  (define skip (length fs))
  (define p
    (apply vc-append
           (for/list ([group (in-list results-by-freq-by-f-sorted-by-freq)])
             (define freq (experiment-freq (caar group)))
             (parameterize ([plot-y-transform (axis-transform-bound log-transform 0.01 +inf.0)]
                            [plot-y-ticks (log-ticks)])
               (plot-pict
                 (for/list ([(exps i) (in-indexed group)])
                   (experiments->renderers exps skip i))
                 #:title (format "Frequency: ~a" freq)
                 #:width (* 4 (plot-width))
                 #:y-min 0.001
                 #:x-label "n"
                 #:y-label "t")))))
  (show-pict p)
  (write-pict-to-svg p "bench.svg"))
countvajhula commented 1 year ago

@benknoble We reviewed the data and after playing around with the expansions of the two alternatives we seemed to conclude that they should perform almost identically in practice since they both short-circuit and have very similar expansions. Lmk if you think we've misunderstood though!

Notes from today

Also the code works perfectly and took almost ten minutes to run on my laptop. I'll add it to the wiki soon 👍

benknoble commented 1 year ago

We reviewed the data and after playing around with the expansions of the two alternatives we seemed to conclude that they should perform almost identically in practice since they both short-circuit and have very similar expansions. Lmk if you think we've misunderstood though!

No, I think that's right.

Notes from today

Thanks for the link. I think for partition, the easiest way to to make it non-core would be to have a compiler optimization that recognizes nested sieves and converts it to partition-values (so that partition -> nested sieve -> partition-values); this also optimizes human-written nested sieves, if there are any.

Also the code works perfectly and took almost ten minutes to run on my laptop. I'll add it to the wiki soon 👍

Great! Yeah, it isn't fast: it would be better to generate the data once, save it, and then chart whatever you want from there. I think it was roughly similar on my machine (I recall using 2 frequencies being 2–3 minutes).

countvajhula commented 1 year ago

Thanks for the link. I think for partition, the easiest way to to make it non-core would be to have a compiler optimization that recognizes nested sieves and converts it to partition-values (so that partition -> nested sieve -> partition-values); this also optimizes human-written nested sieves, if there are any.

Great idea! We can try it in the first optimizations PR.

countvajhula commented 1 year ago

This PR is ready -- any input welcome! As it will be reviewed at the end anyway as part of merging the integration branch into main, there's no pressing need for review at this stage and we can just merge it in a few days in any case.

countvajhula commented 1 year ago

Benchmarking recipe added to the wiki 🙂