cisco / ChezScheme

Chez Scheme
Apache License 2.0
6.89k stars 983 forks source link

linear slowdown with parallel execution? #786

Open namin opened 6 months ago

namin commented 6 months ago

Hi,

I defined pcall like Kent Dybvig on slide 10 of https://web.archive.org/web/20170626072601/http://www.ccs.neu.edu/events/wand-symposium/talks/mitchfest-09-dybvig.pdf and was surprised by how it seems to have a linear slowdown with the number of expressions in parallel.

(define-syntax pcall
  (syntax-rules ()
    [(_ e0 e1 ...)
     (let ([m (make-mutex)]
           [c (make-condition)]
           [ls (list (lambda () e0) (lambda () e1) ...)]
           [n (length '(e0 e1 ...))])
       (with-mutex m
         (do ([ls ls (cdr ls)])
             ((null? ls))
           (fork-thread
            (lambda ()
              (set-car! ls ((car ls)))
              (with-mutex m
                (set! n (- n 1))
                (when (= n 0) (condition-signal c))))))
         (condition-wait c m))
       (apply (car ls) (cdr ls)))]))

(define thunk (lambda () (let loop ((i 0)) (if (= i 3000000000) (lambda args 'done) (loop (+ i 1))))))

(printf "1\n")
(time (pcall (thunk)))

(printf "3\n")
(time (pcall (thunk) (thunk) (thunk)))

(printf "10\n")
(time (pcall (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk)))

(printf "20\n")
(time (pcall (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk) (thunk)))
Chez Scheme Version 9.6.4
Copyright 1984-2023 Cisco Systems, Inc.

1
(time (pcall (thunk)))
    no collections
    0.000057763s elapsed cpu time
    5.575509000s elapsed real time
    84336 bytes allocated
3
(time (pcall (thunk) ...))
    no collections
    0.000065180s elapsed cpu time
    6.022160000s elapsed real time
    252464 bytes allocated
10
(time (pcall (thunk) ...))
    no collections
    0.000395990s elapsed cpu time
    7.696845000s elapsed real time
    840912 bytes allocated
20
(time (pcall (thunk) ...))
    no collections
    0.000328870s elapsed cpu time
    11.988880000s elapsed real time
    1681552 bytes allocated

I am using a modified version which executes expressions in parallel and returns the first value that succeeds, stopping the others (each thunk is wrapped in an engine). https://github.com/metareflection/synthesis-scheme/blob/main/threads.scm It also suffers from this linear slowdown.

Is this linear slowdown for parallel execution expected?

Thanks. Best wishes, ~n

jltaylor-us commented 6 months ago

This is almost certainly not a linear "slowdown", but rather a polynomial growth in overhead. The overhead also appears to grow with the amount of work done in thunk (based on running some more tests with an extra zero in the loop condition), which I would expect if the work in thunk included some contention for a shared resource. That could easily happen in non-contrived code, since there are internal mutexes for allocation and garbage collection. In this case, however, it looks like the code inside thunk performs no allocation, and the results of time show that no GC occurred. I haven't quite managed to figure out whether the "trap check" happening in each thread that determines whether to fire a GC is a synchronization point if no allocation is being done. That would explain the extra overhead, but seems like it shouldn't be necessary when no allocation has happened.