LIHPC-Computational-Geometry / honeycomb

Combinatorial map implementation for meshing applications
https://lihpc-computational-geometry.github.io/honeycomb/
Apache License 2.0

enh: mitigate regression induced by the parallel implementation #215

Open imrn99 opened 2 weeks ago

imrn99 commented 2 weeks ago

I noticed a huge performance regression introduced by the changes of #201.

Observed changes

While we can expect some slowdown from synchronization overhead, the observed changes are surprisingly large:

bench         | after (fdb2ed95d59f6c3d832a150ad826a4e41a45c595) | before (b3e6c201d1cddc3fc097b795f0b21deaf5633476)
builder-time  | 45 ms                                            | 1.34 ms
grisubal-time | 144 ms                                           | 5 ms
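For reference, the slowdown factors implied by these timings (numbers taken from the table above) can be computed directly:

```rust
// Slowdown factor between two benchmark timings, in milliseconds.
fn slowdown(after_ms: f64, before_ms: f64) -> f64 {
    after_ms / before_ms
}

fn main() {
    // builder-time: 45 ms after vs 1.34 ms before -> roughly 34x slower
    println!("builder slowdown:  ~{:.0}x", slowdown(45.0, 1.34));
    // grisubal-time: 144 ms after vs 5 ms before -> roughly 29x slower
    println!("grisubal slowdown: ~{:.0}x", slowdown(144.0, 5.0));
}
```

Both benchmarks regress by a factor of roughly 30, far beyond what synchronization overhead alone would explain.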

Cause

I haven't fully identified the cause(s) yet, but my first impression is that the initialization of TVars is very costly; see the snippet below from the stm crate source code:

impl<T> TVar<T>
    where T: Any + Sync + Send + Clone
{
    /// Create a new `TVar`.
    pub fn new(val: T) -> TVar<T> {
        TVar {
            control_block: VarControlBlock::new(val), // this right here
            _marker: PhantomData,
        }
    }

    // ...
}

impl VarControlBlock {
    /// create a new empty `VarControlBlock`
    pub fn new<T>(val: T) -> Arc<VarControlBlock>
        where T: Any + Sync + Send
    {
        let ctrl = VarControlBlock {
            waiting_threads: Mutex::new(Vec::new()), 
            dead_threads: AtomicUsize::new(0),
            value: RwLock::new(Arc::new(val)),
        };
        Arc::new(ctrl)
    }

    // ...
}
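To illustrate why this init is expensive, here is a std-only stand-in (not the actual stm code) mirroring the `VarControlBlock` layout above: each transactional variable costs an outer `Arc` allocation, a `Mutex<Vec>`, lock metadata, and an inner `Arc` for the value, whereas an `AtomicUsize` is a single in-place word.

```rust
use std::sync::atomic::AtomicUsize;
use std::sync::{Arc, Mutex, RwLock};

// Minimal stand-in for stm's VarControlBlock: one heap-allocated
// control structure per transactional variable.
struct ControlBlock {
    waiting_threads: Mutex<Vec<usize>>, // placeholder for the waiter list
    dead_threads: AtomicUsize,
    value: RwLock<Arc<usize>>, // the value itself sits behind another Arc
}

// Mirrors TVar::new -> VarControlBlock::new: two heap allocations
// (outer Arc + inner Arc) plus lock initialization per variable.
fn new_tvar_like(val: usize) -> Arc<ControlBlock> {
    Arc::new(ControlBlock {
        waiting_threads: Mutex::new(Vec::new()),
        dead_threads: AtomicUsize::new(0),
        value: RwLock::new(Arc::new(val)),
    })
}

fn main() {
    const N: usize = 100_000;
    // N "TVar"-style inits: 2*N Arc allocations, all accessed via pointers.
    let tvars: Vec<_> = (0..N).map(new_tvar_like).collect();
    // N atomics: one flat Vec allocation, no per-element indirection.
    let atomics: Vec<AtomicUsize> = (0..N).map(AtomicUsize::new).collect();
    assert_eq!(tvars.len(), atomics.len());
}
```

With millions of darts, each carrying several transactional variables, the per-variable allocations add up quickly at map construction time.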
imrn99 commented 1 week ago

I'm running experiments with a fork of the stm crate. I tried replacing some internal data structures with no-alloc variants (using heapless & arrayvec). There seems to be little to no change in performance, although this needs confirmation with benchmarks on more consistent machines.

There is one structure that I haven't been able to replace though: a BTreeMap used to keep track of the variables accessed in a Transaction. There is no available equivalent at the moment (heapless's LinearMap lacks one method).
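For context, the pattern in question is a per-transaction log mapping each accessed variable to what was read or written; a simplified, std-only sketch (keying on a variable's address, as an illustration only, not the crate's actual types) looks like this:

```rust
use std::collections::BTreeMap;

// Simplified per-transaction log entry for one variable.
#[derive(Clone, Debug, PartialEq)]
enum LogVar {
    Read(usize),
    Write(usize),
}

// Toy transaction log: a BTreeMap keyed by variable address, mirroring
// the structure the comment above refers to. The ordering of the map
// gives a deterministic variable order when the log is walked at commit.
struct TxLog {
    vars: BTreeMap<usize, LogVar>,
}

impl TxLog {
    fn new() -> Self {
        TxLog { vars: BTreeMap::new() }
    }

    // Record a read unless the variable was already touched this tx.
    fn record_read(&mut self, addr: usize, val: usize) {
        self.vars.entry(addr).or_insert(LogVar::Read(val));
    }

    // A write always overrides the previous entry for that variable.
    fn record_write(&mut self, addr: usize, val: usize) {
        self.vars.insert(addr, LogVar::Write(val));
    }
}

fn main() {
    let mut log = TxLog::new();
    log.record_read(0x1000, 7);
    log.record_write(0x1000, 8); // same variable: write wins
    log.record_read(0x2000, 3);
    assert_eq!(log.vars.get(&0x1000), Some(&LogVar::Write(8)));
    assert_eq!(log.vars.len(), 2);
}
```

The sorted-key property is presumably part of why an unordered, fixed-capacity replacement such as a LinearMap is not a drop-in substitute.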

Essentially, TVar inits and operations take much more time (several orders of magnitude for inits) than atomics (even Atomic<T> from the atomic crate). This raises multiple questions.

imrn99 commented 1 week ago

perf report of the sequential impl, benching the builder binary:

[screenshot: perf report, sequential implementation (2024-11-06)]

perf report of the parallel impl:

[screenshot: perf report, parallel implementation (2024-11-06)]

It seems like the stm variables require that many more cycles to run: for example, the share of cycles dedicated to computing orbits drops from ~10% to less than 1%, i.e. the transactional machinery dominates the profile.

imrn99 commented 1 week ago

perf stat -d of the builder binary, seq impl:

 Performance counter stats for './target/profiling/builder 1024 1024':

            137,67 msec task-clock                       #    0,997 CPUs utilized
                 2      context-switches                 #   14,527 /sec
                 1      cpu-migrations                   #    7,264 /sec
            36 951      page-faults                      #  268,402 K/sec
       742 525 510      cycles                           #    5,394 GHz                         (70,98%)
       151 260 278      stalled-cycles-frontend          #   20,37% frontend cycles idle        (71,01%)
     2 331 338 369      instructions                     #    3,14  insn per cycle
                                                  #    0,06  stalled cycles per insn     (70,91%)
       418 462 962      branches                         #    3,040 G/sec                       (71,55%)
         5 624 721      branch-misses                    #    1,34% of all branches             (72,30%)
       699 688 406      L1-dcache-loads                  #    5,082 G/sec                       (72,01%)
         9 674 762      L1-dcache-load-misses            #    1,38% of all L1-dcache accesses   (71,24%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       0,138110185 seconds time elapsed

       0,085484000 seconds user
       0,051671000 seconds sys

parallel impl:

 Performance counter stats for 'taskset -c 0 ./target/profiling/builder 1024 1024':

          2 723,26 msec task-clock                       #    0,998 CPUs utilized
               137      context-switches                 #   50,307 /sec
                 1      cpu-migrations                   #    0,367 /sec
           712 863      page-faults                      #  261,768 K/sec
    14 732 320 056      cycles                           #    5,410 GHz                         (71,36%)
     2 820 127 554      stalled-cycles-frontend          #   19,14% frontend cycles idle        (71,43%)
    44 049 373 349      instructions                     #    2,99  insn per cycle
                                                  #    0,06  stalled cycles per insn     (71,43%)
     8 484 034 434      branches                         #    3,115 G/sec                       (71,46%)
       110 288 231      branch-misses                    #    1,30% of all branches             (71,49%)
    13 633 379 729      L1-dcache-loads                  #    5,006 G/sec                       (71,45%)
       227 469 230      L1-dcache-load-misses            #    1,67% of all L1-dcache accesses   (71,39%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       2,728805716 seconds time elapsed

       1,909099000 seconds user
       0,798234000 seconds sys
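Comparing the two runs counter by counter (numbers copied from the perf stat output above), the inflation is strikingly uniform at roughly 19-20x across cycles, instructions, and page faults, while IPC stays around 3; this points at per-access overhead (more work executed, more memory touched) rather than a cache or branch pathology:

```rust
// Ratio of a counter between the parallel and sequential runs.
fn ratio(parallel: f64, sequential: f64) -> f64 {
    parallel / sequential
}

fn main() {
    // cycles: 14 732 320 056 vs 742 525 510
    println!("cycles:       ~{:.1}x", ratio(14_732_320_056.0, 742_525_510.0));
    // instructions: 44 049 373 349 vs 2 331 338 369
    println!("instructions: ~{:.1}x", ratio(44_049_373_349.0, 2_331_338_369.0));
    // page-faults: 712 863 vs 36 951
    println!("page-faults:  ~{:.1}x", ratio(712_863.0, 36_951.0));
}
```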