exaloop / codon

A high-performance, zero-overhead, extensible Python compiler using LLVM
https://docs.exaloop.io/codon

Excessive power-consumption for threads waiting to acquire mutex #456

Closed marioroy closed 9 months ago

marioroy commented 10 months ago

I completed my journey learning Codon.

I observed threads spinning in a busy CPU loop while waiting to acquire a lock. To reproduce, use the C and Codon demonstrations in the examples folder. To diagnose, run primes1 and primes2, or primes3 and primes4, and monitor top. Watch the wattage too, if you can. On a big box, more than 100 watts are wasted simply on threads waiting their turn.

This is merely a demonstration to make the needle in the haystack visible. Typically, just a few threads are enough to print primes. Be sure to direct output to /dev/null. Pressing Ctrl-C will end the process.

# C demo  Top reports 176%
OMP_NUM_THREADS=64 primes1 1e10 -p >/dev/null   # 207 watts

# Codon demo   Top reports 6374%
OMP_NUM_THREADS=64 primes2 1e10 -p >/dev/null   # 315 watts :(

The C/OpenMP demonstration behaves correctly with regard to CPU utilization and power consumption: both stay low. Threads wait their turn to print primes, serially and in order. One should not see 6400% CPU utilization for this use case.
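As a complement to watching top, the average CPU utilization of a run can be estimated by dividing the CPU time a child process consumed by its wall-clock time. The sketch below is plain CPython, not Codon, and assumes a Unix-like system where `os.times()` reports child CPU time. The helper name `average_cpu_percent` and the sleeping stand-in command are hypothetical, chosen only for illustration, and the command is assumed to exit on its own.

```python
import os
import subprocess
import sys
import time


def average_cpu_percent(command):
    """Run `command` to completion; return (wall_seconds, average CPU percent)."""
    before = os.times()
    start = time.perf_counter()
    # Discard stdout, mirroring the ">/dev/null" redirection used in the demos.
    subprocess.run(command, stdout=subprocess.DEVNULL, check=False)
    wall = time.perf_counter() - start
    after = os.times()
    # children_user / children_system accumulate the CPU time of child
    # processes that have been waited for (Unix only).
    cpu = (after.children_user - before.children_user) + (
        after.children_system - before.children_system
    )
    return wall, 100.0 * cpu / wall


if __name__ == "__main__":
    # Stand-in child that only sleeps, so it should report a low percentage.
    stand_in = [sys.executable, "-c", "import time; time.sleep(0.5)"]
    wall, percent = average_cpu_percent(stand_in)
    print(f"wall {wall:.2f}s, average CPU {percent:.0f}%")
```

A value near 100% times the thread count would suggest that waiting threads are spinning, while a value near 100% or 200% would be closer to what top reports for the C demo. To measure one of the demos, the stand-in command would be replaced by the primes binary and its arguments, with OMP_NUM_THREADS set in the environment beforehand.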

Simulation: imagine a large data center using Codon to run in parallel on thousands of compute nodes. Why do threads spin in a busy CPU loop while waiting to acquire the lock? The C OpenMP demonstration shows that threads can wait without a busy CPU loop.
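For contrast, here is a minimal sketch of threads waiting on a blocking lock. It uses CPython's `threading.Lock`, not Codon or OpenMP, on the assumption that a blocked `acquire` parks the thread in the operating system instead of spinning. The names `task`, `run`, `HOLD_SECONDS`, and `NUM_THREADS` are illustrative only.

```python
import threading
import time

HOLD_SECONDS = 0.2   # how long each thread holds the lock
NUM_THREADS = 4

lock = threading.Lock()


def task(i, order):
    # Only one thread at a time gets past this point; the others block.
    with lock:
        order.append(i)
        time.sleep(HOLD_SECONDS)


def run():
    """Run all threads; return (order, cpu_seconds, wall_seconds)."""
    order = []
    cpu_start = time.process_time()    # CPU time of the whole process
    wall_start = time.perf_counter()
    threads = [
        threading.Thread(target=task, args=(i, order))
        for i in range(NUM_THREADS)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return order, cpu, wall


if __name__ == "__main__":
    order, cpu, wall = run()
    print(f"order {order}, cpu {cpu:.3f}s, wall {wall:.3f}s")
```

If the waiting threads truly block, the process CPU time should stay a small fraction of the wall-clock time, even though several threads wait the whole while. With a spinning lock, the CPU time would instead grow with the number of waiting threads.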

marioroy commented 10 months ago

The following is a tiny example. IMO, seeing high CPU utilization for threads simply waiting their turn to enter a critical or ordered function is not okay. The CPU utilization is 700% initially, 600% after 4 seconds, 500% after another 4 seconds, and so on until all threads have run. The same behavior is seen with @omp.ordered.

import openmp as omp
from time import sleep

@omp.critical
def task(i):
    print(i)
    sleep(4)

@par(schedule='static', chunk_size=1, num_threads=8)
for i in range(8):
    task(i)

Imagine the power consumption from threads simply waiting.
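The utilization pattern reported for the tiny example can be reproduced with simple arithmetic. The sketch below is plain Python and assumes that every waiting thread spins at 100% of one core while the lock holder sleeps at roughly 0%. The function name `waiting_cpu_timeline` is hypothetical.

```python
def waiting_cpu_timeline(num_threads=8, hold_seconds=4):
    """Return a list of (start_time_seconds, cpu_percent), one per interval."""
    timeline = []
    for finished in range(num_threads):
        # One thread holds the lock and sleeps; `finished` threads are done;
        # the rest are assumed to spin at 100% each.
        waiting = num_threads - 1 - finished
        timeline.append((finished * hold_seconds, waiting * 100))
    return timeline


def wasted_core_seconds(timeline, hold_seconds=4):
    """Total core-seconds spent spinning across all intervals."""
    return sum(percent / 100 * hold_seconds for _, percent in timeline)


timeline = waiting_cpu_timeline()
for start, percent in timeline:
    print(f"t={start:2d}s  {percent}%")
print(f"core-seconds spent spinning: {wasted_core_seconds(timeline):.0f}")
```

Under these assumptions the timeline starts at 700%, drops by 100% every 4 seconds, and adds up to 112 core-seconds of spinning for a 32-second run that does no work beyond printing eight numbers.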

marioroy commented 10 months ago

This issue appears to be in LLVM. I imported omp_init_lock, omp_destroy_lock, omp_set_lock, and omp_unset_lock and called them directly. The same behavior is observed.

class struct_omp_lock:
    _lk: Ptr[cobj]

from C import omp_init_lock(struct_omp_lock) -> None
from C import omp_destroy_lock(struct_omp_lock) -> None
from C import omp_set_lock(struct_omp_lock) -> None
from C import omp_unset_lock(struct_omp_lock) -> None

from time import sleep

def task(i):
    print(i)
    sleep(4)

writelock = struct_omp_lock()
omp_init_lock(writelock)

@par(schedule='static', chunk_size=1, num_threads=8)
for i in range(8):
    omp_set_lock(writelock)
    task(i)
    omp_unset_lock(writelock)

omp_destroy_lock(writelock)

marioroy commented 9 months ago

Closing this issue. Not a Codon bug.