marioroy closed this 9 months ago
The following is a tiny example. IMO, high CPU utilization from threads that are simply waiting their turn to enter a critical or ordered function is not okay. CPU utilization starts at 700%, drops to 600% after 4 seconds, to 500% after another 4 seconds, and so on until all threads have run. The same behavior is seen with @omp.ordered.
    import openmp as omp
    from time import sleep

    @omp.critical
    def task(i):
        print(i)
        sleep(4)

    @par(schedule='static', chunk_size=1, num_threads=8)
    for i in range(8):
        task(i)
Imagine the power consumption from threads simply waiting.
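The reported numbers match a simple model, sketched below in plain Python (not Codon). It assumes that the thread holding the lock sleeps at roughly 0% CPU, that every thread still waiting spins at 100%, and that threads which have already run are idle. The function name `expected_cpu_percent` is hypothetical and exists only for this illustration.

```python
def expected_cpu_percent(num_threads, hold_seconds):
    """Return (start_time_s, cpu_percent) for each phase of the run.

    Assumption: each thread still waiting for the lock spins at 100% CPU,
    while the lock holder sleeps and finished threads are idle.
    """
    phases = []
    for finished in range(num_threads):
        # One thread holds the lock; the rest that have not run yet are waiting.
        waiting = num_threads - finished - 1
        phases.append((finished * hold_seconds, waiting * 100))
    return phases


for start, cpu in expected_cpu_percent(8, 4):
    print(f"t={start:2d}s  ~{cpu}% CPU")
```

With 8 threads and a 4 second hold, the model gives 700% at the start and 600% after 4 seconds, which is the pattern described above.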
This appears to be an LLVM issue rather than a Codon one. I imported omp_init_lock, omp_destroy_lock, omp_set_lock, and omp_unset_lock and called them directly. The same behavior is observed.
    class struct_omp_lock:
        _lk: Ptr[cobj]

    from C import omp_init_lock(struct_omp_lock) -> None
    from C import omp_destroy_lock(struct_omp_lock) -> None
    from C import omp_set_lock(struct_omp_lock) -> None
    from C import omp_unset_lock(struct_omp_lock) -> None
    from time import sleep

    def task(i):
        print(i)
        sleep(4)

    writelock = struct_omp_lock()
    omp_init_lock(writelock)

    @par(schedule='static', chunk_size=1, num_threads=8)
    for i in range(8):
        omp_set_lock(writelock)
        task(i)
        omp_unset_lock(writelock)

    omp_destroy_lock(writelock)
Closing this issue. Not a Codon bug.
I completed my journey learning Codon.
One thing I observed is threads spinning in a busy CPU loop while waiting to acquire the lock. To reproduce, use the C and Codon demonstrations inside the examples folder. To diagnose, run primes1, 2 or primes3, 4 and monitor CPU usage in top. Also watch the wattage if you can. The wasted power is more than 100 watts on a big box, simply from threads waiting their turn.
This is merely a demonstration to make the needle in the haystack visible. Typically, just a few threads are enough to print primes. Be sure to direct output to /dev/null. Pressing Ctrl-C will end the process.
The C/OpenMP demonstration behaves correctly, with low CPU utilization and power consumption. Threads wait their turn to print primes, serially and in order. One should not see 6400% CPU utilization for this use case.
Simulation: Imagine a large data center using Codon to run in parallel on thousands of compute nodes. Why do threads spin in a busy CPU loop while waiting to acquire the lock? The C OpenMP demonstration shows that threads can wait without the busy CPU loop.
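The difference between the two waiting styles can be illustrated outside OpenMP. The sketch below is plain Python, not Codon, and uses `threading.Lock` as a stand-in for an OpenMP lock; it says nothing about how the LLVM runtime implements its locks. The helper names (`run`, `blocking_acquire`, `spinning_acquire`) are hypothetical. It compares process CPU time for threads that block in the OS against threads that poll the lock in a loop.

```python
import threading
import time


def run(acquire, num_threads=4, hold_seconds=0.1):
    """Serialize num_threads workers on one lock; return (wall, cpu) seconds."""
    lock = threading.Lock()

    def worker():
        acquire(lock)
        try:
            time.sleep(hold_seconds)  # stand-in for task(i)
        finally:
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    wall_start, cpu_start = time.perf_counter(), time.process_time()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - wall_start, time.process_time() - cpu_start


def blocking_acquire(lock):
    lock.acquire()  # the thread is parked until the lock becomes free


def spinning_acquire(lock):
    while not lock.acquire(blocking=False):
        pass  # busy loop: burns CPU while waiting


blocking_wall, blocking_cpu = run(blocking_acquire)
spinning_wall, spinning_cpu = run(spinning_acquire)
print(f"blocking: wall={blocking_wall:.2f}s cpu={blocking_cpu:.2f}s")
print(f"spinning: wall={spinning_wall:.2f}s cpu={spinning_cpu:.2f}s")
```

Both variants take about the same wall time, since the work is serialized either way, but the spinning variant accumulates far more CPU time. In standard CPython the global interpreter lock caps the spinning cost at roughly one core, whereas native OpenMP threads can each occupy a full core, which is the situation described above.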