exaloop / codon

A high-performance, zero-overhead, extensible Python compiler using LLVM
https://docs.exaloop.io/codon
Other
13.96k stars 498 forks source link

The @omp.ordered decorator stalling #444

Closed marioroy closed 11 months ago

marioroy commented 11 months ago

Branch: bugfixes-2023-08

I completed C and Codon parallel segmented practical-sieve demonstrations, but not yet published in Gist due to Codon failing. The following is a mini-example for reproducing the issue.

import openmp as omp
from sys import argv

_SIEVE_SIZE = 510510 * 30

@omp.ordered
def output_primes(low, high):
    print(f"{low}, {high}")

def practicalsieve(limit):
    @par(schedule='dynamic', chunk_size=1, ordered=True)
    for low in range(1, limit + 1, _SIEVE_SIZE):
        high = low + _SIEVE_SIZE - 1
        if high > limit: high = limit
        output_primes(low, high)

if __name__ == "__main__":
    limit = int(argv[1]) if len(argv) > 1 else 1000
    practicalsieve(limit)

Testing:

Try OMP_NUM_THREADS=4 if 3 is working. The stall issue is random.

$ codon build -release test.py
$ OMP_NUM_THREADS=1 ./test 1000000
1, 1000000
$ OMP_NUM_THREADS=2 ./test 1000000
1, 1000000
$ OMP_NUM_THREADS=3 ./test 1000000    <- stalled
$ OMP_NUM_THREADS=1 ./test 100000000
1, 15315300
15315301, 30630600
30630601, 45945900
45945901, 61261200
61261201, 76576500
76576501, 91891800
91891801, 100000000

OMP_NUM_THREADS=3 ./test 100000000
91891801, 100000000
1, 15315300                    <- not orderly
15315301, 30630600
76576501, 91891800
30630601, 45945900
45945901, 61261200
61261201, 76576500

$ OMP_NUM_THREADS=9 ./test 100000000
91891801, 100000000
15315301, 30630600
30630601, 45945900
76576501, 91891800
45945901, 61261200
1, 15315300
                               <- stalled
marioroy commented 11 months ago

Branch: bugfixes-2023-08

I tried static as well.

@par(schedule='static', chunk_size=1, ordered=True)
$ codon build -release test.py
$ OMP_NUM_THREADS=1 ./test 100000000  <- orderly :)
1, 15315300
15315301, 30630600
30630601, 45945900
45945901, 61261200
61261201, 76576500
76576501, 91891800
91891801, 100000000

$ OMP_NUM_THREADS=9 ./test 100000000  <- stalled

$ OMP_NUM_THREADS=9 ./test 100000000  <- one line output, stalled
91891801, 100000000
^C

$ OMP_NUM_THREADS=9 ./test 100000000  <- oh, not again...
1, 15315300
45945901, 61261200
76576501, 91891800
^C

$ OMP_NUM_THREADS=7 ./test 100000000  <- not orderly
91891801, 100000000
15315301, 30630600
30630601, 45945900
61261201, 76576500
1, 15315300
45945901, 61261200
76576501, 91891800
arshajii commented 11 months ago

Hi @marioroy, this is happening because Codon is not seeing range(1, limit + 1, _SIEVE_SIZE) as an "imperative" for-loop that it can apply loop-parallelism to, since the step size is a variable. Therefore it's actually using task-based parallelism which is incompatible with "ordered".

We should probably give a warning or something here, but you can fix it by making _SIEVE_SIZE a compile-time constant: _SIEVE_SIZE: Static[int] = 510510 * 30. Let me know if that works!

marioroy commented 11 months ago

Branch: bugfixes-2023-08

Thank you, @arshajii. Your suggestion is working 100%. I was at a lost here :)

Using my new demo, validating Codon output against C is passing 100% no matter the number of threads specified.

# C
$ OMP_NUM_THREADS=8 ./practicalsieve_mp1 1000000000 | md5sum
50847534 primes found.    <- count is sent to STDERR output
92c178cc5bb85e06366551c0ae7e18f6  -

# Codon
$ OMP_NUM_THREADS=8 ./practicalsieve_mp2 1000000000 | md5sum
50847534 primes found.
92c178cc5bb85e06366551c0ae7e18f6  -

I tried also limit 10 billion, generating 4.719 GB of data through the pipe.

# C
$ OMP_NUM_THREADS=8 ./practicalsieve_mp1 1000000000 | md5sum
455052511 primes found.
95ded65c9ddca18e1df966ba2a2b373a  -

# Codon
$ OMP_NUM_THREADS=8 ./practicalsieve_mp2 1000000000 | md5sum
455052511 primes found.
95ded65c9ddca18e1df966ba2a2b373a  -
marioroy commented 11 months ago

I wish for the ability to compute the step size for range. Doing so, changes to task scheduling; losing OpenMP ordered capability. For better performance, big numbers benefit using a bigger step size for range. Not fun is Codon changing to task scheduling behind the scene. Especially, when the step size is computed one time in the code.

So, I had to revert the logic, improving performance for big numbers.

Is there a way to dynamically set the step size for range and not lose OpenMP ordered capability? IMHO, I find this aspect of Codon limiting. Well, it has been great fun experimenting with Codon, until reaching this roadblock and range not supporting 64-bit unsigned integers.

arshajii commented 11 months ago

Hi @marioroy -- often you can achieve the same thing with constant step-size by just transforming the index. Example:

for i in range(a, b, c):
    print(i)

to

r = range(a, b, c)
for i in range(len(r)):
    print(r[i])

Similar to unsigned integers, as the loop can be over int but you can transform them as needed to u64 in the body. Let me know if this works!

marioroy commented 11 months ago

Thank you, @arshajii -- that is most helpful. Computing the step size is beneficial for the parallel prime-sieve application. The performance is restored for the Codon demonstration.

Regarding unsigned integers, I cannot use range. That means, having to create a generator supporting unsigned integers.

r = range(9223372036854775800, 9223372036854775800 + 8, 1)
print(len(r))
Python: 8
Codon: 0

Please consider my vote (+1) for a urange function in Codon.

r = range(a, b, c)   # int
r = urange(a, b, c)   # u64
r = range128(a, b, c)   # Int[128]
r = urange128(a, b, c)   # UInt[128]
marioroy commented 11 months ago

Thank you, @elisbyberi. Your suggestion works.

for i in range(17446744073000000000, 17446744073000000010):
    print(UInt[64](i))