Syncleus / aparapi

The New Official Aparapi: a framework for executing native Java and Scala code on the GPU.
http://aparapi.com
Apache License 2.0
464 stars 59 forks source link

Java Alternative Algorithm does not work for arbitrary NDRanges #142

Open Helios85 opened 5 years ago

Helios85 commented 5 years ago

Aparapi "falls back" to a "Java Alternative Algorithm" (JAA) in cases where the Java byte code cannot be translated to OpenCL code. The JAA, however, only works if the NDRange is an exact power of 2. The following example illustrates the problem:


public class JAATest {
    public static void main(String[] args) {
        int[] a = {1, 2, 3, 4, 5, 6, 7};

        new Kernel() {
            @Override public void run() {
                System.out.println(); // induces fallback
                a[getGlobalId()] = 0;
            }
        }.execute(a.length);

        System.out.println(java.util.Arrays.toString(a));
    }
}

Execution yields the following incorrect output:

[0, 0, 0, 0, 5, 6, 7]
NorbiPeti commented 5 years ago

It seems to be multiples of 4, not powers of 2. Anything below 4 is 0, below 16 is 12.

freemo commented 5 years ago

Hmmm that is interesting. I will have to investigate this issue again and see if that provides any new clues.

On Wed, Mar 6, 2019 at 1:04 PM NorbiPeti notifications@github.com wrote:

It seems to be multiples of 4, not powers of 2. Anything below 4 is 0, below 16 is 12.

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/Syncleus/aparapi/issues/142#issuecomment-470082868, or mute the thread https://github.com/notifications/unsubscribe-auth/AC5JAscx41TgYJOkMnzY71hvzbJYO-Bqks5vT67VgaJpZM4Y2SiB .

rkraneis commented 4 years ago

Same problem here. If NDRange is a multiple of available threads (12 in my case), all is fine. If not, then the results are (partially) incorrect. Tested by fiddling with ArrayTest (changing SIZE constant and forcing Java fall-back by adding a println of current id). Edit: validated a bit less hacky by explicitly setting THREAD_POOL as the preferred device using KernelManager.

rkraneis commented 4 years ago

SIZE = 6 (less than number of threads)

Running com.aparapi.runtime.ArrayTest
expected[0]: [0, 1, 2, 3, 4, 5]
target[0]: [99, 99, 99, 99, 99, 99]
expected[1]: [1, 2, 3, 4, 5, 6]
target[1]: [99, 99, 99, 99, 99, 99]
expected[2]: [2, 3, 4, 5, 6, 7]
target[2]: [99, 99, 99, 99, 99, 99]
expected[3]: [3, 4, 5, 6, 7, 8]
target[3]: [99, 99, 99, 99, 99, 99]
expected[4]: [4, 5, 6, 7, 8, 9]
target[4]: [99, 99, 99, 99, 99, 99]
expected[5]: [5, 6, 7, 8, 9, 10]
target[5]: [99, 99, 99, 99, 99, 99]

SIZE = 12 (number of threads)

Running com.aparapi.runtime.ArrayTest
expected[0]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
target[0]: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
expected[1]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
target[1]: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
expected[2]: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
target[2]: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]
expected[3]: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
target[3]: [3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]
expected[4]: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
target[4]: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
expected[5]: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
target[5]: [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
expected[6]: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
target[6]: [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17]
expected[7]: [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
target[7]: [7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18]
expected[8]: [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
target[8]: [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
expected[9]: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
target[9]: [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]
expected[10]: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
target[10]: [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21]
expected[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]
target[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22]

SIZE = 13 (number of threads + 1)

...
target[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]
expected[12]: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]
target[12]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]

SIZE = 23 ( 2 * number of threads - 1)

...
expected[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
target[11]: [11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]
expected[12]: [12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34]
target[12]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[13]: [13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]
target[13]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[14]: [14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36]
target[14]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[15]: [15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37]
target[15]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[16]: [16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38]
target[16]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[17]: [17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39]
target[17]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[18]: [18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40]
target[18]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[19]: [19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41]
target[19]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[20]: [20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42]
target[20]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[21]: [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43]
target[21]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]
expected[22]: [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44]
target[22]: [99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99, 99]

From this it seems as if the final non-full batch is not calculated at all (arrays are initialized with 99 in this test).

rkraneis commented 4 years ago

Setting SEQUENTIAL as the preferred device would work as a fallback ...

rkraneis commented 4 years ago

This seems to be boiling down to execute(int) using createRange which uses device information to then create a range global:23 local:(derived)12 ... When manually creating the range with a null device and yielding global:23 local:(derived)23 also the thread pool execution leads to correct results. So localSize0 will then be only 12 during thread pool creation in KernelRunner.

rkraneis commented 4 years ago

This optimization(?)

if (_device == JavaDevice.THREAD_POOL) {
  withoutLocal.setLocalSize_0(Runtime.getRuntime().availableProcessors());
  withoutLocal.setLocalIsDerived(true);
  return withoutLocal;
} else 

is only happening for create, but not for create2D nor for create3D. Removing it makes everything work. But there probably was originally a reason for adding it ...

Edit: it also contradicts the javadoc, which reads

The groupsize will be chosen such that _localWidth > 0 && _localWidth < MAX_GROUP_SIZE && _globalWidth % _localWidth==0 is true

which is what the thread scheduling relies upon.

CoreRasurae commented 3 years ago

@rkraneis Thanks for your work on finding the root cause. The reason for that code being there is that Java performance is likely to be best if the number of Java Thread Pool threads doesn't exceed the CPU available cores (eventually Hyper-Threading siblings are also included in the count which is undesirable for High Performance Computing workloads, anyway). So either this is removed, since it can violate the rule _globalWith % _localWidth == 0, or the logic must be modified so that it respects such logic. @freemo What is your opinion here?

CoreRasurae commented 3 years ago

@Helios85 Can you confirm @rkraneis theory for the root cause of this bug?

Helios85 commented 3 years ago

@CoreRasurae I can indeed confirm that the Java Alternative Algorithm only works if NDRange is a multiple of the available logical processors. I tested this with two machines, one with 4, the other with 8 logical processors. In the first case, NDRange had to be a multiple of 4 for the alternative algorithm to work correctly, in the second case NDRange had to be a multiple of 8. @rkraneis Congratulations for finding the cause of this problem!

CoreRasurae commented 3 years ago

@Helios85 Thanks for confirming this.

KaiGerdMueller commented 10 months ago

Tested whether the bug is still there on my device using the provided class JAATest. It printed [1, 2, 3, 4, 5, 6, 7], so works just fine. Tested on Ryzen with surely not 7 kernels. Seems like the bug may be gone, verification would be good tough.