Syncleus / aparapi

The New Official Aparapi: a framework for executing native Java and Scala code on the GPU.
http://aparapi.com
Apache License 2.0

Huge AtomicInteger arrays #138

Open gpeabody opened 6 years ago

gpeabody commented 6 years ago

@CoreRasurae I use two AtomicInteger arrays of 2 million elements each. Preparing and extracting these arrays takes about 200 milliseconds. Is it possible to avoid the preparation and extraction on each execution of the kernel? I do not need to transfer these arrays to the host; I only use them during execution.

@gpeabody Please avoid posting questions on closed topics; instead, open a new issue with your question. Anyway, regarding your question, and without having any clue about how you implemented the kernel, I would suggest placing the AtomicInteger arrays in local memory; that way they are only initialized inside the kernel and there is no transfer overhead.

gpeabody commented 6 years ago

@CoreRasurae What is the maximum size of the array I can put in local memory? I've tried, but 2 million entries cause a memory overflow error. I need to use global GPU memory (not host memory), since all threads work with these arrays, not just the threads within one workgroup.

CoreRasurae commented 6 years ago

@gpeabody The maximum size of the array is limited by the local memory size, which typically ranges from 32 KiB to 48 KiB depending on the GPU. So 2 million entries certainly won't fit; however, they're supposed to be split across the workgroups. If you need multiple workgroups with inter-workgroup atomic accesses then yes, you can only use global-memory-based atomics.

In that case there is no way to avoid the data transfers, but you can reduce them, either by using explicit data transfers to and from the kernel, or by preventing Aparapi from detecting that you are writing to the atomic arrays. For the first case, you can put Aparapi into explicit transfer mode with Kernel.setExplicit(...); however, I believe Aparapi is missing some glue code to perform the transfer of AtomicInteger arrays with Kernel.put(...) and Kernel.get(...). For the second case, you can use helper methods to access the atomics without having any direct reference inside the Kernel.run() method, that way Aparapi will believe that you are not changing the values in the array, and thus they are input only. You will always need to transfer the initial values to the kernel, but you don't need to transfer the results back from the kernel, which can save execution time.
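The size limits mentioned above can be checked with simple arithmetic. The sketch below is plain Java, not Aparapi code, and it assumes each atomic counter occupies one 4-byte int on the device and that the whole local memory is available to the array, which is optimistic. The class and method names are made up for illustration.

```java
/** Back-of-the-envelope check of how many int-sized counters fit in OpenCL local memory. */
final class LocalMemoryBudget {
    // Assumption: one device-side atomic counter is a 4-byte int.
    static final int BYTES_PER_INT = Integer.BYTES;

    /** Number of int entries that fit in a local memory of the given size. */
    static long maxIntEntries(long localMemBytes) {
        return localMemBytes / BYTES_PER_INT;
    }

    /** Bytes needed to hold the given number of int entries. */
    static long bytesFor(long entries) {
        return entries * BYTES_PER_INT;
    }

    public static void main(String[] args) {
        long requested = 2_000_000L;
        for (long kib : new long[] {32, 48}) {
            long capacity = maxIntEntries(kib * 1024);
            System.out.printf("%d KiB local memory -> %d int entries (enough for %d? %b)%n",
                    kib, capacity, requested, requested <= capacity);
        }
        System.out.printf("%d entries need %d bytes%n", requested, bytesFor(requested));
    }
}
```

Under these assumptions a 32 KiB local memory holds 8192 ints, which is consistent with the 8192-entry limit reported in the next comment, while 2 million counters need roughly 8 MB and therefore only fit in global memory.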

gpeabody commented 6 years ago

The maximum size of the array in my case is 8192:

@Local AtomicInteger[] atomics = new AtomicInteger[8192];

But this is not a solution, since an array in local memory is only shared within a single workgroup.

I already use Kernel.setExplicit, but it makes absolutely no difference in the case of atomic arrays.

For the second case, you can use helper methods to access the atomics without having any direct reference inside the Kernel.run() method, that way Aparapi will believe that you are not changing the values in the array, and thus they are input only. You will always need to transfer the initial values to the kernel, but you don't need to transfer the results back from the kernel, which can save execution time.

I definitely need to try this. But where can I find documentation and examples?

CoreRasurae commented 6 years ago

@gpeabody Essentially, what can be achieved is importing the initial values from the global memory while avoiding the transfer of the AtomicInteger array back from the kernel to Java at the end of kernel execution. I have never tried to use global memory without having Aparapi transfer the initial data; I believe it is not supported. It could be feasible to initialize the memory inside the kernel by using atomic OpenCL operations to set the atomics' initial values in a controlled manner. You can try if it works, by using Kernel.setExplicit(true); while not calling kernel.put(...) or kernel.get(...) for the AtomicIntegers.

gpeabody commented 6 years ago

You can try if it works, by using Kernel.setExplicit(true); while not calling kernel.put(...) or kernel.get(...) for the AtomicIntegers.

That's exactly what I'm doing.

What about "helper methods to access the atomics without having any direct reference inside the Kernel.run() method"?

CoreRasurae commented 6 years ago

@gpeabody It is strange that kernel.setExplicit(true) makes no difference... If you run an Aparapi kernel that depends on non-atomic arrays only and call kernel.setExplicit(true), but never call kernel.put(...) or kernel.get(...), does it still produce correct results? Unfortunately there is no documentation for that, but I can provide you with a simple example.

int resultsArr[] = new int[200];
int atomicsArr[] = new int[200];

public int atomicUpdate(int arr[], int index) {
     //Other logic could be included to avoid having to call atomicUpdate, just to update an atomic
     return atomicInc(arr[index]);
}

public void run() {
    resultsArr[0] = 0; //Ensure resultsArr is touched for Aparapi to transfer the results back  
    atomicUpdate(atomicsArr, 1); //Modifier accesses to atomicArr are in helper function, so that Aparapi will believe atomicsArr is not modified

}

gpeabody commented 6 years ago

This is my test

import com.aparapi.Range;
import java.util.concurrent.atomic.AtomicInteger;

public class test2 {
    public static void main( String[] args )
    {
        final int size = 10000000;

        final float[] a = new float[size];
        final float[] b = new float[size];

        for (int i = 0; i < size; i++) {
            a[i] = (float) (Math.random() * 100);
            b[i] = (float) (Math.random() * 100);
        }

        final float[] sum = new float[size];

        test2Kernel kernel = new test2Kernel(size, a, b);
        Range range = Range.create(size);

        kernel.setExplicit(true);
        kernel.put(a);
        kernel.put(b);

        for (int i = 0; i < 10; i++) {
            long t1 = System.currentTimeMillis();
            kernel.execute(range);
            long t2 = System.currentTimeMillis();

            System.out.println(t2-t1 + " : " + kernel.getExecutionTime());
        }

//        kernel.get(sum);
        AtomicInteger[] counters = kernel.getAtomics();

        System.out.println("Counter = " + String.valueOf(counters[0]));

        kernel.dispose();

    }
}

import com.aparapi.Kernel;
import java.util.concurrent.atomic.AtomicInteger;

public class test2Kernel extends Kernel {
    final int size;
    final float[] a;
    final float[] b;
    float[] sum;

    AtomicInteger[] atomics = new AtomicInteger[2000000];

    public test2Kernel(int _size, float[] _a, float[] _b) {
        size = _size;
        a = _a;
        b = _b;
        sum = new float[size];

        for (int i = 0; i < atomics.length; i++) {
            atomics[i] = new AtomicInteger(0);
        }
    }

    public AtomicInteger[] getAtomics() {
        return atomics;
    }

    @Override public void run() {
        int gid = getGlobalId();
        sum[gid] = a[gid] + b[gid];

        atomicInc(atomics[0]);
    }
}

gpeabody commented 6 years ago

return atomicInc(arr[index]);

atomicInc (java.util.concurrent.atomic.AtomicInteger) in Kernel cannot be applied to (int)

CoreRasurae commented 6 years ago

@gpeabody Yes, sorry, I didn't test my example. What you need is:

int resultsArr[] = new int[200];
AtomicInteger[] atomicsArr = new AtomicInteger[200];

public int atomicUpdate(AtomicInteger arr[], int index) {
     //Other logic could be included to avoid having to call atomicUpdate, just to update an atomic
     return atomicInc(arr[index]);
}

CoreRasurae commented 6 years ago

@gpeabody Regarding your current test kernel code, you will need to remove the getAtomics() method from the kernel, along with all calls to it.

gpeabody commented 6 years ago

    @Override public void run() {
        int gid = getGlobalId();
        sum[gid] = a[gid] + b[gid];

        atomicUpdate(atomics, 1);
    }

    public int atomicUpdate(AtomicInteger arr[], int index) {
        //Other logic could be included to avoid having to call atomicUpdate, just to update an atomic
        return atomicInc(arr[index]);
    }

Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField
    at com.aparapi.internal.writer.BlockWriter.getUltimateInstanceFieldAccess(BlockWriter.java:806)
    at com.aparapi.internal.writer.BlockWriter.isMultiDimensionalArray(BlockWriter.java:791)
    at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:464)
    at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
    at com.aparapi.internal.writer.KernelWriter.writeMethod(KernelWriter.java:306)
    at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:647)
    at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
    at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:638)
    at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
    at com.aparapi.internal.writer.BlockWriter.writeSequence(BlockWriter.java:299)
    at com.aparapi.internal.writer.BlockWriter.writeBlock(BlockWriter.java:323)
    at com.aparapi.internal.writer.BlockWriter.writeMethodBody(BlockWriter.java:848)
    at com.aparapi.internal.writer.KernelWriter.write(KernelWriter.java:697)
    at com.aparapi.internal.writer.KernelWriter.writeToString(KernelWriter.java:792)
    at com.aparapi.internal.kernel.KernelRunner.executeInternalInner(KernelRunner.java:1503)
    at com.aparapi.internal.kernel.KernelRunner.executeInternalOuter(KernelRunner.java:1351)
    at com.aparapi.internal.kernel.KernelRunner.execute(KernelRunner.java:1342)
    at com.aparapi.Kernel.execute(Kernel.java:2856)
    at com.aparapi.Kernel.execute(Kernel.java:2813)
    at com.aparapi.Kernel.execute(Kernel.java:2753)
    at test2.main(test2.java:29)

Process finished with exit code 1

CoreRasurae commented 6 years ago

@gpeabody What Aparapi version are you using?

gpeabody commented 6 years ago

Created by Apache Maven 3.3.9

version=1.8.0
groupId=com.aparapi
artifactId=aparapi

CoreRasurae commented 6 years ago

@gpeabody Can you try with the current git code in master branch?

CoreRasurae commented 6 years ago

@gpeabody In the master branch there is new code to deal with Java bytecode analysis. Also, if you update to at least Aparapi 1.9.0, you can get extra execution performance on discrete GPUs by calling OpenCLDevice.setSharedMemory(false).

gpeabody commented 6 years ago

1.9.0 the same...

Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField
    at com.aparapi.internal.writer.BlockWriter.getUltimateInstanceFieldAccess(BlockWriter.java:806)
    at com.aparapi.internal.writer.BlockWriter.isMultiDimensionalArray(BlockWriter.java:791)
    at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:464)
    at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
    at com.aparapi.internal.writer.KernelWriter.writeMethod(KernelWriter.java:306)
    at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:647)
    at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
    at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:638)
    at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
    at com.aparapi.internal.writer.BlockWriter.writeSequence(BlockWriter.java:299)
    at com.aparapi.internal.writer.BlockWriter.writeBlock(BlockWriter.java:323)
    at com.aparapi.internal.writer.BlockWriter.writeMethodBody(BlockWriter.java:848)
    at com.aparapi.internal.writer.KernelWriter.write(KernelWriter.java:697)
    at com.aparapi.internal.writer.KernelWriter.writeToString(KernelWriter.java:792)
    at com.aparapi.internal.kernel.KernelRunner.executeInternalInner(KernelRunner.java:1503)
    at com.aparapi.internal.kernel.KernelRunner.executeInternalOuter(KernelRunner.java:1351)
    at com.aparapi.internal.kernel.KernelRunner.execute(KernelRunner.java:1342)
    at com.aparapi.Kernel.execute(Kernel.java:2857)
    at com.aparapi.Kernel.execute(Kernel.java:2814)
    at com.aparapi.Kernel.execute(Kernel.java:2754)
    at test2.main(test2.java:29)

Process finished with exit code 1

gpeabody commented 6 years ago

import com.aparapi.Kernel;
import java.util.concurrent.atomic.AtomicInteger;

public class test2Kernel extends Kernel {
    final int size;
    final float[] a;
    final float[] b;
    float[] sum;

    AtomicInteger[] atomics = new AtomicInteger[2000000];

    public test2Kernel(int _size, float[] _a, float[] _b) {
        size = _size;
        a = _a;
        b = _b;
        sum = new float[size];

        for (int i = 0; i < atomics.length; i++) {
            atomics[i] = new AtomicInteger(0);
        }
    }

//    public AtomicInteger[] getAtomics() {
//        return atomics;
//    }

    @Override public void run() {
        int gid = getGlobalId();
        sum[gid] = a[gid] + b[gid];

        atomicUpdate(atomics, 1);
    }

    public int atomicUpdate(AtomicInteger arr[], int index) {
        //Other logic could be included to avoid having to call atomicUpdate, just to update an atomic
        return atomicInc(arr[index]);
    }
}

CoreRasurae commented 6 years ago

@gpeabody Sure, Aparapi 1.9.0 will hit the same issue with that Java bytecode, but you can improve the execution time of your original kernel just by using OpenCLDevice.setSharedMemory(false) on discrete GPUs.

gpeabody commented 6 years ago

How can I get 1.10.0? Does this error not occur in 1.10.0?

CoreRasurae commented 6 years ago

@gpeabody Secondly, if you try the code in the git master branch and compile Aparapi, you will likely be able to run that modified kernel:

git clone https://github.com/Syncleus/aparapi -b master --single-branch

Then build with mvn package.

CoreRasurae commented 6 years ago

@gpeabody 1.10.0 is yet to be released. @freemo Is there a release date for this?

gpeabody commented 6 years ago

I cloned the 1.10.0 JAR, but Maven does not recognize it. What should I change?

    <dependencies>
        <dependency>
            <groupId>com.aparapi</groupId>
            <artifactId>aparapi</artifactId>
            <version>1.10.0</version>
        </dependency>
    </dependencies>

CoreRasurae commented 6 years ago

@gpeabody No, it won't recognize Aparapi like that. Aparapi 1.10.0 is not yet released, so it isn't in the Maven Central repository. You will need to compile Aparapi 1.10.0-SNAPSHOT from git and then add the JARs to your local Maven repository. Then you will need to change your pom.xml to point to 1.10.0-SNAPSHOT.

See the Maven documentation on how to manually add a JAR to your local Maven repository (mvn install:install-file ...).

gpeabody commented 6 years ago

Exception in thread "main" java.lang.NoClassDefFoundError: com/aparapi/natives/NativeLoader
    at com.aparapi.internal.opencl.OpenCLLoader.(OpenCLLoader.java:43)
    at com.aparapi.internal.opencl.OpenCLPlatform.getOpenCLPlatforms(OpenCLPlatform.java:73)
    at com.aparapi.device.OpenCLDevice.listDevices(OpenCLDevice.java:517)
    at com.aparapi.internal.kernel.KernelManager.createDefaultPreferredDevices(KernelManager.java:212)
    at com.aparapi.internal.kernel.KernelManager.createDefaultPreferences(KernelManager.java:187)
    at com.aparapi.internal.kernel.KernelManager.setup(KernelManager.java:55)
    at com.aparapi.internal.kernel.KernelManager.(KernelManager.java:46)
    at com.aparapi.internal.kernel.KernelManager.(KernelManager.java:38)
    at com.aparapi.internal.kernel.KernelRunner.(KernelRunner.java:188)
    at com.aparapi.Kernel.prepareKernelRunner(Kernel.java:2537)
    at com.aparapi.Kernel.setExplicit(Kernel.java:3162)
    at test2.main(test2.java:23)
Caused by: java.lang.ClassNotFoundException: com.aparapi.natives.NativeLoader
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 12 more

Process finished with exit code 1

gpeabody commented 6 years ago

There really is no NativeLoader in your repo:

import com.aparapi.natives.NativeLoader;

1.9.0 does not have NativeLoader either, but it does not produce this error.

CoreRasurae commented 6 years ago

@gpeabody You will also need to include aparapi-jni (https://github.com/Syncleus/aparapi-jni) in your pom.xml. You can pull it in with this:

<!-- https://mvnrepository.com/artifact/com.aparapi/aparapi-jni -->
<dependency>
    <groupId>com.aparapi</groupId>
    <artifactId>aparapi-jni</artifactId>
    <version>1.4.1</version>
</dependency>

gpeabody commented 6 years ago

OK, thank you very much for your patience! We are back where we started :)

Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField
    at com.aparapi.internal.writer.BlockWriter.getUltimateInstanceFieldAccess(BlockWriter.java:808)
    at com.aparapi.internal.writer.BlockWriter.isMultiDimensionalArray(BlockWriter.java:793)
    at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:464)
    at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
    at com.aparapi.internal.writer.KernelWriter.writeMethod(KernelWriter.java:306)
    at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:647)
    at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
    at com.aparapi.internal.writer.BlockWriter.writeInstruction(BlockWriter.java:638)
    at com.aparapi.internal.writer.KernelWriter.writeInstruction(KernelWriter.java:780)
    at com.aparapi.internal.writer.BlockWriter.writeSequence(BlockWriter.java:299)
    at com.aparapi.internal.writer.BlockWriter.writeBlock(BlockWriter.java:323)
    at com.aparapi.internal.writer.BlockWriter.writeMethodBody(BlockWriter.java:850)
    at com.aparapi.internal.writer.KernelWriter.write(KernelWriter.java:697)
    at com.aparapi.internal.writer.KernelWriter.writeToString(KernelWriter.java:792)
    at com.aparapi.internal.kernel.KernelRunner.executeInternalInner(KernelRunner.java:1535)
    at com.aparapi.internal.kernel.KernelRunner.executeInternalOuter(KernelRunner.java:1383)
    at com.aparapi.internal.kernel.KernelRunner.execute(KernelRunner.java:1374)
    at com.aparapi.Kernel.execute(Kernel.java:2897)
    at com.aparapi.Kernel.execute(Kernel.java:2854)
    at com.aparapi.Kernel.execute(Kernel.java:2794)
    at test2.main(test2.java:29)

Process finished with exit code 1

CoreRasurae commented 6 years ago

@gpeabody Ok... We'll have to fix that...

CoreRasurae commented 6 years ago

@gpeabody Meanwhile you can try your original version with Aparapi 1.10-SNAPSHOT and OpenCLDevice.setSharedMemory(false). It may improve the performance significantly in some cases.

gpeabody commented 6 years ago

Does the code you showed above work correctly on your device?

int resultsArr[] = new int[200];
AtomicInteger[] atomicsArr = new AtomicInteger[200];

public int atomicUpdate(AtomicInteger arr[], int index) {
     //Other logic could be included to avoid having to call atomicUpdate, just to update an atomic
     return atomicInc(arr[index]);
}

public void run() {
    resultsArr[0] = 0; //Ensure resultsArr is touched for Aparapi to transfer the results back  
    atomicUpdate(atomicsArr, 1); //Modifier accesses to atomicArr are in helper function, so that Aparapi will believe atomicsArr is not modified
}

I'll try OpenCLDevice.setSharedMemory(false), but I don't think it will greatly improve the situation. My kernel execution takes 20 milliseconds without AtomicInteger[] and 200 milliseconds with it.

CoreRasurae commented 6 years ago

@gpeabody I have similar code to that, but without the atomics, to reduce data transfers. Regarding setSharedMemory(...), atomics will always delay execution a bit, but more importantly, discrete GPUs don't share their memory with the host, so global memory accesses will have to be made through the PCIe bus, which greatly increases the latency and thus slows down kernel execution. All your AtomicInteger arrays are in global memory.

gpeabody commented 6 years ago

Why do you use this code for non-atomic data if you can use setExplicit?

I tried to use setSharedMemory but got "non-static method cannot be referenced from a static context". It seems it must be called on an instance... How do I use it correctly?

CoreRasurae commented 6 years ago

@gpeabody It is true that one can use setExplicit(...). It is also normal to want to structure the code better than just implementing everything in the kernel.run() method; currently that also has the side effect of bypassing Aparapi's automatic detection of variable usage (that is, whether variables are used for input, output, or both). So it is an alternative way to achieve that, given that you report the results are still being transferred back with setExplicit(true), which I find strange... It is something that will have to be checked when I find some time.

You can have a look at the unit/integration tests available in the Aparapi sources in the src/test/java folder; there are some examples there. As a hint, setSharedMemory(false) must be called on the specific OpenCLDevice instance, the one that represents your GPU card, before calling kernel execute.

gpeabody commented 6 years ago

As far as I understand after a variety of tests, setExplicit disables the transfer of all data except AtomicInteger[] arrays. If the AtomicInteger[] array is small, the overhead is not noticeable. It only becomes significant when the AtomicInteger[] array is large enough.

gpeabody commented 6 years ago

results:

Device isShareMempory(): false ...Device name: Tahiti, Id: 3143440
Device isShareMempory(): false ...Device name: AMD FX(tm)-6100 Six-Core Processor , Id: 520386736
Execute time: 562.142399
Execute time: 100.888389
Execute time: 80.919264
Execute time: 81.842185
Execute time: 81.003895
Execute time: 80.972699
Execute time: 82.700549
Execute time: 81.55493
Execute time: 80.306763
Execute time: 81.585818

Device isShareMempory(): true ...Device name: Tahiti, Id: 4714800
Device isShareMempory(): true ...Device name: AMD FX(tm)-6100 Six-Core Processor , Id: 522310928
Execute time: 558.12825
Execute time: 101.372396
Execute time: 88.441623
Execute time: 87.150831
Execute time: 84.234118
Execute time: 81.801722
Execute time: 82.307351
Execute time: 82.104111
Execute time: 82.955064
Execute time: 82.216233

gpeabody commented 6 years ago

without AtomicInteger[]:

Device isShareMemory(): false ...Device name: Tahiti, Id: 3143440
Device isShareMemory(): false ...Device name: AMD FX(tm)-6100 Six-Core Processor , Id: 521107632
Execution time: 449.848156
Execution time: 18.20946
Execution time: 18.171468
Execution time: 18.020428
Execution time: 18.000968
Execution time: 18.050079
Execution time: 18.031239
Execution time: 18.105677
Execution time: 18.128225
Execution time: 18.144905

Device isShareMemory(): true ...Device name: Tahiti, Id: 2422544
Device isShareMemory(): true ...Device name: AMD FX(tm)-6100 Six-Core Processor , Id: 522877104
Execution time: 456.506282
Execution time: 18.903812
Execution time: 18.929141
Execution time: 20.588421
Execution time: 18.959719
Execution time: 18.14058
Execution time: 18.449456
Execution time: 18.148302
Execution time: 18.225212
Execution time: 18.899797

CoreRasurae commented 6 years ago

@gpeabody I will try to fix the I_ALOAD_1 bytecode issue when passing AtomicInteger parameters, and will have a look at the setExplicit(...) handling of atomic arrays, when I find some time. I currently have a lot of work on other projects.

Anyway, keep in mind that atomics will always slow down the application a bit, because they involve more complex operations than a simple sum of two integers. You will also always have to pass the initial values from AtomicInteger[] into the kernel, because there is no way in OpenCL 1.x to synchronize across all threads; it is only possible to synchronize within the local workgroup. Thus the only way for all threads to see the same initial values is to transfer the global atomic array to the GPU before starting the kernel. What you can save is the transfer of the atomic values from the GPU back to the host at the end of kernel execution. Also note that transferring arrays of AtomicInteger will always be slower, because they are non-primitive types and have to be handled in a special way.
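To get a feel for why AtomicInteger[] is more expensive to move than a primitive array, here is a minimal host-side sketch in plain Java. It assumes the special handling amounts to an element-by-element copy between the AtomicInteger objects and an int[] buffer; Aparapi's actual conversion code may differ, and the class and method names below are hypothetical.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Host-side model of copying an AtomicInteger[] to and from a primitive int[] buffer. */
final class AtomicConversionCost {
    /** Unpacks every AtomicInteger into a primitive buffer (hypothetical "prepare" step). */
    static int[] toIntBuffer(AtomicInteger[] atomics) {
        int[] buffer = new int[atomics.length];
        for (int i = 0; i < atomics.length; i++) {
            buffer[i] = atomics[i].get();
        }
        return buffer;
    }

    /** Writes the buffer back into the AtomicInteger objects (hypothetical "extract" step). */
    static void fromIntBuffer(int[] buffer, AtomicInteger[] atomics) {
        for (int i = 0; i < buffer.length; i++) {
            atomics[i].set(buffer[i]);
        }
    }

    public static void main(String[] args) {
        AtomicInteger[] atomics = new AtomicInteger[2_000_000];
        for (int i = 0; i < atomics.length; i++) {
            atomics[i] = new AtomicInteger(i);
        }
        long start = System.nanoTime();
        int[] buffer = toIntBuffer(atomics);
        fromIntBuffer(buffer, atomics);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("Round trip for " + atomics.length + " atomics: " + elapsedMicros + " us");
    }
}
```

A primitive int[] could be handed over as one contiguous block, whereas here every element is a separate heap object that has to be visited individually on each execution. The timing printed by this sketch only covers the host-side copy, not the bus transfer.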

gpeabody commented 6 years ago

Can I initialize the AtomicInteger array once, transfer it to the GPU before the kernel is first started, and after that neither send it nor receive it back at all? I only need it to work in global GPU memory.

CoreRasurae commented 6 years ago

@gpeabody It should be possible, but only after the setExplicit(...) handling for AtomicInteger arrays is fixed.

gpeabody commented 6 years ago

OK. Thank you for your understanding. I hope it does not take long.

CoreRasurae commented 6 years ago

@gpeabody I can't guarantee any time frame for this at the moment. If you need this soon, maybe you can try looking at the code. The relevant code is in the com.aparapi.internal.kernel.KernelRunner class, in the methods private boolean prepareAtomicIntegerConversionBuffer(KernelArg arg) throws AparapiException and private void extractAtomicIntegerConversionBuffer(KernelArg arg) throws AparapiException. You can propose a fix.

gpeabody commented 6 years ago

If I change the following two lines in KernelRunner.java, will it run correctly, or will it break other logic?

1040 if (!explicit) extractAtomicIntegerConversionBuffer(arg);

1183 if (!explicit) prepareAtomicIntegerConversionBuffer(arg);

CoreRasurae commented 6 years ago

@gpeabody Feel free to try. I believe that, by itself, it is not sufficient; you would also need to define Kernel.put(AtomicInteger[] arr) and Kernel.get(AtomicInteger[] arr) to ensure you can transfer the data to the kernel. You may also need some additional changes to ensure that, even if no transfer is made, memory is allocated on the GPU for the atomic array, which will be an int array in OpenCL. There's nothing wrong with trying small changes until it does what is needed. You can also run the validation tests to help verify that nothing else was broken.

gpeabody commented 6 years ago

I don't use these arrays on the host, only inside the GPU, so I have no need for Kernel.get(AtomicInteger[] arr). But if the first initialization runs on the host, then I need Kernel.put(AtomicInteger[] arr), don't I?

"if no transfer is made, memory is allocated in the GPU for the atomic array which will be an int array in OpenCL" How can I be sure that memory is allocated on the GPU?

CoreRasurae commented 6 years ago

The only way of initializing/allocating GPU global memory in OpenCL is to transfer the data from the host, so yes, you will need kernel.put(AtomicInteger[] arr), or at least to transfer an empty array that will become associated with the AtomicInteger[].

gpeabody commented 6 years ago

Hello. It's me again. I cannot get the two changes I wrote above to work.

If I change if (!explicit) prepareAtomicIntegerConversionBuffer(arg); the build fails its tests, so I can't get the JAR file.

If I only use if (!explicit) extractAtomicIntegerConversionBuffer(arg); the build goes through. My test runs twice as fast. But AtomicInteger does not work: atomicInc(atomics[0]); gives 0 in the end.

What else can I do?
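One possible reading of the 0 result is that the counter was incremented on the device but never copied back to the host-side AtomicInteger objects once the extract step was skipped. The toy model below illustrates that reading in plain Java. It is not Aparapi code: the "device" is just an int[], the class name is invented, and the guards only mimic the if (!explicit) idea discussed above.

```java
import java.util.concurrent.atomic.AtomicInteger;

/** Toy model only: the "device" is a plain int[]; none of this is Aparapi code. */
final class ExplicitTransferModel {
    private final AtomicInteger[] hostAtomics;
    private final boolean explicit;
    private int[] deviceBuffer; // stands in for GPU global memory

    ExplicitTransferModel(AtomicInteger[] hostAtomics, boolean explicit) {
        this.hostAtomics = hostAtomics;
        this.explicit = explicit;
    }

    /** Simulates one kernel execution in which every work item increments counter 0. */
    void execute(int workItems) {
        // Host -> device: always on first use (the buffer must exist), otherwise only when not explicit.
        if (deviceBuffer == null || !explicit) {
            deviceBuffer = new int[hostAtomics.length];
            for (int i = 0; i < hostAtomics.length; i++) {
                deviceBuffer[i] = hostAtomics[i].get();
            }
        }
        // "Kernel" body, run sequentially here.
        for (int gid = 0; gid < workItems; gid++) {
            deviceBuffer[0]++;
        }
        // Device -> host: skipped entirely in explicit mode.
        if (!explicit) {
            for (int i = 0; i < hostAtomics.length; i++) {
                hostAtomics[i].set(deviceBuffer[i]);
            }
        }
    }

    int deviceValue(int index) {
        return deviceBuffer[index];
    }

    static AtomicInteger[] zeroedAtomics(int length) {
        AtomicInteger[] atomics = new AtomicInteger[length];
        for (int i = 0; i < length; i++) {
            atomics[i] = new AtomicInteger(0);
        }
        return atomics;
    }

    public static void main(String[] args) {
        AtomicInteger[] host = zeroedAtomics(4);
        ExplicitTransferModel model = new ExplicitTransferModel(host, true);
        for (int run = 0; run < 3; run++) {
            model.execute(10);
        }
        System.out.println("host counter   = " + host[0].get());
        System.out.println("device counter = " + model.deviceValue(0));
    }
}
```

In this model the host-side counter stays at its initial value in explicit mode while the device-side value keeps accumulating across executions. If the real behaviour is similar, reading the host AtomicInteger after the run would not reveal whether the kernel-side atomics worked; writing a value derived from the counter into an ordinary result array that is fetched back might be one way to check.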

gpeabody commented 6 years ago

Now it does not compile at all. I did a fresh git clone, but it still does not compile.

CoreRasurae commented 6 years ago

@gpeabody Aparapi 1.10.0 will be released soon, it will include #139 which fixes one of the issues you were having:

Exception in thread "main" java.lang.ClassCastException: com.aparapi.internal.instruction.InstructionSet$I_ALOAD_1 cannot be cast to com.aparapi.internal.instruction.InstructionSet$AccessField

gpeabody commented 5 years ago

@CoreRasurae Hello, I've installed version 1.10.1. The I_ALOAD_1 error is gone, thank you. But the kernel run time is the same as before. It does not matter whether atomicInc is used inside the run() method or inside atomicUpdate(): it takes 66 milliseconds, versus 15 milliseconds with no atomics.

Without atomics:

Execution time: 346.024275
Execution time: 17.025696
Execution time: 16.60802
Execution time: 16.362656
Execution time: 15.769028
Execution time: 15.703015
Execution time: 16.618698
Execution time: 18.433077
Execution time: 17.662524
Execution time: 17.416675

With atomicUpdate:

Execution time: 444.818112
Execution time: 77.709088
Execution time: 66.59614
Execution time: 66.905817
Execution time: 67.165014
Execution time: 66.602935
Execution time: 66.466299
Execution time: 68.917506
Execution time: 67.030319
Execution time: 68.720196
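The two series above can be compared more precisely by averaging the steady-state runs. The sketch below is plain Java and treats the first run of each series as a warm-up to be excluded, which is an assumption (the first run is visibly slower in both series, possibly because of one-time kernel setup). The class name is invented for illustration.

```java
import java.util.Arrays;

/** Compares the two timing series posted above, ignoring the first (warm-up) run of each. */
final class SteadyStateTimes {
    // Values copied from the "Without atomics" and "With atomicUpdate" results, in milliseconds.
    static final double[] WITHOUT_ATOMICS = {
        346.024275, 17.025696, 16.60802, 16.362656, 15.769028,
        15.703015, 16.618698, 18.433077, 17.662524, 17.416675
    };
    static final double[] WITH_ATOMIC_UPDATE = {
        444.818112, 77.709088, 66.59614, 66.905817, 67.165014,
        66.602935, 66.466299, 68.917506, 67.030319, 68.720196
    };

    /** Mean of all samples except the first, which is treated as a warm-up run. */
    static double steadyStateMean(double[] samples) {
        return Arrays.stream(samples).skip(1).average().orElse(Double.NaN);
    }

    public static void main(String[] args) {
        double base = steadyStateMean(WITHOUT_ATOMICS);
        double atomic = steadyStateMean(WITH_ATOMIC_UPDATE);
        System.out.printf("without atomics:   %.2f ms%n", base);
        System.out.printf("with atomicUpdate: %.2f ms%n", atomic);
        System.out.printf("slowdown factor:   %.2f%n", atomic / base);
    }
}
```

On these numbers the run with atomicUpdate is roughly four times slower than the run without atomics once the warm-up run is set aside.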