beehive-lab / TornadoVM

TornadoVM: A practical and efficient heterogeneous programming framework for managed languages
https://www.tornadovm.org
Apache License 2.0
1.18k stars 111 forks source link

Kernel fails with clEnqueueNDRangeKernel -> Returned: -5 with openCL. #153

Closed yazun closed 2 years ago

yazun commented 2 years ago

Describe the bug This is continuation of the issue https://github.com/beehive-lab/TornadoVM/issues/152 but with asm ver 9.2 replacing bundled ver 7.2

How To Reproduce Same test as in https://github.com/beehive-lab/TornadoVM/issues/152

The code of the computeGPU function looks like:

private static void computeGPU(double[] normObsTimes, double[] normObsValues, int nScanFrequencies, double firstScanFrequency, double frequencyStep, double deltaEpsilon, double[] amplitudes)
    {
        int nObservations = normObsTimes.length;
        final double omega = firstScanFrequency * 2 * Math.PI;      
        // starting frequency
        final double deltaOmega = frequencyStep * 2 * Math.PI;      
        // frequency step
        final double[] sumSx = new double[nScanFrequencies];        
        // sum of sin(obsTimes[i]*2pi*f)
        final double[] sumCx = new double[nScanFrequencies];        
        // sum of cos(obsTimes[i]*2pi*f)
        final double[] sumSx2 = new double[nScanFrequencies];       
        // sum of sin(obsTimes[i]*2pi*f)*sin(obsTimes[i]*2pi*f)
        final double[] sumCx2 = new double[nScanFrequencies];       
        // sum of cos(obsTimes[i]*2pi*f)*cos(obsTimes[i]*2pi*f)
        final double[] sumSxCx = new double[nScanFrequencies];      
        // sum of sin(obsTimes[i]*2pi*f)*cos(obsTimes[i]*2pi*f)
        final double[] sumSxVal = new double[nScanFrequencies];         
        // sum of sin(obsTimes[i]*2pi*f)*obsValue[i]
        final double[] sumCxVal = new double[nScanFrequencies];         
        // sum of cos(obsTimes[i]*2pi*f)*obsValue[i]

        // for each set of observation data 
        for (int i = 0; i < nObservations; i++)   {
            final double obsTime = normObsTimes[i];
            final double obsValue = normObsValues[i];

            // calculate the starting phase and it's sine and cosine
            final double phase = obsTime * omega;
            double sPh = Math.sin(phase);
            double cPh = Math.cos(phase);

            // calculate the phase step and it's sine and cosine
            final double dPhase = obsTime * deltaOmega;
            final double sDPh = Math.sin(dPhase);
            final double cDPh = Math.cos(dPhase);

            // for each frequency to test, increment the phase with the phase step
            for (@Parallel int j = 0; j < nScanFrequencies; j++) {
                sumSx[j] += sPh;
                sumCx[j] += cPh;
                sumSx2[j] += sPh*sPh;
                sumCx2[j] += cPh*cPh;
                sumSxCx[j] += cPh*sPh;
                sumSxVal[j] += sPh*obsValue;
                sumCxVal[j] += cPh*obsValue;

                final double cT = cPh;
                cPh = cT * cDPh - sPh * sDPh;
                sPh = sPh * cDPh + cT * sDPh;
            }
        }

        // calculate intermediate variables and results
        for (@Parallel int i = 0; i < nScanFrequencies; i++)    {
            final double d = nObservations * ( sumCx2[i] * sumSx2[i] - sumSxCx[i] * sumSxCx[i]) 
                    - sumCx[i] * sumCx[i] * sumSx2[i] 
                            - sumSx[i] * sumSx[i] * sumCx2[i] 
                                    + 2 * sumSx[i] * sumCx[i] * sumSxCx[i];  
            if (d < deltaEpsilon) {
                amplitudes[i] = MAXRANGE;
            } else {
                final double b = sumCx[i] * sumSxVal[i] - sumSx[i] * sumCxVal[i];
                final double c1 = nObservations
                        * ( sumCxVal[i] * sumSx2[i] - sumSxVal[i] * sumSxCx[i] ) + sumSx[i] * b;
                final double c2 = nObservations
                        * ( sumSxVal[i] * sumCx2[i] - sumCxVal[i] * sumSxCx[i] ) - sumCx[i] * b;
                amplitudes[i] = ( c1 * sumCxVal[i] + c2 * sumSxVal[i] ) / d;
            }
        }

    }

// then called as 

        TaskSchedule task = new TaskSchedule("PeriodSearch") 
//                .streamIn( normObsTimes, normObsValues, nScanFrequencies, firstScanFrequency, frequencyStep, deltaEpsilon) //
                .task("t0", MethodLeastSquareGPU::computeGPU, normObsTimes, normObsValues, nScanFrequencies, firstScanFrequency, frequencyStep, deltaEpsilon,  amplitudes ) 
                .streamOut(amplitudes);
        task.execute();

This time the kernel is generated but the code fails.

 java -cp build:lib/compile/AGISDm-17.1.0.jar:lib/compile/AGISLab-17.1.0.jar:lib/compile/AGISTools-17.1.0.jar:lib/compile/GaiaMdbDm-20.0.18.jar:lib/compile/GaiaParameters-21.1.0.jar:lib/compile/GaiaTools-20.6.1.jar:lib/compile/GaiaToolsDm-21.3.0.jar:lib/compile/HikariCP-3.4.5.jar:lib/compile/Taglets-1.0.jar:lib/compile/VariConfiguration-SB-22.0.0-r734000-20211123155220.jar:lib/compile/VariFramework-SB-22.0.0-r734040M-20211124090454.jar:lib/compile/VariObjectModel-SB-22.0.0-r73320211123174628.jar:lib/compile/VariStatistics-SB-22.0.0-r728093M-20211124091035.jar:lib/compile/activemq-broker-5.14.5.jar:lib/compile/activemq-client-5.14.5.jar:lib/compile/activemq-http-5.14.5.jar:lib/compile/activemq-jms-pool-5.14.5.jar:lib/compile/activemq-mqtt-5.14.5.jar:lib/compile/activemq-openwire-legacy-5.14.5.jar:lib/compile/activemq-pool-5.14.5.jar:lib/compile/activemq-spring-5.14.5.jar:lib/compile/activemq-stomp-5.14.5.jar:lib/compile/annotations-13.0.jar:lib/compile/aopalliance-1.0.jar:lib/compile/aspectjrt-1.9.2.jar:lib/compile/aspectjtools-1.9.2.jar:lib/compile/aspectjweaver-1.9.6.jar:lib/compile/assertj-core-3.18.1.jar:lib/compile/atomikos-util-3.9.3.jar:lib/compile/byte-buddy-1.10.17.jar:lib/compile/byte-buddy-agent-1.10.5.jar:lib/compile/cache-api-1.0.0.jar:lib/compile/caffeine-2.8.8.jar:lib/compile/camel-api-3.7.2.jar:lib/compile/camel-base-3.7.2.jar:lib/compile/camel-base-engine-3.7.2.jar:lib/compile/camel-bean-3.7.2.jar:lib/compile/camel-browse-3.7.2.jar:lib/compile/camel-caffeine-lrucache-3.7.2.jar:lib/compile/camel-cloud-3.7.2.jar:lib/compile/camel-cluster-3.7.2.jar:lib/compile/camel-controlbus-3.7.2.jar:lib/compile/camel-core-3.7.2.jar:lib/compile/camel-core-catalog-3.7.2.jar:lib/compile/camel-core-engine-3.7.2.jar:lib/compile/camel-core-languages-3.7.2.jar:lib/compile/camel-core-model-3.7.2.jar:lib/compile/camel-core-processor-3.7.2.jar:lib/compile/camel-core-reifier-3.7.2.jar:lib/compile/camel-core-starter-3.7.2.jar:lib/compile/camel-core-xml-3.7.2.jar:lib/compile/camel-dataformat-3.7.2.jar:lib/compile/camel-dataset-3.7.2.jar:lib/compile/camel-direct-3.7.2.jar:lib/compile/camel-directvm-3.7.2.jar:lib/compile/camel-file-3.7.2.jar:lib/compile/camel-health-3.7.2.jar:lib/compile/camel-jms-3.7.2.jar:lib/compile/camel-language-3.7.2.jar:lib/compile/camel-log-3.7.2.jar:lib/compile/camel-main-3.7.2.jar:lib/compile/camel-management-api-3.7.2.jar:lib/compile/camel-metrics-3.7.2.jar:lib/compile/camel-mock-3.7.2.jar:lib/compile/camel-ref-3.7.2.jar:lib/compile/camel-rest-3.7.2.jar:lib/compile/camel-saga-3.7.2.jar:lib/compile/camel-scheduler-3.7.2.jar:lib/compile/camel-seda-3.7.2.jar:lib/compile/camel-spring-3.7.2.jar:lib/compile/camel-spring-boot-3.7.2.jar:lib/compile/camel-spring-boot-starter-3.7.2.jar:lib/compile/camel-stub-3.7.2.jar:lib/compile/camel-support-3.7.2.jar:lib/compile/camel-test-3.7.2.jar:lib/compile/camel-test-spring-3.7.2.jar:lib/compile/camel-timer-3.7.2.jar:lib/compile/camel-tooling-model-3.7.2.jar:lib/compile/camel-util-3.7.2.jar:lib/compile/camel-util-json-3.7.2.jar:lib/compile/camel-validator-3.7.2.jar:lib/compile/camel-vm-3.7.2.jar:lib/compile/camel-xml-jaxb-3.7.2.jar:lib/compile/camel-xml-jaxp-3.7.2.jar:lib/compile/camel-xpath-3.7.2.jar:lib/compile/camel-xslt-3.7.2.jar:lib/compile/checker-qual-3.5.0.jar:lib/compile/classmate-1.5.1.jar:lib/compile/cloning-1.9.12.jar:lib/compile/common-3.6.jar:lib/compile/commons-beanutils-1.9.3.jar:lib/compile/commons-codec-1.9.jar:lib/compile/commons-collections-3.2.2.jar:lib/compile/commons-collections4-4.4.jar:lib/compile/commons-compress-1.19.jar:lib/compile/commons-configuration-1.7.jar:lib/compile/commons-dbcp-1.4.jar:lib/compile/commons-digester-1.8.1.jar:lib/compile/commons-io-2.6.jar:lib/compile/commons-lang-2.6.jar:lib/compile/commons-lang3-3.9.jar:lib/compile/commons-logging-1.2.jar:lib/compile/commons-math3-3.6.1.jar:lib/compile/commons-net-3.5.jar:lib/compile/commons-pool-1.6.jar:lib/compile/commons-pool2-2.4.2.jar:lib/compile/commons-rng-client-api-1.3.jar:lib/compile/commons-rng-core-1.3.jar:lib/compile/commons-rng-sampling-1.3.jar:lib/compile/commons-rng-simple-1.3.jar:lib/compile/disruptor-3.4.2.jar:lib/compile/ejb-api-3.0-alpha-1.jar:lib/compile/error_prone_annotations-2.3.4.jar:lib/compile/failureaccess-1.0.1.jar:lib/compile/flanagan-1.0.jar:lib/compile/freehep-io-2.0.5.jar:lib/compile/fst-2.57.jar:lib/compile/geojson-jackson-1.5.1.jar:lib/compile/geronimo-j2ee-management_1.1_spec-1.0.1.jar:lib/compile/geronimo-jms_1.1_spec-1.1.1.jar:lib/compile/geronimo-jms_2.0_spec-1.0-alpha-2.jar:lib/compile/geronimo-jta_1.0.1B_spec-1.0.1.jar:lib/compile/groovy-4.0.0-beta-1.jar:lib/compile/groovy-jsr223-4.0.0-beta-1.jar:lib/compile/gson-2.5.jar:lib/compile/guava-30.1-jre.jar:lib/compile/guice-5.0.1.jar:lib/compile/hamcrest-2.2.jar:lib/compile/hamcrest-all-1.3.jar:lib/compile/hamcrest-core-1.3.jar:lib/compile/hamcrest-library-1.3.jar:lib/compile/hawtbuf-1.11.jar:lib/compile/hawtdispatch-1.22.jar:lib/compile/hawtdispatch-transport-1.22.jar:lib/compile/hibernate-commons-annotations-5.1.2.Final.jar:lib/compile/hibernate-core-5.4.27.Final.jar:lib/compile/htmIndex-3.0.2.jar:lib/compile/httpclient-4.5.2.jar:lib/compile/httpcore-4.4.5.jar:lib/compile/ignite-core-2.7.0.jar:lib/compile/ignite-shmem-1.0.0.jar:lib/compile/interval-tree-1.0.0.jar:lib/compile/istack-commons-runtime-4.0.0.jar:lib/compile/itext-2.0.1.jar:lib/compile/j2objc-annotations-1.3.jar:lib/compile/j3d-core-1.3.1.jar:lib/compile/j3d-core-utils-1.3.1.jar:lib/compile/jackson-annotations-2.13.0.jar:lib/compile/jackson-core-2.13.0.jar:lib/compile/jackson-databind-2.13.0.jar:lib/compile/jakarta.activation-2.0.0.jar:lib/compile/jakarta.annotation-api-1.3.5.jar:lib/compile/jakarta.persistence-api-2.2.3.jar:lib/compile/jakarta.transaction-api-1.3.3.jar:lib/compile/jakarta.xml.bind-api-3.0.0.jar:lib/compile/jama-1.0.3.jar:lib/compile/jandex-2.1.3.Final.jar:lib/compile/javaparser-1.0.11.jar:lib/compile/javassist-3.27.0-GA.jar:lib/compile/javax.activation-1.2.0.jar:lib/compile/javax.activation-api-1.2.0.jar:lib/compile/javax.inject-1.jar:lib/compile/javax.persistence-2.1.0.jar:lib/compile/javax.servlet-api-4.0.1.jar:lib/compile/jaxb-api-2.1.9.jar:lib/compile/jaxb-api-2.3.1.jar:lib/compile/jaxb-core-2.3.0.1.jar:lib/compile/jaxb-core-3.0.0.jar:lib/compile/jaxb-impl-2.4.0-b180830.0438.jar:lib/compile/jaxb-runtime-3.0.0.jar:lib/compile/jaxb2-basics-0.12.0.jar:lib/compile/jaxb2-basics-ant-0.12.0.jar:lib/compile/jaxb2-basics-runtime-0.12.0.jar:lib/compile/jaxb2-basics-tools-0.12.0.jar:lib/compile/jaxb2-default-value-1.1.jar:lib/compile/jaxb2-value-constructor-3.0.jar:lib/compile/jaxen-1.0-FCS.jar:lib/compile/jaxrpc-api-1.1.jar:lib/compile/jboss-serialization-4.2.2.GA.jar:lib/compile/jcl-over-slf4j-1.7.7.jar:lib/compile/jcommon-1.0.23latex.jar:lib/compile/jdom-1.0.jar:lib/compile/jetty-all-9.2.13.v20150730.jar:lib/compile/jfreechart-1.5.0.jar:lib/compile/jfreechartbinding-0.0.6.jar:lib/compile/jhealpix-3.2.0.jar:lib/compile/jmathtex-0.7pre.jar:lib/compile/jolokia-jvm-agent-1.6.2.jar:lib/compile/jsap-2.1.jar:lib/compile/jsr305-3.0.1.jar:lib/compile/junit-4.13.1.jar:lib/compile/junit-addons-1.4.jar:lib/compile/krasa-jaxb-tools-1.4.jar:lib/compile/kryo-5.1.1.jar:lib/compile/listenablefuture-9999.0-empty-to-avoid-conflict-with-guava.jar:lib/compile/log4j-api-2.12.1.jar:lib/compile/log4j-core-2.12.1.jar:lib/compile/log4j-slf4j-impl-2.12.1.jar:lib/compile/log4j2-logstash-layout-1.0.2.jar:lib/compile/lombok-1.18.20.jar:lib/compile/lz4-1.3.0.jar:lib/compile/lz4-java-1.4.0.jar:lib/compile/management-api-1.1-rev-1.jar:lib/compile/metrics-core-4.1.12.1.jar:lib/compile/metrics-jmx-4.1.12.1.jar:lib/compile/metrics-json-4.1.12.1.jar:lib/compile/minlog-1.3.1.jar:lib/compile/mockito-core-3.2.4.jar:lib/compile/mvel2-2.4.12.Final.jar:lib/compile/net-ivoa-fits-0.1.jar:lib/compile/objenesis-3.2.jar:lib/compile/ognl-2.6.9.jar:lib/compile/openjpa-all-3.2.3-CU7.jar:lib/compile/org.apache.bval.bundle-1.1.2.jar:lib/compile/postgresql-42.2.12.jre7.jar:lib/compile/preferences-3.6.jar:lib/compile/reflectasm-1.11.9.jar:lib/compile/rxjava-1.2.0.jar:lib/compile/saxpath-1.0-FCS.jar:lib/compile/simple-5.1.6.jar:lib/compile/slf4j-api-1.7.30.jar:lib/compile/snakeyaml-1.27.jar:lib/compile/spring-aop-5.3.3.jar:lib/compile/spring-aspects-5.3.3.jar:lib/compile/spring-beans-5.3.3.jar:lib/compile/spring-boot-2.4.2.jar:lib/compile/spring-boot-autoconfigure-2.4.2.jar:lib/compile/spring-boot-starter-2.4.2.jar:lib/compile/spring-boot-starter-aop-2.4.2.jar:lib/compile/spring-boot-starter-data-jpa-2.4.2.jar:lib/compile/spring-boot-starter-jdbc-2.4.2.jar:lib/compile/spring-context-5.3.3.jar:lib/compile/spring-core-5.3.3.jar:lib/compile/spring-data-commons-2.4.3.jar:lib/compile/spring-data-jpa-2.4.3.jar:lib/compile/spring-expression-5.3.3.jar:lib/compile/spring-jcl-5.3.3.jar:lib/compile/spring-jdbc-5.3.3.jar:lib/compile/spring-jms-5.3.3.jar:lib/compile/spring-messaging-5.3.3.jar:lib/compile/spring-orm-5.3.3.jar:lib/compile/spring-test-5.3.3.jar:lib/compile/spring-tx-5.3.3.jar:lib/compile/stringtemplate-3.2.1.jar:lib/compile/text-3.5.jar:lib/compile/tornado-api-0.12.jar:lib/compile/tornado-matrices-0.12.jar:lib/compile/transaction-api-1.1.jar:lib/compile/transactions-3.9.3.jar:lib/compile/transactions-api-3.9.3.jar:lib/compile/transactions-jdbc-3.9.3.jar:lib/compile/transactions-jta-3.9.3.jar:lib/compile/txw2-3.0.0.jar:lib/compile/unitils-core-3.4.6.jar:lib/compile/validation-api-1.0.0.GA.jar:lib/compile/vecmath-1.3.1.jar:lib/compile/xbean-spring-4.2.jar:lib/compile/xmlpull-1.1.3.1.jar:lib/compile/xmlunit-1.4.jar:lib/compile/xpp3_min-1.1.4c.jar:lib/compile/xstream-1.4.11.1.jar:lib/compile/xz-1.6.jar:lib/compile/zstd-jni-1.4.4-7.jar:$CU7COMMON/conf:$SOFCOMMON/apache-ant-1.10.5/lib/ant-launcher.jar:$SOFCOMMON/apache-ant-1.10.5/lib/ant.jar:$SOFCOMMON/apache-ant-1.10.5/lib/ant-junit.jar:$SOFCOMMON/apache-ant-1.10.5/lib/ant-junit4.jar -server -XX:-UseCompressedOops -XX:+UnlockExperimentalVMOptions -XX:+EnableJVMCI -Djava.library.path=/d00/local/tornado/tornadovm_fresh/bin/sdk/lib --module-path .:/d00/local/tornado/tornadovm_fresh/bin/sdk/share/java/tornado  -Dtornado.load.api.implementation=uk.ac.manchester.tornado.runtime.tasks.TornadoTaskSchedule -Dtornado.load.runtime.implementation=uk.ac.manchester.tornado.runtime.TornadoCoreRuntime -Dtornado.load.tornado.implementation=uk.ac.manchester.tornado.runtime.common.Tornado -Dtornado.load.device.implementation.opencl=uk.ac.manchester.tornado.drivers.opencl.runtime.OCLDeviceFactory -Dtornado.load.device.implementation.ptx=uk.ac.manchester.tornado.drivers.ptx.runtime.PTXDeviceFactory -Dtornado.load.device.implementation.spirv=uk.ac.manchester.tornado.drivers.spirv.runtime.SPIRVDeviceFactory -Dtornado.load.annotation.implementation=uk.ac.manchester.tornado.annotation.ASMClassVisitor -Dtornado.load.annotation.parallel=uk.ac.manchester.tornado.api.annotations.Parallel   -XX:+UseParallelGC @/d00/local/tornado/tornadovm_fresh/bin/sdk/etc/exportLists/common-exports  --upgrade-module-path /d00/local/tornado/tornadovm_fresh/bin/sdk/share/java/graalJars  @/d00/local/tornado/tornadovm_fresh/bin/sdk/etc/exportLists/opencl-exports --add-modules ALL-SYSTEM,tornado.runtime,tornado.annotation,tornado.drivers.common,tornado.drivers.opencl,org.apache.logging.log4j -DPeriodSearch.t0.device=0:0  -Dtornado.debug=True -Dtornado.print.kernel=True -Dtornado.profiler=True  -Dtornado.heap.allocation=4096MB org.junit.runner.JUnitCore gaia.cu7.algo.character.periodsearch.methods.test.MethodLeastSquareTest
WARNING: Using incubator modules: jdk.incubator.vector, jdk.incubator.foreign
JUnit version 4.13.1
........#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void lookupBufferAddress(__global uchar *_heap_base, ulong _frame_base, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics)
{

  __global ulong *_frame = (__global ulong *) &_heap_base[_frame_base];

  // BLOCK 0
  _frame[0]  =  (ulong) _heap_base;
}  //  kernel

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void computeGPU(__global uchar *_heap_base, ulong _frame_base, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics)
{
  int i_45, i_10, i_128, i_61, i_30, i_62, i_94, i_27, i_59, i_60, i_25, i_57, i_89, i_26, i_58, i_23, i_55, i_24, i_56, i_21, i_22, i_54;
  ulong ul_1, ul_2, ul_0, ul_16, ul_93, ul_14;
  double d_53, d_48, d_47, d_46, d_52, d_51, d_50, d_49, d_40, d_39, d_38, d_37, d_44, d_43, d_42, d_41, d_32, d_31, d_29, d_36, d_35, d_34, d_33, d_88, d_87, d_86, d_85, d_80, d_79, d_78, d_77, d_84, d_83, d_82, d_81, d_72, d_71, d_70, d_69, d_76, d_75, d_74, d_73, d_64, d_63, d_68, d_67, d_66, d_65, d_120, d_119, d_118, d_117, d_124, d_123, d_122, d_121, d_112, d_111, d_110, d_109, d_116, d_115, d_114, d_113, d_104, d_103, d_102, d_101, d_108, d_107, d_106, d_105, d_96, d_95, d_100, d_99, d_98, d_97, d_28, d_15, d_20, d_19, d_18, d_17, d_127, d_126, d_125;
  long l_90, l_11, l_91, l_12, l_92, l_13;

  __global ulong *_frame = (__global ulong *) &_heap_base[_frame_base];

  // BLOCK 0
  ul_0  =  (ulong) _frame[3];
  ul_1  =  (ulong) _frame[4];
  ul_2  =  (ulong) _frame[9];
  __private double ul_3[1494001];
  __private double ul_4[1494001];
  __private double ul_5[1494001];
  __private double ul_6[1494001];
  __private double ul_7[1494001];
  __private double ul_8[1494001];
  __private double ul_9[1494001];
  // BLOCK 1 MERGES [0 5 ]
  i_10  =  0;
  for(;i_10 < 165;)
  {
    // BLOCK 2
    l_11  =  (long) i_10;
    l_12  =  l_11 << 3;
    l_13  =  l_12 + 24L;
    ul_14  =  ul_0 + l_13;
    d_15  =  *((__global double *) ul_14);
    ul_16  =  ul_1 + l_13;
    d_17  =  *((__global double *) ul_16);
    d_18  =  d_15 * 0.37699111843077515;
    d_19  =  native_cos(d_18);
    d_20  =  native_sin(d_18);
    i_21  =  get_global_size(1);
    i_22  =  i_21 + 1494000;
    i_23  =  i_22 / i_21;
    i_24  =  get_global_id(1);
    i_25  =  i_23 * i_24;
    i_26  =  i_25 + i_23;
    i_27  =  min(i_26, 1494001);
    // BLOCK 3 MERGES [2 4 ]
    d_28  =  d_20;
    d_29  =  d_19;
    i_30  =  i_25;
    for(;i_30 < i_27;)
    {
      // BLOCK 4
      d_31  =  ul_3[i_30];
      d_32  =  d_28 + d_31;
      ul_3[i_30]  =  d_32;
      d_31  =  d_32;
      d_33  =  ul_4[i_30];
      d_34  =  d_29 + d_33;
      ul_4[i_30]  =  d_34;
      d_33  =  d_34;
      d_35  =  ul_5[i_30];
      d_36  =  fma(d_28, d_28, d_35);
      ul_5[i_30]  =  d_36;
      d_35  =  d_36;
      d_37  =  ul_6[i_30];
      d_38  =  fma(d_29, d_29, d_37);
      ul_6[i_30]  =  d_38;
      d_37  =  d_38;
      d_39  =  ul_7[i_30];
      d_40  =  fma(d_28, d_29, d_39);
      ul_7[i_30]  =  d_40;
      d_39  =  d_40;
      d_41  =  ul_8[i_30];
      d_42  =  fma(d_28, d_17, d_41);
      ul_8[i_30]  =  d_42;
      d_41  =  d_42;
      d_43  =  ul_9[i_30];
      d_44  =  fma(d_29, d_17, d_43);
      ul_9[i_30]  =  d_44;
      d_43  =  d_44;
      i_45  =  i_30 + 1;
      d_46  =  d_15 * 6.283185307179587E-5;
      d_47  =  native_cos(d_46);
      d_48  =  d_47 * d_29;
      d_49  =  native_sin(d_46);
      d_50  =  d_49 * d_28;
      d_51  =  d_48 - d_50;
      d_52  =  d_49 * d_29;
      d_53  =  fma(d_47, d_28, d_52);
      d_28  =  d_53;
      d_29  =  d_51;
      i_30  =  i_45;
    }  // B4
    // BLOCK 5
    i_54  =  i_10 + 1;
    i_10  =  i_54;
  }  // B5
  // BLOCK 6
  i_55  =  get_global_size(0);
  i_56  =  i_55 + 1494000;
  i_57  =  i_56 / i_55;
  i_58  =  get_global_id(0);
  i_59  =  i_57 * i_58;
  i_60  =  i_59 + i_57;
  i_61  =  min(i_60, 1494001);
  // BLOCK 7 MERGES [6 11 ]
  i_62  =  i_59;
  for(;i_62 < i_61;)
  {
    // BLOCK 8
    d_63  =  ul_6[i_62];
    d_64  =  ul_5[i_62];
    d_65  =  ul_7[i_62];
    d_66  =  ul_7[i_62];
    d_67  =  ul_4[i_62];
    d_68  =  ul_4[i_62];
    d_69  =  ul_5[i_62];
    d_70  =  ul_3[i_62];
    d_71  =  ul_3[i_62];
    d_72  =  ul_6[i_62];
    d_73  =  ul_3[i_62];
    d_74  =  ul_4[i_62];
    d_75  =  ul_7[i_62];
    d_76  =  d_73 * 2.0;
    d_77  =  d_76 * d_74;
    d_78  =  d_63 * d_64;
    d_79  =  d_65 * d_66;
    d_80  =  d_78 - d_79;
    d_81  =  d_80 * 165.0;
    d_82  =  d_67 * d_68;
    d_83  =  d_82 * d_69;
    d_84  =  d_81 - d_83;
    d_85  =  d_70 * d_71;
    d_86  =  d_85 * d_72;
    d_87  =  d_84 - d_86;
    d_88  =  fma(d_77, d_75, d_87);
    i_89  =  i_62 + 1;
    l_90  =  (long) i_62;
    l_91  =  l_90 << 3;
    l_92  =  l_91 + 24L;
    ul_93  =  ul_2 + l_92;
    i_94  =  isless(d_88, 0.0);
    if(i_94 == 1)
    {
      // BLOCK 9
      *((__global double *) ul_93)  =  1.0;
    }  // B9
    else
    {
      // BLOCK 10
      d_95  =  ul_4[i_62];
      d_96  =  ul_8[i_62];
      d_97  =  ul_3[i_62];
      d_98  =  ul_9[i_62];
      d_99  =  ul_9[i_62];
      d_100  =  ul_5[i_62];
      d_101  =  ul_8[i_62];
      d_102  =  ul_7[i_62];
      d_103  =  ul_3[i_62];
      d_104  =  ul_8[i_62];
      d_105  =  ul_6[i_62];
      d_106  =  ul_9[i_62];
      d_107  =  ul_7[i_62];
      d_108  =  ul_4[i_62];
      d_109  =  ul_9[i_62];
      d_110  =  ul_8[i_62];
      d_111  =  d_95 * d_96;
      d_112  =  d_97 * d_98;
      d_113  =  d_111 - d_112;
      d_114  =  d_99 * d_100;
      d_115  =  d_101 * d_102;
      d_116  =  d_114 - d_115;
      d_117  =  d_116 * 165.0;
      d_118  =  fma(d_113, d_103, d_117);
      d_119  =  d_104 * d_105;
      d_120  =  d_106 * d_107;
      d_121  =  d_119 - d_120;
      d_122  =  d_121 * 165.0;
      d_123  =  d_108 * d_113;
      d_124  =  d_122 - d_123;
      d_125  =  d_124 * d_110;
      d_126  =  fma(d_118, d_109, d_125);
      d_127  =  d_126 / d_88;
      *((__global double *) ul_93)  =  d_127;
    }  // B10
    // BLOCK 11 MERGES [9 10 ]
    i_128  =  i_89;
    i_62  =  i_128;
  }  // B11
  // BLOCK 12
  return;
}  //  kernel

[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -5
[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
{
    "PeriodSearch": {
        "TOTAL_GRAAL_COMPILE_TIME": "80653293",
        "TOTAL_DISPATCH_KERNEL_TIME": "0",
        "TOTAL_BYTE_CODE_GENERATION": "7282076",
        "COPY_IN_TIME": "8736",
        "TOTAL_TASK_SCHEDULE_TIME": "323625524",
        "TOTAL_DRIVER_COMPILE_TIME": "134155777",
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "102839",
        "TOTAL_KERNEL_TIME": "0",
        "COPY_OUT_TIME": "1016541",
        "TOTAL_COPY_OUT_SIZE_BYTES": "11952032",
        "TOTAL_COPY_IN_SIZE_BYTES": "5376",
        "PeriodSearch.t0": {
            "METHOD": "MethodLeastSquareGPU.computeGPU",
            "DEVICE_ID": "0:0",
            "DEVICE": "Intel(R) Xeon(R) CPU E3-1585L v5 @ 3.00GHz",
            "TOTAL_COPY_IN_SIZE_BYTES": "80",
            "TASK_COMPILE_DRIVER_TIME": "134155777",
            "TASK_KERNEL_TIME": "0",
            "TASK_COMPILE_GRAAL_TIME": "80653293"
        }
    }
}

E................................I..

test fails with empty array returned.

Time: 6.239
There was 1 failure:
1) testGPUGaiaSource4660664932119220224[0](gaia.cu7.algo.character.periodsearch.methods.test.MethodLeastSquareTest)
java.lang.IndexOutOfBoundsException: Index 0 out of bounds for length 0
        at java.base/jdk.internal.util.Preconditions.outOfBounds(Preconditions.java:64)
        at java.base/jdk.internal.util.Preconditions.outOfBoundsCheckIndex(Preconditions.java:70)
        at java.base/jdk.internal.util.Preconditions.checkIndex(Preconditions.java:266)
        at java.base/java.util.Objects.checkIndex(Objects.java:359)
        at java.base/java.util.ArrayList.get(ArrayList.java:427)

I understand [TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -5 is CL_OUT_OF_RESOURCES and -58 is CL_INVALID_EVENT?

Expected behavior

To run without any error.

Computing system setup (please complete the following information):

Additional context Normally, nScanFrequencies could be 10-1000x higher but we lowered it for the sake of the test.

yazun commented 2 years ago

nb. Error is similar for both CPU and GPU generated kernels.

GPU kernel and profile for completeness

........#pragma OPENCL EXTENSION cl_khr_fp64 : enable
__kernel void lookupBufferAddress(__global uchar *_heap_base, ulong _frame_base, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics)
{

  __global ulong *_frame = (__global ulong *) &_heap_base[_frame_base];

  // BLOCK 0
  _frame[0]  =  (ulong) _heap_base;
}  //  kernel

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
__kernel void lookupBufferAddress(__global uchar *_heap_base, ulong _frame_base, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics)
{

  __global ulong *_frame = (__global ulong *) &_heap_base[_frame_base];

  // BLOCK 0
  _frame[0]  =  (ulong) _heap_base;
}  //  kernel

#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
__kernel void computeGPU(__global uchar *_heap_base, ulong _frame_base, __constant uchar *_constant_region, __local uchar *_local_region, __global int *_atomics)
{
  double d_85, d_87, d_86, d_73, d_72, d_75, d_74, d_77, d_76, d_65, d_64, d_67, d_66, d_69, d_68, d_71, d_70, d_57, d_56, d_59, d_58, d_61, d_60, d_63, d_62, d_113, d_112, d_115, d_114, d_117, d_116, d_105, d_104, d_107, d_106, d_109, d_108, d_111, d_110, d_97, d_96, d_99, d_98, d_101, d_100, d_103, d_102, d_89, d_88, d_91, d_90, d_93, d_92, d_95, d_94, d_17, d_19, d_18, d_20, d_23, d_22, d_15, d_48, d_53, d_52, d_55, d_54, d_41, d_43, d_42, d_45, d_44, d_47, d_46, d_33, d_32, d_35, d_34, d_37, d_36, d_38, d_25, d_27, d_26, d_29, d_28, d_31, d_30;
  int i_79, i_78, i_49, i_51, i_50, i_21, i_84, i_39, i_118, i_24, i_40, i_10;
  long l_11, l_82, l_80, l_81, l_12, l_13;
  ulong ul_0, ul_16, ul_14, ul_83, ul_2, ul_1;

  __global ulong *_frame = (__global ulong *) &_heap_base[_frame_base];

  // BLOCK 0
  ul_0  =  (ulong) _frame[3];
  ul_1  =  (ulong) _frame[4];
  ul_2  =  (ulong) _frame[9];
  __private double ul_3[1494001];
  __private double ul_4[1494001];
  __private double ul_5[1494001];
  __private double ul_6[1494001];
  __private double ul_7[1494001];
  __private double ul_8[1494001];
  __private double ul_9[1494001];
  // BLOCK 1 MERGES [0 5 ]
  i_10  =  0;
  for(;i_10 < 165;)
  {
    // BLOCK 2
    l_11  =  (long) i_10;
    l_12  =  l_11 << 3;
    l_13  =  l_12 + 24L;
    ul_14  =  ul_0 + l_13;
    d_15  =  *((__global double *) ul_14);
    ul_16  =  ul_1 + l_13;
    d_17  =  *((__global double *) ul_16);
    d_18  =  d_15 * 0.37699111843077515;
    d_19  =  native_cos(d_18);
    d_20  =  native_sin(d_18);
    i_21  =  get_global_id(1);
    // BLOCK 3 MERGES [2 4 ]
    d_22  =  d_20;
    d_23  =  d_19;
    i_24  =  i_21;
    for(;i_24 < 1494001;)
    {
      // BLOCK 4
      d_25  =  ul_3[i_24];
      d_26  =  d_22 + d_25;
      ul_3[i_24]  =  d_26;
      d_25  =  d_26;
      d_27  =  ul_4[i_24];
      d_28  =  d_23 + d_27;
      ul_4[i_24]  =  d_28;
      d_27  =  d_28;
      d_29  =  ul_5[i_24];
      d_30  =  fma(d_22, d_22, d_29);
      ul_5[i_24]  =  d_30;
      d_29  =  d_30;
      d_31  =  ul_6[i_24];
      d_32  =  fma(d_23, d_23, d_31);
      ul_6[i_24]  =  d_32;
      d_31  =  d_32;
      d_33  =  ul_7[i_24];
      d_34  =  fma(d_22, d_23, d_33);
      ul_7[i_24]  =  d_34;
      d_33  =  d_34;
      d_35  =  ul_8[i_24];
      d_36  =  fma(d_22, d_17, d_35);
      ul_8[i_24]  =  d_36;
      d_35  =  d_36;
      d_37  =  ul_9[i_24];
      d_38  =  fma(d_23, d_17, d_37);
      ul_9[i_24]  =  d_38;
      d_37  =  d_38;
      i_39  =  get_global_size(1);
      i_40  =  i_39 + i_24;
      d_41  =  d_15 * 6.283185307179587E-5;
      d_42  =  native_cos(d_41);
      d_43  =  d_42 * d_23;
      d_44  =  native_sin(d_41);
      d_45  =  d_44 * d_22;
      d_46  =  d_43 - d_45;
      d_47  =  d_44 * d_23;
      d_48  =  fma(d_42, d_22, d_47);
      d_22  =  d_48;
      d_23  =  d_46;
      i_24  =  i_40;
    }  // B4
    // BLOCK 5
    i_49  =  i_10 + 1;
    i_10  =  i_49;
  }  // B5
  // BLOCK 6
  i_50  =  get_global_id(0);
  // BLOCK 7 MERGES [6 11 ]
  i_51  =  i_50;
  for(;i_51 < 1494001;)
  {
    // BLOCK 8
    d_52  =  ul_6[i_51];
    d_53  =  ul_5[i_51];
    d_54  =  ul_7[i_51];
    d_55  =  ul_7[i_51];
    d_56  =  ul_4[i_51];
    d_57  =  ul_4[i_51];
    d_58  =  ul_5[i_51];
    d_59  =  ul_3[i_51];
    d_60  =  ul_3[i_51];
    d_61  =  ul_6[i_51];
    d_62  =  ul_3[i_51];
    d_63  =  ul_4[i_51];
    d_64  =  ul_7[i_51];
    d_65  =  d_62 * 2.0;
    d_66  =  d_65 * d_63;
    d_67  =  d_52 * d_53;
    d_68  =  d_54 * d_55;
    d_69  =  d_67 - d_68;
    d_70  =  d_69 * 165.0;
    d_71  =  d_56 * d_57;
    d_72  =  d_71 * d_58;
    d_73  =  d_70 - d_72;
    d_74  =  d_59 * d_60;
    d_75  =  d_74 * d_61;
    d_76  =  d_73 - d_75;
    d_77  =  fma(d_66, d_64, d_76);
    i_78  =  get_global_size(0);
    i_79  =  i_78 + i_51;
    l_80  =  (long) i_51;
    l_81  =  l_80 << 3;
    l_82  =  l_81 + 24L;
    ul_83  =  ul_2 + l_82;
    i_84  =  isless(d_77, 0.0);
    if(i_84 == 1)
    {
      // BLOCK 9
      *((__global double *) ul_83)  =  1.0;
    }  // B9
    else
    {
      // BLOCK 10
      d_85  =  ul_4[i_51];
      d_86  =  ul_8[i_51];
      d_87  =  ul_3[i_51];
      d_88  =  ul_9[i_51];
      d_89  =  ul_9[i_51];
      d_90  =  ul_5[i_51];
      d_91  =  ul_8[i_51];
      d_92  =  ul_7[i_51];
      d_93  =  ul_3[i_51];
      d_94  =  ul_8[i_51];
      d_95  =  ul_6[i_51];
      d_96  =  ul_9[i_51];
      d_97  =  ul_7[i_51];
      d_97  =  ul_7[i_51];
      d_97  =  ul_7[i_51];
      d_97  =  ul_7[i_51];
      d_98  =  ul_4[i_51];
      d_99  =  ul_9[i_51];
      d_100  =  ul_8[i_51];
      d_101  =  d_85 * d_86;
      d_102  =  d_87 * d_88;
      d_103  =  d_101 - d_102;
      d_104  =  d_89 * d_90;
      d_105  =  d_91 * d_92;
      d_106  =  d_104 - d_105;
      d_107  =  d_106 * 165.0;
      d_108  =  fma(d_103, d_93, d_107);
      d_109  =  d_94 * d_95;
      d_110  =  d_96 * d_97;
      d_111  =  d_109 - d_110;
      d_112  =  d_111 * 165.0;
      d_113  =  d_98 * d_103;
      d_114  =  d_112 - d_113;
      d_115  =  d_114 * d_100;
      d_116  =  fma(d_108, d_99, d_115);
      d_117  =  d_116 / d_77;
      *((__global double *) ul_83)  =  d_117;
    }  // B10
    // BLOCK 11 MERGES [9 10 ]
    i_118  =  i_79;
    i_51  =  i_118;
  }  // B11
  // BLOCK 12
  return;
}  //  kernel

[TornadoVM-OCL-JNI] ERROR : clCreateKernel -> Returned: -5
[TornadoVM-OCL-JNI] ERROR : clGetKernelInfo -> Returned: 4294967248
[TornadoVM-OCL-JNI] ERROR : clSetKernelArg -> Returned: 4294967248
[TornadoVM-OCL-JNI] ERROR : clSetKernelArg -> Returned: 4294967248
[TornadoVM-OCL-JNI] ERROR : clSetKernelArg -> Returned: 4294967248
[TornadoVM-OCL-JNI] ERROR : clSetKernelArg -> Returned: 4294967248
[TornadoVM-OCL-JNI] ERROR : clSetKernelArg -> Returned: 4294967248
[TornadoVM-OCL-JNI] ERROR : clEnqueueNDRangeKernel -> Returned: -48
[TornadoVM-OCL-JNI] ERROR : clWaitForEvents -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
[TornadoVM-OCL-JNI] ERROR : clGetEventProfilingInfo -> Returned: -58
{
    "PeriodSearch": {
        "TOTAL_BYTE_CODE_GENERATION": "7627479",
        "COPY_IN_TIME": "3906",
        "TOTAL_TASK_SCHEDULE_TIME": "1284552037",
        "TOTAL_KERNEL_TIME": "0",
        "COPY_OUT_TIME": "972347",
        "TOTAL_GRAAL_COMPILE_TIME": "88152834",
        "TOTAL_DRIVER_COMPILE_TIME": "343693118",
        "TOTAL_DISPATCH_DATA_TRANSFERS_TIME": "23902",
        "TOTAL_DISPATCH_KERNEL_TIME": "0",
        "TOTAL_COPY_IN_SIZE_BYTES": "5376",
        "TOTAL_COPY_OUT_SIZE_BYTES": "11952032",
        "PeriodSearch.t0": {
            "METHOD": "MethodLeastSquareGPU.computeGPU",
            "DEVICE_ID": "0:1",
            "DEVICE": "Intel(R) Iris(TM) Pro Graphics P580 [0x193a]",
            "TOTAL_COPY_IN_SIZE_BYTES": "80",
            "TASK_KERNEL_TIME": "0",
            "TASK_COMPILE_DRIVER_TIME": "343693118",
            "TASK_COMPILE_GRAAL_TIME": "88152834"
        }
    }
}
jjfumero commented 2 years ago

There are a few things here:

From the code being generated:

How to parallelize the Java application:

I suggest splitting the task (method) into several tasks. Each method runs a parallel loop. For the sequential part, it will need to run on the host side and call a parallel kernel per iteration. Having smaller methods (compute-kernels) has the consequence of having more space for private memory of large arrays.

Hope this helps.

yazun commented 2 years ago

Thanks for checking and for the hints!

After few iterations it's quite clear that this particular algorithm is a hard case - there too many intra-iteration dependencies between computations to disentangle them. It would be easier if some of the loops could be marked as non-parallelizable in the kernel code, but from what I understand there's no such control - i.e. one cannot specify that only the outer loop is parallelized when we have two loops? We have the code running now, but obviously it does not give proper results due to

private static void computeGPU(double[] cPhArr,double sPhArr[], double[] cTArr, double[] normObsTimes, double[] normObsValues,  double[] sumSx, double[] sumCx, double[] sumSx2, double[] sumCx2, double[] sumSxCx, double[] sumSxVal, double[] sumCxVal, double omega, double deltaOmega, double deltaEpsilon)
    {
        int nObservations = normObsTimes.length;
        int nScanFrequencies = cPhArr.length;

        // for each set of observation data 
        for (@Parallel int i = 0; i < nObservations; i++)   {
            final double obsValue = normObsValues[i];

            // for each frequency to test, increment the phase with the phase step
            for ( int j = 0; j < nScanFrequencies; j++) {
                sumSx[j] += sPhArr[j];
                sumCx[j] += cPhArr[j];
                sumSx2[j] += sPhArr[j]*sPhArr[j];
                sumCx2[j] += cPhArr[j]*cPhArr[j];
                sumSxCx[j] += cPhArr[j]*sPhArr[j];
                sumSxVal[j] += sPhArr[j]*obsValue;
                sumCxVal[j] += cPhArr[j]*obsValue;

            }
        }
...
TaskSchedule task = new TaskSchedule("PeriodSearch") 
            .streamIn(sumSx, sumCx, sumSx2, sumCx2, sumSxCx, sumSxVal, sumCxVal)
                .task("t0", MethodLeastSquareGPU::computeGPU, cPhArr,sPhArr, cTArr, normObsTimes, normObsValues,  sumSx, sumCx, sumSx2, sumCx2, sumSxCx, sumSxVal, sumCxVal, omega, deltaOmega, deltaEpsilon) 
                .streamOut(amplitudes,sumSx, sumCx, sumSx2, sumCx2, sumSxCx, sumSxVal, sumCxVal);
        task.execute();

will not give mathematically valid result. While not surprising I wonder if such cases could be intercepted by the kernel compiler so a warning/error is thrown?

jjfumero commented 2 years ago

Thank you for the suggestion. Yes I agree TornadoVM should be able to at least, throw the error and bailout. We annotate this for future work.

Meanwhile, keep in mind TornadoVM offers two APIs:

It might be a better fit for your usecase.

Hope this helps.

yazun commented 2 years ago

Yes, porting to Kernel Parallel API will be the next step now. Thanks!

jjfumero commented 2 years ago

@yazun , is this issue solved? Can this issue be closed?

yazun commented 2 years ago

yes, thanks!