ddemidov / vexcl

VexCL is a C++ vector expression template library for OpenCL/CUDA/OpenMP
http://vexcl.readthedocs.org
MIT License
699 stars 82 forks source link

Limit on length of multivector expressions? #46

Closed ds283 closed 11 years ago

ds283 commented 11 years ago

Apologies if this is a dumb question.

I am trying to use VexCL with odeint to solve a set of ODEs for different parameter values. The ODEs are produced by another program which takes an algebraic expression as input, performs some symbolic manipulations, and writes out the ODEs as C++ which is then compiled. Because they are produced automatically, the symbolic expressions specifying the system of ODEs can become lengthy.

The specific example causing trouble is (all one expression)

dxdt(3) =  3.0*( 3.0*sin(1.0/(M_chi*M_chi*M_chi)*(chi*chi*chi))*((__Mp*__Mp)*
(__Mp*__Mp))/(M_chi*M_chi*M_chi)*(chi*chi)+-100.0*(M_chi*M_chi)*chi)*
(__Mp*__Mp)/( 50.0*(M_phi*M_phi)*(phi*phi)+50.0*(__dphi*__dphi)+50.0*
(__dchi*__dchi)+((__Mp*__Mp)*(__Mp*__Mp))*cos(1.0/(M_chi*M_chi*M_chi)*
(chi*chi*chi))+50.0*(M_chi*M_chi)*(chi*chi))+3.0*__dchi*( 1.0/( 6.0*( 50.0*
(M_phi*M_phi)*(phi*phi)+((__Mp*__Mp)*(__Mp*__Mp))*cos(1.0/(M_chi*M_chi*M_chi)*
(chi*chi*chi))+50.0*(M_chi*M_chi)*(chi*chi))*(__Mp*__Mp)/( 50.0*(M_phi*M_phi)*
(phi*phi)+50.0*(__dphi*__dphi)+50.0*(__dchi*__dchi)+((__Mp*__Mp)*
(__Mp*__Mp))*cos(1.0/(M_chi*M_chi*M_chi)*(chi*chi*chi))+50.0*(M_chi*M_chi)*
(chi*chi))+(__dphi*__dphi)+(__dchi*__dchi))*( (__dphi*__dphi)+(__dchi*__dchi))-1.0);

where phi, chi, __dphi and d_chi are macros resolving to the components of a multivector x of type vex::multivector<double> representing the current state, and M_phi, M_chi and __Mp are double quantities held on the host.

#define phi (x(0))
#define chi (x(1))
#define __dphi (x(2))
#define __dchi (x(3))

This compiles fine, but the OpenCL kernel representing the right-hand side fails when executed

libc++abi.dylib: terminate called throwing an exception

I am using Apple LLVM 4.2 on OS X and an up-to-date version of VexCL.

If I change this by breaking the right-hand side in two (but making no other changes), to give

dxdt(3) =  3.0*( 3.0*sin(1.0/(M_chi*M_chi*M_chi)*(chi*chi*chi))*((__Mp*__Mp)*
(__Mp*__Mp))/(M_chi*M_chi*M_chi)*(chi*chi)+-100.0*(M_chi*M_chi)*chi)*
(__Mp*__Mp)/( 50.0*(M_phi*M_phi)*(phi*phi)+50.0*(__dphi*__dphi)+50.0*
(__dchi*__dchi)+((__Mp*__Mp)*(__Mp*__Mp))*cos(1.0/(M_chi*M_chi*M_chi)*
(chi*chi*chi))+50.0*(M_chi*M_chi)*(chi*chi));

dxdt(3) += +3.0*__dchi*( 1.0/( 6.0*( 50.0*(M_phi*M_phi)*(phi*phi)+((__Mp*__Mp)*
(__Mp*__Mp))*cos(1.0/(M_chi*M_chi*M_chi)*(chi*chi*chi))+50.0*(M_chi*M_chi)*
(chi*chi))*(__Mp*__Mp)/( 50.0*(M_phi*M_phi)*(phi*phi)+50.0*(__dphi*__dphi)+50.0*
(__dchi*__dchi)+((__Mp*__Mp)*(__Mp*__Mp))*cos(1.0/(M_chi*M_chi*M_chi)*
(chi*chi*chi))+50.0*(M_chi*M_chi)*(chi*chi))+(__dphi*__dphi)+(__dchi*__dchi))*( 
(__dphi*__dphi)+(__dchi*__dchi))-1.0);

then the resulting code works as expected and produces a correct result.

I presume that there is a limit to the size of expressions which can be handled correctly by the vex::multivector expression templates. Is this correct? If so, is there any way to predict in advance what is the longest expression which can be handled?

ddemidov commented 11 years ago

This could be a limitation of OpenCL implementation. You could try to use tagged terminals for x, M_phi, M_chi, __Mp in order to reduce the resulting expression tree size and number of input parameters. This should also improve performance, so I would recommend doing this anyway:

#define phi (vex::tag<0>(x(0)))
#define chi (vex::tag<1>(x(1)))
#define __dphi (vex::tag<2>(x(2)))
#define __dchi (vex::tag<3>(x(3)))
auto t_phi = vex::tag<4>(M_phi);
auto t_chi = vex::tag<5>(M_chi);
auto t_p   = vex::tag<6>(__Mp);

auto one   = vex::tag<7>(1.0);
auto three = vex::tag<8>(3.0);
auto fifty = vex::tag<9>(50.0);

dxdt(3) =  three*( three*sin(one/(t_chi*t_chi*t_chi)*(chi*chi*chi))*((t_p*t_p)*
(t_p*t_p))/(t_chi*t_chi*t_chi)*(chi*chi)+-100.0*(t_chi*t_chi)*chi)*
(t_p*t_p)/( fifty*(t_phi*t_phi)*(phi*phi)+fifty*(__dphi*__dphi)+fifty*
(__dchi*__dchi)+((t_p*t_p)*(t_p*t_p))*cos(one/(t_chi*t_chi*t_chi)*
(chi*chi*chi))+fifty*(t_chi*t_chi)*(chi*chi))+three*__dchi*( one/( 6.0*( fifty*
(t_phi*t_phi)*(phi*phi)+((t_p*t_p)*(t_p*t_p))*cos(one/(t_chi*t_chi*t_chi)*
(chi*chi*chi))+fifty*(t_chi*t_chi)*(chi*chi))*(t_p*t_p)/( fifty*(t_phi*t_phi)*
(phi*phi)+fifty*(__dphi*__dphi)+fifty*(__dchi*__dchi)+((t_p*t_p)*
(t_p*t_p))*cos(one/(t_chi*t_chi*t_chi)*(chi*chi*chi))+fifty*(t_chi*t_chi)*
(chi*chi))+(__dphi*__dphi)+(__dchi*__dchi))*( (__dphi*__dphi)+(__dchi*__dchi))-one);

You can check the effect of this by defining VEXCL_SHOW_KERNELS macro and comparing the generated sources for the OpenCL kernel in question.

ddemidov commented 11 years ago

I could not reproduce the issue on my Linux machine for either of the variants. But I noticed that generated kernel name may become quite long:

kernel void plus_divides_multiplies_multiplies_term_plus_multiplies_divides_multiplies_multiplies_term_sin_multiplies_divides_term_multiplies_multiplies_term_term_term_multiplies_multiplies_term_term_term_multiplies_multiplies_term_term_multiplies_term_term_multiplies_multiplies_term_term_term_multiplies_term_term_multiplies_multiplies_term_multiplies_term_term_term_multiplies_term_term_plus_plus_plus_plus_multiplies_multiplies_term_multiplies_term_term_multiplies_term_term_multiplies_term_multiplies_term_term_multiplies_term_multiplies_term_term_multiplies_multiplies_multiplies_term_term_multiplies_term_term_cos_multiplies_divides_term_multiplies_multiplies_term_term_term_multiplies_multiplies_term_term_term_multiplies_multiplies_term_multiplies_term_term_multiplies_term_term_multiplies_multiplies_term_term_minus_multiplies_divides_term_plus_plus_divides_multiplies_multiplies_term_plus_plus_multiplies_multiplies_term_multiplies_term_term_multiplies_term_term_multiplies_multiplies_multiplies_term_term_multiplies_term_term_cos_multiplies_divides_term_multiplies_multiplies_term_term_term_multiplies_multiplies_term_term_term_multiplies_multiplies_term_multiplies_term_term_multiplies_term_term_multiplies_term_term_plus_plus_plus_plus_multiplies_multiplies_term_multiplies_term_term_multiplies_term_term_multiplies_term_multiplies_term_term_multiplies_term_multiplies_term_term_multiplies_multiplies_multiplies_term_term_multiplies_term_term_cos_multiplies_divides_term_multiplies_multiplies_term_term_term_multiplies_multiplies_term_term_term_multiplies_multiplies_term_multiplies_term_term_multiplies_term_term_multiplies_term_term_multiplies_term_term_plus_multiplies_term_term_multiplies_term_term_term_(
        ulong n,
        global double * prm_1_1,
        double prm_1_2,
        double prm_1_4,
        double prm_1_5,
        global double * prm_1_8,
        double prm_1_11,
        double prm_1_20,
        double prm_1_26,
        double prm_1_27,
        global double * prm_1_29,
        global double * prm_1_32,
        global double * prm_1_35,
        double prm_1_56
)
{
        for(size_t idx = get_global_id(0); idx < n; idx += get_global_size(0)) {
                prm_1_1[idx] = ( ( ( ( prm_1_2 * ( ( ( ( ( prm_1_2 * sin( ( ( prm_1_4 / ( ( prm_1_5 * prm_1_5 ) * prm_1_5 ) ) * ( ( prm_1_8[idx] * prm_1_8[idx] ) * prm_1_8[idx] ) ) ) ) * ( ( prm_1_11 * prm_1_11 ) * ( prm_1_11 * prm_1_11 ) ) ) / ( ( prm_1_5 * prm_1_5 ) * prm_1_5 ) ) * ( prm_1_8[idx] * prm_1_8[idx] ) ) + ( ( prm_1_20 * ( prm_1_5 * prm_1_5 ) ) * prm_1_8[idx] ) ) ) * ( prm_1_11 * prm_1_11 ) ) / ( ( ( ( ( ( prm_1_26 * ( prm_1_27 * prm_1_27 ) ) * ( prm_1_29[idx] * prm_1_29[idx] ) ) + ( prm_1_26 * ( prm_1_32[idx] * prm_1_32[idx] ) ) ) + ( prm_1_26 * ( prm_1_35[idx] * prm_1_35[idx] ) ) ) + ( ( ( prm_1_11 * prm_1_11 ) * ( prm_1_11 * prm_1_11 ) ) * cos( ( ( prm_1_4 / ( ( prm_1_5 * prm_1_5 ) * prm_1_5 ) ) * ( ( prm_1_8[idx] * prm_1_8[idx] ) * prm_1_8[idx] ) ) ) ) ) + ( ( prm_1_26 * ( prm_1_5 * prm_1_5 ) ) * ( prm_1_8[idx] * prm_1_8[idx] ) ) ) ) + ( ( prm_1_2 * prm_1_35[idx] ) * ( ( ( prm_1_4 / ( ( ( ( ( prm_1_56 * ( ( ( ( prm_1_26 * ( prm_1_27 * prm_1_27 ) ) * ( prm_1_29[idx] * prm_1_29[idx] ) ) + ( ( ( prm_1_11 * prm_1_11 ) * ( prm_1_11 * prm_1_11 ) ) * cos( ( ( prm_1_4 / ( ( prm_1_5 * prm_1_5 ) * prm_1_5 ) ) * ( ( prm_1_8[idx] * prm_1_8[idx] ) * prm_1_8[idx] ) ) ) ) ) + ( ( prm_1_26 * ( prm_1_5 * prm_1_5 ) ) * ( prm_1_8[idx] * prm_1_8[idx] ) ) ) ) * ( prm_1_11 * prm_1_11 ) ) / ( ( ( ( ( ( prm_1_26 * ( prm_1_27 * prm_1_27 ) ) * ( prm_1_29[idx] * prm_1_29[idx] ) ) + ( prm_1_26 * ( prm_1_32[idx] * prm_1_32[idx] ) ) ) + ( prm_1_26 * ( prm_1_35[idx] * prm_1_35[idx] ) ) ) + ( ( ( prm_1_11 * prm_1_11 ) * ( prm_1_11 * prm_1_11 ) ) * cos( ( ( prm_1_4 / ( ( prm_1_5 * prm_1_5 ) * prm_1_5 ) ) * ( ( prm_1_8[idx] * prm_1_8[idx] ) * prm_1_8[idx] ) ) ) ) ) + ( ( prm_1_26 * ( prm_1_5 * prm_1_5 ) ) * ( prm_1_8[idx] * prm_1_8[idx] ) ) ) ) + ( prm_1_32[idx] * prm_1_32[idx] ) ) + ( prm_1_35[idx] * prm_1_35[idx] ) ) ) * ( ( prm_1_32[idx] * prm_1_32[idx] ) + ( prm_1_35[idx] * prm_1_35[idx] ) ) ) - prm_1_4 ) ) );
        }
}

I wonder if this could be the reason for the crash with Apple implementation. In case terminal tagging does not help, could you please try 547299a? It assigns same name to all autogenerated kernels.

ds283 commented 11 years ago

Thanks for the very helpful responses.

Adding tags did not help, although it did reduce the size of the generated kernels. But switching to a fixed kernel name apparently solves the problem - I now get correct output.

ddemidov commented 11 years ago

I've pushed the constant names commit to the master branch. I'll do some cleanup later today, but it should be usable right away.