ptxas/nvcc build failure on CUDA >= 9.2, related to templates/templated functions

antmatyjajo commented 5 years ago

When trying to build something that uses links.cuh (examples: intercalation, intercalation_w_gradient, sorting_prot; tests: test_links), nvcc fails at the first ptxas step with the following error:

$ nvcc --verbose -std=c++11 -arch=sm_61 intercalation.cu 

#[ ... cut the info specific to my machine ... ]

#$ ptxas -arch=sm_61 -m64  "/tmp/tmpxft_00002dbd_00000000-5_intercalation.ptx"  -o "/tmp/tmpxft_00002dbd_00000000-9_intercalation.sm_61.cubin" 
ptxas fatal   : Unresolved extern function '_Z12linear_forceIfEvPT_S1_'
# --error 0xff --

The error always appears this way, and only happens for the linear_force function.

It seems this is related to the definition of the link_forces template, specifically the default template parameter Link_force<Pt> force = linear_force<Pt> (see here).

Specifying the template parameters manually when binding link_forces (rather than relying on automatic generation through the default template parameter) works fine, e.g., in the case of test_links.cu (see here):

// Original code, fails to build:
auto forces = [&links](const float3* __restrict__ d_X, float3* d_dX) {
    return link_forces(links, d_X, d_dX);
};

// Specify template parameters, works fine:
auto forces = [&links](const float3* __restrict__ d_X, float3* d_dX) {
    return link_forces<float3, linear_force>(links, d_X, d_dX);
};

I think this is a problem with templates and nvcc, and not with Yalla. Here is a minimal-ish example showing the problem independent of Yalla.

For reference, I tried different CUDA versions and the code builds fine on CUDA 8.0.44, 9.0.176 and 9.1.85. It only fails for CUDA 9.2.88 and greater.

germannp commented 5 years ago

Thanks a lot, well done, makes me happy :-)

Might avoiding the default template parameter by having two templates in links.cuh do the trick for us? Something like:

template<typename Pt = float3>
void link_forces(Links& links, const Pt* __restrict__ d_X, Pt* d_dX)
{
    link<Pt, linear_force<Pt>><<<(links.get_d_n() + 32 - 1) / 32, 32>>>(
        d_X, d_dX, links.d_link, links.get_d_n(), links.strength);
}

template<typename Pt = float3, Link_force<Pt> force>
void link_forces(Links& links, const Pt* __restrict__ d_X, Pt* d_dX)
{
    link<Pt, force><<<(links.get_d_n() + 32 - 1) / 32, 32>>>(
        d_X, d_dX, links.d_link, links.get_d_n(), links.strength);
}

The default float3 also really irritates me, can we do without? Could you have a look, @antmatyjajo, please?

Also, we should add some tests making sure those things work with a Pt that is not flota3 and with custom forces. Maybe something like (sorry about my punch-card-style coding ;-)):

template<typename Pt>
__device__ void custom_force(const Pt* __restrict__ d_X, const int a,
    const int b, const float strength, Pt* d_dX)
{
    atomicAdd(&d_dX[a].w, -1);
    atomicAdd(&d_dX[b].w, 1);
}

const char* test_custom_force()
{
    Solution<float4, Tile_solver> points{2};
    Links links{4};
    auto forces = [&links](const float4* __restrict__ d_X, float4* d_dX) {
        return link_forces(links, d_X, d_dX);
    };
    auto custom_forces = [&links](const float4* __restrict__ d_X, float4* d_dX) {
        return link_forces<custom_force>(links, d_X, d_dX);
    };

    // clang-format off
    points.h_X[0].x = 1;  points.h_X[0].y = 1;  points.h_X[0].z = 0; points.h_X[0].w = 1;
    points.h_X[1].x = 1;  points.h_X[1].y = -1; points.h_X[1].z = 0; points.h_X[1].w = -1;
    links.h_link[0].a = 0; links.h_link[0].b = 1;
    // clang-format on
    points.copy_to_device();
    links.copy_to_device();

    auto dt = 0.1;
    points.take_step<no_pw_int>(dt, forces);
    points.take_step<no_pw_int>(dt, custom_forces);

    points.copy_to_host();
    MU_ASSERT("Not close in x", isclose(points.h_X[0].x - points.h_X[1].x, 0));
    MU_ASSERT("Not close in y", isclose(points.h_X[1].y - points.h_X[2].y, 2 - 2 * dt * links.strength));
    MU_ASSERT("Not close in z", isclose(points.h_X[2].z - points.h_X[3].z, 0));
    MU_ASSERT("Not close in w", isclose(points.h_X[2].w - points.h_X[3].w, 2 - 2 * dt));

    return NULL;
}

Could you try to get that to work, @mmarinriera, please?

antmatyjajo commented 5 years ago

OK, thanks for your suggestions, all working now :D

Might avoiding the default template parameter by having two templates in links.cuh do the trick for us? Something like: [...] The default float3 also really irritates me, can we do without?

Implemented these in a6d2a8c45ee6d7f568c7fbf94a57f69830153342 Default float3 has been removed, type inference seems to propagate correctly.

Also, we should add some tests making sure those things work with a Pt that is not flota3 and with custom forces.

Implemented your code with some minor changes (indexing in the assertions to account for only n=2 points, changing no_pw_int function to use float4) in fca6bd7adf8dc7375fc8fd189fac5e5a9ce19576

All compile and run on Ubuntu 18.04, cuda 9.2.148, gcc 7.3.0

Also compiled with cuda 8 through 10 ok on the cluster, though unfortunately I haven't had time to run everything in that environment.

germannp / yalla

ptxas/nvcc build failure on CUDA >= 9.2, related to templates/templated functions #17