GPU kernels in shared libraries loaded after context initialization not found

jglaser commented 4 years ago

When shared libraries containing GPU kernels are loaded (in Python, using import ..) after the GPU context has been initialized and kernels have already been launched, HIP is unaware of the newly loaded kernels. The following Python script (requiring HOOMD-blue (hip branch) and hoomd-benchmarks (next branch)) demonstrates this issue. It may not be a minimal reproducer, but it is as minimal as I can currently provide.

import signac
project = signac.get_project()
job = list(project.find_jobs({'benchmark': 'lj_liquid', 'n': 100}))[0]

sp = job.statepoint()

# uncommenting this line makes the error go away
# from hoomd import hpmc

with job:
    import hoomd
    from hoomd import md

    device = hoomd.device.GPU()
    c = hoomd.context.initialize(args='',device=device)
    system = hoomd.init.read_gsd(filename=job.fn('init.gsd'))
    nl = md.nlist.cell()
    lj = md.pair.lj(r_cut=3.0, nlist=nl)
    lj.pair_coeff.set('A', 'A', epsilon=1.0, sigma=1.0)

    md.integrate.mode_standard(dt=0.005)
    md.integrate.nvt(group=hoomd.group.all(), kT=1.2, tau=0.5)

    nl.set_params(r_buff=0.6, check_period=7)

    hoomd.run(1)

import signac
import numpy as np
import math

benchmark_name = 'dodecahedron'

# dodecahedron shape
phi = (1. + math.sqrt(5.))/2.
inv = 2./(1. + math.sqrt(5.))
points = [
          (-1,-1,-1),
          (-1,-1, 1),
          (-1, 1,-1),
          (-1, 1, 1),
          ( 1,-1,-1),
          ( 1,-1, 1),
          ( 1, 1,-1),
          ( 1, 1, 1),
          ( 0,-inv,-phi),
          ( 0,-inv, phi),
          ( 0, inv,-phi),
          ( 0, inv, phi),
          (-inv,-phi, 0),
          (-inv, phi, 0),
          ( inv,-phi, 0),
          ( inv, phi, 0),
          (-phi, 0,-inv),
          (-phi, 0, inv),
          ( phi, 0,-inv),
          ( phi, 0, inv)
         ]

V = 14.4721 # Mathematica
circ_r = np.max(np.linalg.norm(np.array(points), axis=1))

job = list(project.find_jobs({'benchmark': 'dodecahedron'}))[0]
sp = job.statepoint()

with job:
    import hoomd
    from hoomd import hpmc

    device = hoomd.device.GPU()
    c = hoomd.context.initialize(args='',device=device)
    system = hoomd.init.read_gsd(filename=job.fn('init.gsd'))

    # setup the MC integration
    mc = hpmc.integrate.convex_polyhedron(seed=10, d=0.3, a=0.26)
    mc.shape_param.set("A", vertices=points)

    hoomd.run(1)

I get the following error (after some output confirming that the first part of the script executes successfully):

notice(2): Group "all" created containing 125000 particles
** starting run **
Traceback (most recent call last):
  File "test.py", line 79, in <module>
    hoomd.run(1)
  File "/home/michigan/miniconda3/envs/py38/lib/python3.8/site-packages/hoomd/__init__.py", line 199, in run
RuntimeError: Invalid function passed to hipLaunchKernelGGL.

If I uncomment the line "from hoomd import hpmc" near the top of the script, i.e., load the hpmc library containing the additional kernel symbols before executing any other kernel, the error goes away.

I observed this behavior with a Vega 20 GPU on a custom HIP branch, but it should also be reproducible with HIP 2.10 or master.
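
The workaround described above (importing every kernel-containing module before the first kernel launch) could be sketched roughly as follows. This is only an illustration: the helper name preload_modules and the GPU_MODULES list are hypothetical and not part of HOOMD-blue, and the assumption that hoomd.md and hoomd.hpmc are the only relevant modules comes from this reproducer alone.

```python
import importlib


def preload_modules(module_names):
    """Import each named module now; return (loaded, missing) name lists."""
    loaded, missing = [], []
    for name in module_names:
        try:
            importlib.import_module(name)
        except ImportError:
            # Module is not installed in this environment; record and move on.
            missing.append(name)
        else:
            loaded.append(name)
    return loaded, missing


# Assumption: these are the subpackages whose shared libraries contain
# GPU kernels that the script launches later on.
GPU_MODULES = ["hoomd.md", "hoomd.hpmc"]

if __name__ == "__main__":
    loaded, missing = preload_modules(GPU_MODULES)
    print("preloaded:", loaded, "missing:", missing)
    # Only after this point create hoomd.device.GPU() and call hoomd.run().
```

Calling the helper at the very top of the script, before any device or context is created, has the same effect as uncommenting the import in the reproducer. It does not address the underlying problem of libraries loaded after the first launch.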

nartmada commented 6 months ago

Hi @jglaser, apologies for the lack of response. Could you confirm whether this issue has been fixed with the latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks.

nartmada commented 6 months ago

Closing the issue as it is stale and there has been no response from @jglaser. Please re-open if this issue still exists with the latest ROCm 6.0.2 (HIP 6.0.32831). Thanks.

ROCm / HIP

GPU kernels in shared libraries loaded after context initialization not found #1741