devitocodes / devito

DSL and compiler framework for automated finite-differences and stencil computation
http://www.devitoproject.org

Why is GPU implementation significantly slower than CPU? #2420

Closed. jinshanmu closed this issue 4 months ago.

jinshanmu commented 4 months ago

I was trying the GPU example script:

from devito import *
import numpy as np
import matplotlib.pyplot as plt

# 100x100 grid; save all 200 timesteps of u.
nx, ny = 100, 100
grid = Grid(shape=(nx, ny))

u = TimeFunction(name='u', grid=grid, space_order=2, save=200)
c = Constant(name='c')

# Diffusion equation: du/dt = c * laplacian(u).
eqn = Eq(u.dt, c * u.laplace)

# Rearrange into an explicit update for the next timestep.
step = Eq(u.forward, solve(eqn, u.forward))

# Initial condition: an annulus around the centre of the domain.
xx, yy = np.meshgrid(np.linspace(0., 1., nx, dtype=np.float32),
                     np.linspace(0., 1., ny, dtype=np.float32))
r = (xx - .5) ** 2. + (yy - .5) ** 2.
u.data[0, np.logical_and(.05 <= r, r <= .1)] = 1.

op = Operator([step])

stats = op.apply(dt=5e-05, c=.5)

# Plot five snapshots, 40 timesteps apart.
plt.rcParams['figure.figsize'] = (20, 20)
for i in range(1, 6):
    plt.subplot(1, 6, i)
    plt.imshow(u.data[(i - 1) * 40])
plt.show()
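For reference, one way to confirm what an Operator compiles to, if I understand Devito correctly, is simply to print it: an Operator's string representation is the generated C code, so any OpenMP/OpenACC offloading pragmas are visible directly. A minimal check, assuming default Devito behaviour:

# Diagnostic sketch: an Operator's string form is the generated C
# code; inspecting it shows whether the GPU build actually emitted
# offloading pragmas.
print(op)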

The CPU version op = Operator([step]) returned

Operator Kernel ran in 0.01 s

However, the GPU version op = Operator([step], platform='nvidiaX', opt=('advanced', {'gpu-fit': u})) returned

Operator Kernel ran in 4.74 s
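To see where the 4.74 s goes (JIT compilation vs. data movement vs. kernel sections), I believe Devito can report a per-section performance breakdown at a more verbose log level. A minimal sketch, assuming the `log-level` configuration key; this would need to be set before calling op.apply():

from devito import configuration

# Assumed: verbose logging makes op.apply() report per-section
# timings rather than a single kernel time.
configuration['log-level'] = 'DEBUG'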

My CPU is an Intel Xeon Gold 6133 (80 logical cores). My GPU is an NVIDIA GeForce RTX 4080 with CUDA 11.8 and NVIDIA HPC SDK 22.11, which works fine for other programs (e.g., PyTorch).
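One hypothesis worth testing is that a 100x100 grid is simply too small to keep a GPU busy, so kernel-launch and host-device transfer overheads dominate the measurement. Below is a rough size sweep under that assumption; the `summary.values()` / `.time` access follows my reading of Devito's PerformanceSummary and may differ across versions:

from devito import Grid, TimeFunction, Constant, Eq, solve, Operator

# Hypothetical experiment: rebuild the problem at growing grid sizes
# and time the GPU run. If overhead is the culprit, the GPU should
# close the gap with, and eventually beat, the CPU as n grows.
c = Constant(name='c')
for n in (128, 512, 2048):
    grid = Grid(shape=(n, n))
    u = TimeFunction(name='u', grid=grid, space_order=2, save=200)
    step = Eq(u.forward, solve(Eq(u.dt, c * u.laplace), u.forward))
    op = Operator([step], platform='nvidiaX',
                  opt=('advanced', {'gpu-fit': u}))
    summary = op.apply(dt=5e-05, c=.5)
    # Each PerformanceSummary entry carries a per-section `time` field.
    print(n, sum(v.time for v in summary.values()))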

Any idea what is going on here?

Thank you in advance!

jinshanmu commented 4 months ago

I have moved this question to the Discussions section of devitocodes.