devitocodes / devito

DSL and compiler framework for automated finite-differences and stencil computation
http://www.devitoproject.org

Why is GPU implementation significantly slower than CPU? #2420

Closed. jinshanmu closed this issue 4 months ago.

jinshanmu commented 4 months ago

I was trying the GPU example script:

from devito import *
import numpy as np
import matplotlib.pyplot as plt

# 100x100 grid; save all 200 timesteps of u.
nx, ny = 100, 100
grid = Grid(shape=(nx, ny))

u = TimeFunction(name='u', grid=grid, space_order=2, save=200)
c = Constant(name='c')

# Diffusion equation: du/dt = c * laplacian(u).
eqn = Eq(u.dt, c * u.laplace)

# Rearrange into an explicit update for the next timestep.
step = Eq(u.forward, solve(eqn, u.forward))

# Initial condition: an annulus around the centre of the domain.
xx, yy = np.meshgrid(np.linspace(0., 1., nx, dtype=np.float32),
                     np.linspace(0., 1., ny, dtype=np.float32))
r = (xx - .5) ** 2. + (yy - .5) ** 2.
u.data[0, np.logical_and(.05 <= r, r <= .1)] = 1.

op = Operator([step])

stats = op.apply(dt=5e-05, c=.5)

# Plot five snapshots, 40 timesteps apart.
plt.rcParams['figure.figsize'] = (20, 20)
for i in range(1, 6):
    plt.subplot(1, 6, i)
    plt.imshow(u.data[(i - 1) * 40])
plt.show()
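For reference, one way to confirm what an Operator compiles to, if I understand Devito correctly, is simply to print it: an Operator's string representation is the generated C code, so any OpenMP/OpenACC offloading pragmas are visible directly. A minimal check, assuming default Devito behaviour:

# Diagnostic sketch: an Operator's string form is the generated C
# code; inspecting it shows whether the GPU build actually emitted
# offloading pragmas.
print(op)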

The CPU version op = Operator([step]) returned

Operator Kernel ran in 0.01 s

However, the GPU version op = Operator([step], platform='nvidiaX', opt=('advanced', {'gpu-fit': u})) returned

Operator Kernel ran in 4.74 s
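To see where the 4.74 s goes (JIT compilation vs. data movement vs. kernel sections), I believe Devito can report a per-section performance breakdown at a more verbose log level. A minimal sketch, assuming the `log-level` configuration key; this would need to be set before calling op.apply():

from devito import configuration

# Assumed: verbose logging makes op.apply() report per-section
# timings rather than a single kernel time.
configuration['log-level'] = 'DEBUG'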

My CPU is an Intel Xeon Gold 6133 (80 logical cores). My GPU is an NVIDIA GeForce RTX 4080 with CUDA 11.8 and NVIDIA HPC SDK 22.11, which works fine for other programs (e.g., PyTorch).
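One hypothesis worth testing is that a 100x100 grid is simply too small to keep a GPU busy, so kernel-launch and host-device transfer overheads dominate the measurement. Below is a rough size sweep under that assumption; the `summary.values()` / `.time` access follows my reading of Devito's PerformanceSummary and may differ across versions:

from devito import Grid, TimeFunction, Constant, Eq, solve, Operator

# Hypothetical experiment: rebuild the problem at growing grid sizes
# and time the GPU run. If overhead is the culprit, the GPU should
# close the gap with, and eventually beat, the CPU as n grows.
c = Constant(name='c')
for n in (128, 512, 2048):
    grid = Grid(shape=(n, n))
    u = TimeFunction(name='u', grid=grid, space_order=2, save=200)
    step = Eq(u.forward, solve(Eq(u.dt, c * u.laplace), u.forward))
    op = Operator([step], platform='nvidiaX',
                  opt=('advanced', {'gpu-fit': u}))
    summary = op.apply(dt=5e-05, c=.5)
    # Each PerformanceSummary entry carries a per-section `time` field.
    print(n, sum(v.time for v in summary.values()))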

Any idea what is going on here?

Thank you in advance!

jinshanmu commented 4 months ago

I have moved this question to the Discussions section of devitocodes.