NVIDIA / numba-cuda

BSD 2-Clause "Simplified" License

Use dynamic address of global and closure variables #15

Open dlee992 opened 1 month ago

dlee992 commented 1 month ago

Fixes https://github.com/numba/numba/issues/9084.

Subtasks:

Subtasks that need discussion before writing the code:

Subtasks that should be handled in Numba core:

dlee992 commented 1 month ago

Following the Numba office hours discussion, I will treat global/closure variables as read-only in this PR.

dlee992 commented 1 month ago

Ha, an interesting finding: largely changing the behaviour of make_constant_array in the CUDA target breaks 5 tests for our original const array support,

e.g., numba.cuda.tests.cudadrv.test_linker.TestLinker.test_get_const_mem_size, since make_constant_array also services const array creation.

FAIL: test_get_const_mem_size (numba.cuda.tests.cudadrv.test_linker.TestLinker.test_get_const_mem_size)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/dli/numba-cuda/numba_cuda/numba/cuda/tests/cudadrv/test_linker.py", line 262, in test_get_const_mem_size
    self.assertGreaterEqual(const_mem_size, CONST1D.nbytes)
AssertionError: 0 not greater than or equal to 80

So we need to keep the legacy behaviour (copying into and creating a CUDA const array when the user asks for one), and use a flag to mark whether a call is that intentional legacy behaviour or comes from the const.array_like series.
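The flag idea could look roughly like this — a minimal pure-Python sketch, where the flag name, return values, and simplified signature are all illustrative, not the actual `make_constant_array` API:

```python
def make_constant_array(builder, typ, ary, *, want_const_memory=False):
    """Illustrative stand-in for the CUDA target's make_constant_array."""
    if want_const_memory:
        # Legacy behaviour: copy the host array into CUDA constant memory,
        # which is what const.array_like() callers expect.
        return ("const_memory", ary)
    # New behaviour for plain globals/closures: reference the data through
    # a device-global address instead of copying it.
    return ("global_address", ary)

# const.array_like-style callers opt in to constant memory explicitly:
assert make_constant_array(None, "f8[:]", [1.0], want_const_memory=True)[0] == "const_memory"
# Global/closure variables take the new read-only global path:
assert make_constant_array(None, "f8[:]", [1.0])[0] == "global_address"
```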

dlee992 commented 1 month ago

It seems Numba core treats ir.Const, ir.Global, and ir.FreeVar in the same way:

    def lower_assign(self, ty, inst):
        value = inst.value
        # In nopython mode, closure vars are frozen like globals
        if isinstance(value, (ir.Const, ir.Global, ir.FreeVar)):
            res = self.context.get_constant_generic(self.builder, ty,
                                                    value.value)
            self.incref(ty, res)
            return res

I think we need to add a strategy to distinguish globals from constants. A direct way is to add a get_global_generic and register different methods for different targets, but this could need a bunch of code changes/additions in Numba core. The branch above would then change to something like:

        if isinstance(value, ir.Const):
            res = self.context.get_constant_generic(self.builder, ty,
                                                    value.value)
            self.incref(ty, res)
            return res

        if isinstance(value, (ir.Global, ir.FreeVar)):
            res = self.context.get_global_generic(self.builder, ty,
                                                    value.value)
            self.incref(ty, res)
            return res
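For illustration, the per-target part of this idea could be mocked in pure Python — the registry, decorator, and function names below are all hypothetical, not Numba API:

```python
# Hypothetical per-target registry for lowering globals/closure variables.
_global_lowerings = {}

def register_global_lowering(target):
    """Register a get_global_generic implementation for one target."""
    def decorator(fn):
        _global_lowerings[target] = fn
        return fn
    return decorator

@register_global_lowering("cpu")
def cpu_get_global_generic(ty, value):
    # CPU keeps today's behaviour: globals are frozen into constants.
    return ("constant", value)

@register_global_lowering("cuda")
def cuda_get_global_generic(ty, value):
    # CUDA lowers globals to a dynamic device-global address instead.
    return ("device_global_address", value)

def get_global_generic(target, ty, value):
    # Dispatch to whichever implementation the target registered.
    return _global_lowerings[target](ty, value)

assert get_global_generic("cpu", "f8[:]", 1)[0] == "constant"
assert get_global_generic("cuda", "f8[:]", 1)[0] == "device_global_address"
```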

Do you have a better idea? @gmarkall

gmarkall commented 1 month ago

Do you have a better idea? @gmarkall

I'm thinking maybe we can get around this by editing the lowering of constants for arrays in the CUDA target. Presently all globals that are arrays will be lowered by the lowering in arrayobj.py: https://github.com/numba/numba/blob/fca98287a8fc9da1825ec4e6b570ef8eeaf4605c/numba/np/arrayobj.py#L3282-L3287

Perhaps if we register a @lower_constant(types.Array) with the CUDA target we can put the implementation there instead, and leave make_constant_array unchanged in the CUDA target.

To ensure that we then still get a constant array in the CUDA target when we call cuda.const.array_like(), we'd have to then change its lowering so that it calls make_constant_array() instead of doing nothing as it presently does:

https://github.com/NVIDIA/numba-cuda/blob/91f92d958a653e2e42f2705ebc46eaaf22b3a4e3/numba_cuda/numba/cuda/cudaimpl.py#L74-L78

What do you think? Could this work?

dlee992 commented 1 month ago

Presently all globals that are arrays will be lowered by the lowering in arrayobj.py

Agreed.

Perhaps if we register a @lower_constant(types.Array) with the CUDA target

  1. I checked the definition and usage of lower_constant, which doesn't provide a way to specify a target.
  2. If we register a new @lower_constant(types.Array), Numba appends it at the end of the registry list.
  3. But when Numba tries to find a lower_constant definition for types.Array, it finds the old one first and uses it directly.
  4. If we want the new definition to take effect, we have to pay attention to the import/definition order. That is too tricky.

In theory, globals and constants should each have their own lowering path. But for now they share the same one, due to Numba's legacy design choices.

What do you think?

gmarkall commented 1 month ago

I think you can do

from numba.cuda.cudaimpl import lower_constant

to get a decorator that registers lowerings for the CUDA target context only, which is how I believe we avoid these clashes when using the low-level API at present.
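The clash-avoidance idea — each target keeping its own registry, so same-type registrations don't collide — can be mocked in pure Python (the class and names are illustrative, not Numba's actual Registry API):

```python
class Registry:
    """Illustrative stand-in for a per-target lowering registry."""
    def __init__(self):
        self.constants = {}

    def lower_constant(self, ty):
        def decorator(fn):
            self.constants[ty] = fn
            return fn
        return decorator

builtin_registry = Registry()   # core (CPU-shared) lowerings
cuda_registry = Registry()      # CUDA-only lowerings

@builtin_registry.lower_constant("Array")
def default_constant_array(ty, pyval):
    return "embed as a frozen constant"

@cuda_registry.lower_constant("Array")
def cuda_constant_array(ty, pyval):
    return "reference via device-global address"

# Because the registrations live in separate registries, the CUDA target
# can install its own implementation without clobbering the core one:
assert builtin_registry.constants["Array"]("Array", None) == "embed as a frozen constant"
assert cuda_registry.constants["Array"]("Array", None) == "reference via device-global address"
```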

dlee992 commented 1 month ago

Oh, yes! Thanks for pointing this out. I didn't notice CUDA creates a new Registry instance for itself. https://github.com/NVIDIA/numba-cuda/blob/91f92d958a653e2e42f2705ebc46eaaf22b3a4e3/numba_cuda/numba/cuda/cudaimpl.py#L19-L22

dlee992 commented 1 month ago

I added a new definition of @lower_constant(types.Array) in cudaimpl.py locally (not pushed yet), but right now CUDATargetContext loads the old definition from arrayobj.py first, which leaves the new definition unused.

Need to reorder these imports so the CUDA additional registries are installed before the Numba default registries. https://github.com/NVIDIA/numba-cuda/blob/91f92d958a653e2e42f2705ebc46eaaf22b3a4e3/numba_cuda/numba/cuda/target.py#L94-L106

Update: it seems reordering the imports and install_registry calls is still not enough.

dlee992 commented 1 month ago

It turns out the root cause is this function:

https://github.com/numba/numba/blob/fca98287a8fc9da1825ec4e6b570ef8eeaf4605c/numba/core/base.py#L56-L64

    def _select_compatible(self, sig):
        """
        Select all compatible signatures and their implementation.
        """
        out = {}
        for ver_sig, impl in self.versions:
            if self._match_arglist(ver_sig, sig):
                out[ver_sig] = impl
        return out

I added some prints during the loop and after the loop:

ver_sig: (<class 'numba.core.types.npytypes.Array'>,), module: numba.cuda.cudaimpl impl: <function constant_array at 0x14a5f40a62a0>
ver_sig: (<class 'numba.core.types.npytypes.Array'>,), module: numba.np.arrayobj impl: <function constant_array at 0x14a5f41eafc0>
out: {(<class 'numba.core.types.npytypes.Array'>,): <function constant_array at 0x14a5f41eafc0>}

So if the TargetContext has multiple lower_constant definitions for the same type signature, it chooses the last one in the registry. In the current registry order, that happens to be the default definition.

Need to reorder these imports to let CUDA additional registries be called before numba default registries.

On the contrary, we need to ensure the CUDA-specific registries are installed after the default registries, so that the CUDA-specific one is chosen.
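The last-one-wins behaviour can be reproduced in isolation: `_select_compatible` builds a plain dict keyed by signature, so a later entry in `versions` silently overwrites an earlier one with the same signature. A simplified stand-alone model (the signature-matching is reduced to equality for illustration):

```python
def select_compatible(versions, sig):
    # Simplified model of numba.core.base's _select_compatible: every
    # matching signature is written into a dict, so duplicate signatures
    # overwrite each other in iteration (registration) order.
    out = {}
    for ver_sig, impl in versions:
        if ver_sig == sig:  # stand-in for _match_arglist
            out[ver_sig] = impl
    return out

# Two implementations registered for the same signature, mirroring the
# printed output above:
versions = [
    (("Array",), "numba.cuda.cudaimpl.constant_array"),
    (("Array",), "numba.np.arrayobj.constant_array"),
]
out = select_compatible(versions, ("Array",))
# Only the implementation registered last survives:
assert out[("Array",)] == "numba.np.arrayobj.constant_array"
```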

dlee992 commented 1 month ago

If I apply this change locally, the registry finally works as expected and finds the CUDA-specific definition of @lower_constant(types.Array):

diff --git a/numba_cuda/numba/cuda/target.py b/numba_cuda/numba/cuda/target.py
index be9487c..fd8b2b0 100644
--- a/numba_cuda/numba/cuda/target.py
+++ b/numba_cuda/numba/cuda/target.py
@@ -10,6 +10,7 @@ from numba.core.base import BaseContext
 from numba.core.callconv import BaseCallConv, MinimalCallConv
 from numba.core.typing import cmathdecl
 from numba.core import datamodel
+from numba.core.imputils import builtin_registry

 from .cudadrv import nvvm
 from numba.cuda import codegen, nvvmutils, ufuncs
@@ -109,6 +110,7 @@ class CUDATargetContext(BaseContext):
         # fix for #8940
         from numba.np.unsafe import ndarray # noqa F401

+        self.install_registry(builtin_registry)
         self.install_registry(cudaimpl.registry)
         self.install_registry(cffiimpl.registry)
         self.install_registry(printimpl.registry)
@@ -117,6 +119,10 @@ class CUDATargetContext(BaseContext):
         self.install_registry(mathimpl.registry)
         self.install_registry(vector_types.impl_registry)

+    def refresh(self):
+        self.load_additional_registries()
+        self.typing_context.refresh()
+
     def codegen(self):
         return self._internal_codegen

This looks like a Numba bug in the lowering registry order between builtin_registry and cudaimpl.registry. @gmarkall What do you think?

gmarkall commented 1 month ago

This should be a Numba bug in the lower registry order between builtin_registry and cudaimpl.registry @gmarkall What do you think?

I think you're right - thanks for working out how the order of implementations is considered. Up until this point, I don't think anybody actually knew how Numba decided which implementation is used.

gmarkall commented 1 month ago

@dlee992 Would you like to push your local changes, and we'll see how they go on CI? Or do you still have local outstanding issues with the test suite?

copy-pr-bot[bot] commented 1 month ago

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

dlee992 commented 1 month ago

I pushed my workaround for taking care of the registry ordering.

However, I am unsure how to make make_constant_array work as before, since its last argument ary changes from a real Python ndarray to an ir.LoadInstr, which comes from the new @lower_constant(types.Array). I have to admit I'm not good at figuring out how to generate LLVM IR with the llvmlite interface. It seems we need to recover the underlying array from this ir.LoadInstr somehow. Perhaps I'd better read the LLVM IR reference docs again.

In theory, global and const should have their own ways for lowering. But now they use the same way since numba's legacy issue and design choice.

BTW, this kind of goes back to that issue. If we could choose different paths for ir.Global and cuda.const.array_like from the very beginning, it would simplify the lowering parts.