NVIDIA / NeMo-Curator

Scalable toolkit for data curation
Apache License 2.0

[BUG] Better error/checks around input types being CPU/GPU #79

Open ayushdg opened 1 month ago

ayushdg commented 1 month ago

Describe the bug

Some modules in Curator support only CPU datasets, while others support only GPU datasets. Right now, if a user accidentally passes in the wrong dataset type, the resulting errors and stack traces are often misleading and give little insight into the actual source of the error.

There should be higher-level checks in place that verify the backend type up front and raise appropriate errors with suggestions on how to switch between backends.
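As a rough illustration of the kind of up-front check described above, here is a minimal sketch. The helper names `infer_backend` and `require_backend` are hypothetical and not part of Curator. It assumes the dataset wraps a Dask collection whose `_meta` attribute holds an empty pandas or cudf frame, and that the installed Dask version provides `DataFrame.to_backend` for the suggested conversion.

```python
def infer_backend(df):
    """Return "cudf" or "pandas" based on where the frame's type is defined.

    Hypothetical helper: inspects the type's module name so that cudf does
    not need to be imported (or even installed) to run the check.
    """
    # Dask collections carry an empty frame of the underlying type in `_meta`;
    # fall back to the object itself for plain (non-Dask) frames.
    meta = getattr(df, "_meta", df)
    root = type(meta).__module__.split(".")[0]
    if root in ("cudf", "pandas"):
        return root
    raise TypeError(f"Unrecognized dataframe type: {type(meta).__qualname__}")


def require_backend(df, expected, module_name):
    """Raise a readable error if `df` is not backed by `expected`."""
    actual = infer_backend(df)
    if actual != expected:
        raise TypeError(
            f"{module_name} only supports {expected}-backed datasets, "
            f"but received a {actual}-backed one. "
            f"Convert it first, e.g. df.to_backend('{expected}')."
        )
```

A CPU-only module such as the ftfy modifier could call `require_backend(dataset.df, "pandas", "UnicodeReformatter")` before applying its function, so the user sees one short message instead of a Numba compilation traceback.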

Steps/Code to reproduce bug

Calling the ftfy modifier on a GPU dataframe results in the following error:

Traceback (most recent call last):
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/indexed_frame.py", line 3457, in _apply
    kernel, retty = _compile_or_get(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/udf/utils.py", line 274, in _compile_or_get
    kernel, scalar_return_type = kernel_getter(frame, func, args)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/udf/scalar_function.py", line 55, in _get_scalar_kernel
    scalar_return_type = _get_udf_return_type(sr_type, func, args)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
    result = func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/udf/utils.py", line 94, in _get_udf_return_type
    ptx, output_type = cudautils.compile_udf(func, compile_sig)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/utils/cudautils.py", line 126, in compile_udf
    ptx_code, return_type = cuda.compile_ptx_for_current_device(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/cuda/compiler.py", line 351, in compile_ptx_for_current_device
    return compile_ptx(pyfunc, sig, debug=debug, lineinfo=lineinfo,
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/cuda/compiler.py", line 315, in compile_ptx
    cres = compile_cuda(pyfunc, return_type, args, debug=debug,
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/cuda/compiler.py", line 196, in compile_cuda
    cres = compiler.compile_extra(typingctx=typingctx,
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler.py", line 751, in compile_extra
    return pipeline.compile_extra(func)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler.py", line 445, in compile_extra
    return self._compile_bytecode()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler.py", line 513, in _compile_bytecode
    return self._compile_core()
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler.py", line 492, in _compile_core
    raise e
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler.py", line 479, in _compile_core
    pm.run(self.state)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 368, in run
    raise patched_exception
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 356, in run
    self._runPass(idx, pass_inst, state)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
    return func(*args, **kwargs)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 311, in _runPass
    mutated |= check(pss.run_pass, internal_state)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 273, in check
    mangled = func(compiler_state)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/typed_passes.py", line 112, in run_pass
    typemap, return_type, calltypes, errs = type_inference_stage(
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/typed_passes.py", line 93, in type_inference_stage
    errs = infer.propagate(raise_errors=raise_errors)
  File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/typeinfer.py", line 1091, in propagate
    raise errors[0]
numba.core.errors.TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Unknown attribute 'fix_text' of type Module(<module 'ftfy' from '/opt/conda/envs/rapids/lib/python3.10/site-packages/ftfy/__init__.py'>)

File "../../opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modifiers/unicode_reformatter.py", line 25:
    def modify_document(self, text):
        return ftfy.fix_text(text)
        ^

During: typing of get attribute at /opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modifiers/unicode_reformatter.py (25)

File "../../opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modifiers/unicode_reformatter.py", line 25:
    def modify_document(self, text):
        return ftfy.fix_text(text)

Expected behavior

Better checks and clearer error messages when a dataset of the wrong backend type is passed in.

In addition to these checks, a few examples showing how to transition between the CPU and GPU backends would be helpful, and the conversion method could perhaps be exposed on the high-level DocumentDataset class.
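One possible shape for exposing the conversion on the dataset class is sketched below. `DocumentDatasetSketch` is a hypothetical stand-in, not the real `DocumentDataset`; it assumes the dataset keeps its Dask DataFrame in a `df` attribute and that the installed Dask version provides `DataFrame.to_backend`.

```python
class DocumentDatasetSketch:
    """Hypothetical stand-in for DocumentDataset, wrapping a Dask DataFrame."""

    SUPPORTED_BACKENDS = ("pandas", "cudf")

    def __init__(self, df):
        self.df = df

    def to_backend(self, backend):
        """Return a new dataset whose underlying frame uses `backend`."""
        if backend not in self.SUPPORTED_BACKENDS:
            raise ValueError(
                f"Unknown backend {backend!r}; "
                f"expected one of {self.SUPPORTED_BACKENDS}."
            )
        # Delegate the actual CPU <-> GPU move to Dask's own conversion.
        return type(self)(self.df.to_backend(backend))
```

With something like this, a user who hits the error above could write `dataset = dataset.to_backend("pandas")` before running a CPU-only module, and `dataset.to_backend("cudf")` to move back to the GPU afterwards.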