Describe the bug
Some modules in Curator only support working with CPU datasets, and others only support working on GPU ones.
Right now, if users accidentally pass in the wrong dataset type, the resulting errors and stack traces are often misleading and give little insight into the source of the error.
There should be more high-level checks in place that check the backend type beforehand and raise appropriate errors with suggestions on how to switch between backends.
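A minimal sketch of what such an up-front check could look like. Everything here is an assumption for illustration and not existing Curator code: the names `BackendMismatchError`, `detect_backend`, and `require_backend` are hypothetical, the backend is guessed from the package that defines the dataframe type, and the hint assumes the dataset wraps a Dask DataFrame where `to_backend` is available.

```python
class BackendMismatchError(TypeError):
    """Hypothetical error raised when a module gets an unsupported backend."""


def detect_backend(df) -> str:
    """Guess the dataframe backend from the package that defines its type.

    Dask collections carry an empty `_meta` frame of the underlying type, so
    that is inspected when present. Anything not defined by cudf is treated
    as "pandas", which is a simplification.
    """
    meta = getattr(df, "_meta", df)
    root_package = type(meta).__module__.split(".")[0]
    return "cudf" if root_package == "cudf" else "pandas"


def require_backend(df, expected: str, module_name: str) -> None:
    """Fail early, with a conversion hint, if `df` is not on `expected`."""
    actual = detect_backend(df)
    if actual != expected:
        raise BackendMismatchError(
            f"{module_name} only supports the '{expected}' backend, but the "
            f"dataset uses '{actual}'. Convert it first, for example with "
            f"df.to_backend('{expected}')."
        )
```

A CPU-only module could call `require_backend(dataset.df, "pandas", "ftfy modifier")` before applying its function, so the user sees a single readable error instead of a Numba compilation trace.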
Steps/Code to reproduce bug
Calling the ftfy modifier on a GPU dataframe results in the following error:
Traceback (most recent call last):
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/indexed_frame.py", line 3457, in _apply
kernel, retty = _compile_or_get(
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
result = func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/udf/utils.py", line 274, in _compile_or_get
kernel, scalar_return_type = kernel_getter(frame, func, args)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/udf/scalar_function.py", line 55, in _get_scalar_kernel
scalar_return_type = _get_udf_return_type(sr_type, func, args)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/nvtx/nvtx.py", line 116, in inner
result = func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/core/udf/utils.py", line 94, in _get_udf_return_type
ptx, output_type = cudautils.compile_udf(func, compile_sig)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/cudf/utils/cudautils.py", line 126, in compile_udf
ptx_code, return_type = cuda.compile_ptx_for_current_device(
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/cuda/compiler.py", line 351, in compile_ptx_for_current_device
return compile_ptx(pyfunc, sig, debug=debug, lineinfo=lineinfo,
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
return func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/cuda/compiler.py", line 315, in compile_ptx
cres = compile_cuda(pyfunc, return_type, args, debug=debug,
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
return func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/cuda/compiler.py", line 196, in compile_cuda
cres = compiler.compile_extra(typingctx=typingctx,
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler.py", line 751, in compile_extra
return pipeline.compile_extra(func)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler.py", line 445, in compile_extra
return self._compile_bytecode()
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler.py", line 513, in _compile_bytecode
return self._compile_core()
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler.py", line 492, in _compile_core
raise e
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler.py", line 479, in _compile_core
pm.run(self.state)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 368, in run
raise patched_exception
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 356, in run
self._runPass(idx, pass_inst, state)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_lock.py", line 35, in _acquire_compile_lock
return func(*args, **kwargs)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 311, in _runPass
mutated |= check(pss.run_pass, internal_state)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/compiler_machinery.py", line 273, in check
mangled = func(compiler_state)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/typed_passes.py", line 112, in run_pass
typemap, return_type, calltypes, errs = type_inference_stage(
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/typed_passes.py", line 93, in type_inference_stage
errs = infer.propagate(raise_errors=raise_errors)
File "/opt/conda/envs/rapids/lib/python3.10/site-packages/numba/core/typeinfer.py", line 1091, in propagate
raise errors[0]
numba.core.errors.TypingError: Failed in cuda mode pipeline (step: nopython frontend)
Unknown attribute 'fix_text' of type Module(<module 'ftfy' from '/opt/conda/envs/rapids/lib/python3.10/site-packages/ftfy/__init__.py'>)

File "../../opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modifiers/unicode_reformatter.py", line 25:
def modify_document(self, text):
return ftfy.fix_text(text)
^

During: typing of get attribute at /opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modifiers/unicode_reformatter.py (25)

File "../../opt/conda/envs/rapids/lib/python3.10/site-packages/nemo_curator/modifiers/unicode_reformatter.py", line 25:
def modify_document(self, text):
return ftfy.fix_text(text)
Expected behavior
Better checks and error messages.
In addition to these checks, a few examples showing how to transition between backends would be helpful, and maybe even exposing the method for switching backends on the high-level DocumentDataset class.
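One way the backend switch could be exposed on a dataset wrapper is sketched below. `DatasetSketch` is a hypothetical stand-in, not the real DocumentDataset, and it assumes the wrapped object is a Dask DataFrame (or anything else offering a `to_backend(backend)` method).

```python
class DatasetSketch:
    """Hypothetical stand-in for a high-level dataset wrapper."""

    SUPPORTED_BACKENDS = ("pandas", "cudf")

    def __init__(self, df):
        # Assumed to be a Dask DataFrame or an object with `to_backend`.
        self.df = df

    def to_backend(self, backend: str) -> "DatasetSketch":
        """Return a new wrapper whose dataframe lives on `backend`."""
        if backend not in self.SUPPORTED_BACKENDS:
            raise ValueError(
                f"Unknown backend '{backend}'; expected one of "
                f"{self.SUPPORTED_BACKENDS}."
            )
        # Delegate the actual conversion to the wrapped dataframe.
        return DatasetSketch(self.df.to_backend(backend))
```

With this in place, a user who hits the error above on a GPU dataset could write `cpu_dataset = gpu_dataset.to_backend("pandas")` and pass `cpu_dataset` to the ftfy modifier. Returning a new wrapper rather than mutating in place keeps the original dataset usable for GPU-only modules.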