NVIDIA-Merlin / Transformers4Rec

Transformers4Rec is a flexible and efficient library for sequential and session-based recommendation and works with PyTorch.
https://nvidia-merlin.github.io/Transformers4Rec/main
Apache License 2.0
1.13k stars, 147 forks

RuntimeError: CUDF failure at: /__w/cudf/cudf/cpp/src/io/parquet/reader_impl_helpers.cpp:379: Invalid rowgroup index [BUG] #756

Open Oussamakhammassi opened 1 year ago

Oussamakhammassi commented 1 year ago

I tried to run the Transformers4Rec tutorial and got this error:

RuntimeError                              Traceback (most recent call last)

/usr/local/lib/python3.10/dist-packages/transformers/trainer.py in evaluate(self, eval_dataset, ignore_keys, metric_key_prefix)
   3005         self._memory_tracker.start()
   3006
-> 3007         eval_dataloader = self.get_eval_dataloader(eval_dataset)
   3008         start_time = time.time()
   3009

16 frames

/usr/local/lib/python3.10/dist-packages/cudf/io/parquet.py in _read_parquet(filepaths_or_buffers, engine, columns, row_groups, use_pandas_metadata, *args, **kwargs)
    819                 f"following positional arguments: {list(args)}"
    820             )
--> 821         return libparquet.read_parquet(
    822             filepaths_or_buffers,
    823             columns=columns,

parquet.pyx in cudf._lib.parquet.read_parquet()

parquet.pyx in cudf._lib.parquet.read_parquet()

RuntimeError: CUDF failure at: /__w/cudf/cudf/cpp/src/io/parquet/reader_impl_helpers.cpp:379: Invalid rowgroup index

rnyak commented 1 year ago

@Oussamakhammassi can you please tell us how you installed transformers4rec? Are you using the merlin-pytorch image?

Please also start with the examples at https://github.com/NVIDIA-Merlin/Transformers4Rec/tree/main/examples/getting-started-session-based, since the tutorial notebooks have not been updated recently.

Oussamakhammassi commented 1 year ago

Hi rnyak!

pip install transformers4rec[nvtabular]

No, I'm not using the merlin-pytorch image.

rnyak commented 1 year ago

@Oussamakhammassi I'd recommend using the Docker image. Installing only transformers4rec[nvtabular] won't install cudf, dask_cudf, etc.
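A quick way to see which of these packages are actually visible to the current Python environment is sketched below. It is only an illustration, not something from the Merlin docs: the helper name `check_modules` is made up for this example, and the list contains just the import names mentioned in this thread.

```python
from importlib.util import find_spec

# Top-level import names mentioned in this thread (not an exhaustive list
# of Merlin dependencies).
REQUIRED = ["cudf", "dask_cudf", "nvtabular", "transformers4rec"]


def check_modules(names):
    """Return {name: bool}: True if the top-level module can be located.

    find_spec only looks the module up; it does not import it, so this
    will not catch a package that is installed but fails at import time
    (for example because of a CUDA/driver mismatch).
    """
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, found in check_modules(REQUIRED).items():
        print(f"{name}: {'found' if found else 'MISSING'}")
```

If any entry is reported as missing, the pip-only install is probably incomplete. If everything is found and the error persists, the problem is more likely elsewhere.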

If you want to install via pip, you need to install RAPIDS cudf and dask_cudf first (please see their docs here: https://docs.rapids.ai/install) and then install the other Merlin libs as well:

Oussamakhammassi commented 1 year ago

Yes, I did all that, but it still doesn't work!

rnyak commented 1 year ago

@Oussamakhammassi you need a compatible GPU and a properly installed CUDA driver to be able to import and use the cudf library. What are your GPU specs? Can you share the output of nvidia-smi and also nvcc --version?
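For notebook environments, both outputs can be collected in one cell with something like the following. This is a minimal sketch using only the Python standard library. The helper names `run_tool` and `collect_gpu_diagnostics` are hypothetical, and the 30-second timeout is an arbitrary choice.

```python
import shutil
import subprocess


def run_tool(cmd):
    """Run a command and return its text output, or a short note on failure."""
    if shutil.which(cmd[0]) is None:
        return f"{cmd[0]}: not found on PATH"
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        return f"{cmd[0]}: failed to run ({exc})"
    # Some tools report problems on stderr only, so fall back to it.
    return (result.stdout or result.stderr).strip()


def collect_gpu_diagnostics():
    """Gather the two outputs requested above, keyed by the command line."""
    return {
        "nvidia-smi": run_tool(["nvidia-smi"]),
        "nvcc --version": run_tool(["nvcc", "--version"]),
    }


if __name__ == "__main__":
    for command, output in collect_gpu_diagnostics().items():
        print(f"$ {command}\n{output}\n")
```

Pasting the printed text as-is keeps the table layout readable in the issue.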

rnyak commented 1 year ago

@Oussamakhammassi also, can you please run these example notebooks first? https://github.com/NVIDIA-Merlin/Transformers4Rec/tree/main/examples/getting-started-session-based

Oussamakhammassi commented 1 year ago

For the version, here's the output:

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2022 NVIDIA Corporation
Built on Wed_Sep_21_10:33:58_PDT_2022
Cuda compilation tools, release 11.8, V11.8.89
Build cuda_11.8.r11.8/compiler.31833905_0

As for the example you sent me: yes, I ran it and it works well, but I don't know why the other examples fail with this error.

Oussamakhammassi commented 1 year ago

Wed Nov  8 15:51:59 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   39C    P8     9W /  70W |      0MiB / 15360MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Bharathjpv commented 10 months ago

I worked with this too. The example notebooks run fine, but when I use custom data, it throws this error when I call the trainer.evaluate() method.

rnyak commented 10 months ago

@Bharathjpv please share your error and a reproducible toy example. We need to see what you are doing in your NVT, model training, and eval pipeline to help you. Thanks.