Closed vs385 closed 1 year ago
@vs385 where is this error coming from? What's your NVT pipeline? IndexError: list index out of range generally happens when the pipeline cannot access the data files, or when your data path does not contain any files. Can you please use the merlin-tensorflow:23.06 image and test again?
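One quick way to rule out the empty-path cause mentioned above is to list the parquet files the data path actually resolves to before fitting. This is a minimal sketch using only the Python standard library; the helper name list_parquet_parts and the example path are illustrative and not part of NVTabular.

```python
from pathlib import Path


def list_parquet_parts(data_path):
    """Return the sorted parquet files found at data_path.

    data_path may be a single .parquet file or a directory of part files
    (for example the output of a dask_cudf to_parquet call).
    """
    path = Path(data_path)
    if path.is_file():
        return [path] if path.suffix == ".parquet" else []
    if not path.is_dir():
        return []
    # Only regular files count; the dataset directory itself may be named *.parquet.
    return sorted(p for p in path.rglob("*.parquet") if p.is_file())


if __name__ == "__main__":
    # Hypothetical path; replace it with the directory passed to the dataset loader.
    parts = list_parquet_parts("data/train_pre_full.parquet")
    print(f"found {len(parts)} parquet file(s)")
    for part in parts:
        print(part, part.stat().st_size, "bytes")
```

An empty result, or part files of zero bytes, would point to the data path rather than to the library version. A non-empty listing, as reported later in this thread, suggests looking elsewhere.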
@rnyak this error comes from fitting the workflow on my NVTabular dataset, which I load from a parquet file that was saved using dask_cudf. I followed the example notebook from here and adapted the column names and other details to fit my own dataset.
"IndexError: list index out of range generally happens when the pipeline cannot access the data files, or when your data path does not contain any files."
Yes, I know. I also checked that the files were in the path and they were all there (around 24 parquet partition files in the directory), which is why I couldn't understand where this error was coming from.
"Can you please use the merlin-tensorflow:23.06 image and test again?"
I tried with 23.06 and it works without any error. I only get this error with 23.08, so I had to downgrade again afterwards.
Closing this ticket since the user can make it work with the 23.06 image.
Describe the bug
I started getting this error when I upgraded to nvtabular 23.08:
2023-08-30 14:53:26,012 - distributed.worker - WARNING - Compute Failed
Key: ('transform-6435a10df7c6acb222987cc5dda4ed1d', 0)
Function: subgraph_callable-98f2413b-b494-431b-a9e6-3f229f0d
args: ({'piece': ('PWD/basedir/data/train_pre_full.parquet/part.0.parquet', [0, 1, 2], [])})
kwargs: {}
Exception: "IndexError('list index out of range')"
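To see which partition the failing task was reading, the file path can be pulled out of the args field of the warning. The sketch below is a hypothetical helper (failed_piece is not a Dask or NVTabular function) and assumes the tuple layout shown in this log, where the first element is the file path. Reading the second element as a list of row-group indices is an assumption, not something the log itself states.

```python
import ast
import re

# Matches the start of the piece tuple: ('<path>', [<list>]
PIECE_RE = re.compile(r"'piece':\s*\(\s*'([^']+)'\s*,\s*(\[[^\]]*\])")


def failed_piece(log_text):
    """Return (path, second_element) from a 'Compute Failed' warning, or None."""
    match = PIECE_RE.search(log_text)
    if match is None:
        return None
    path, raw_list = match.groups()
    # literal_eval safely parses the list literal, e.g. "[0, 1, 2]".
    return path, ast.literal_eval(raw_list)


if __name__ == "__main__":
    log = (
        "args: ({'piece': ('PWD/basedir/data/train_pre_full.parquet/"
        "part.0.parquet', [0, 1, 2], [])})"
    )
    print(failed_piece(log))
```

The extracted path can then be opened on its own to check whether that single part file is readable outside the workflow.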
Steps/Code to reproduce bug
Fit an nvtabular workflow on a full-dataset parquet file with 20+ partitions using LocalCUDACluster. The parquet file is written using dask_cudf and then loaded as an NVT dataset before fitting the workflow: proc.fit(full_dataset)
Expected behavior
The workflow should fit successfully, as it did when I was running nvtabular 23.06. When I run the same steps on a multi-GPU instance, the error is not thrown.
Environment details (please complete the following information):
Merlin version: nvcr.io/nvidia/merlin/merlin-tensorflow:23.08
Platform: ec2 g5 instance linux (8xlarge), large single GPU instance
Python version: 3.10.12
PyTorch version (GPU?):
Tensorflow version (GPU?): 2.12.0+nv23.6
docker run --runtime=nvidia --rm -it -p 8888:8888 -p 8797:8787 -p 8796:8786 --ipc=host --cap-add SYS_NICE nvcr.io/nvidia/merlin/merlin-tensorflow:latest /bin/bash