NVIDIA-Merlin / NVTabular

NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0

[BUG] ops.Categorify frequency hashing raises RuntimeError when the dataset is shuffled by keys #1864

Open · piojanu opened this issue 1 year ago

Describe the bug

ops.Categorify raises ValueError: Column must have no nulls. when num_buckets > 1 and the dataset has been shuffled by keys. EDIT: the whole error message: https://pastebin.com/GJRQhxAi

Steps/Code to reproduce bug

import gc

import dask.dataframe as dd
import numpy as np
import pandas as pd

import nvtabular as nvt

# Generate synthetic data
N_ROWS = 100_000_000
CHUNK_SIZE = 10_000_000

N = N_ROWS // CHUNK_SIZE
dataframes = []
for i in range(N):
    print(f"{i+1}/{N}")
    chunk_data = np.random.lognormal(3., 10., int(CHUNK_SIZE)).astype(np.int32)
    chunk_ddf = dd.from_pandas(pd.DataFrame({'session_id': (chunk_data // 45), 'item_id': chunk_data}), npartitions=1)
    dataframes.append(chunk_ddf)

ddf = dd.concat(dataframes, axis=0)
del dataframes
gc.collect()

# !!! When `shuffle_by_keys` is commented out, the code finishes successfully
dataset = nvt.Dataset(ddf).shuffle_by_keys(keys=["session_id"])

_categorical_feats = [
    "item_id",
] >> nvt.ops.Categorify(
    freq_threshold=5,
    # !!! When `num_buckets=None`, the code finishes successfully
    num_buckets=100,
)

workflow = nvt.Workflow(_categorical_feats)
workflow.fit(dataset)
workflow.output_schema

Expected behavior

ops.Categorify fits successfully when num_buckets > 1 and the dataset is shuffled by keys.
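A workaround I am experimenting with (an untested assumption, not a confirmed fix): Categorify computes its fit statistics from value frequencies, which should not depend on row order, so fitting on the unshuffled dataset and only transforming the shuffled one may sidestep the failing code path:

# Untested workaround sketch: fit before shuffling, transform afterwards.
workflow = nvt.Workflow(_categorical_feats)
workflow.fit(nvt.Dataset(ddf))  # fit statistics on the unshuffled data
shuffled = nvt.Dataset(ddf).shuffle_by_keys(keys=["session_id"])
transformed = workflow.transform(shuffled)  # lazy; evaluated when written out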

Environment details (please complete the following information):

My Dockerfile:

# Based on https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai
FROM nvcr.io/nvidia/merlin/merlin-pytorch:23.08

# Install Google Cloud SDK
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg  add - && apt-get update -y && apt-get install google-cloud-sdk -y

# Copy your project to the Docker image
COPY . /project
WORKDIR /project

# Install Python dependencies
RUN pip install -U pip
RUN pip install -r requirements/base.txt

# Run Jupyter Lab by default, with no authentication, on port 8080
EXPOSE 8080
CMD ["jupyter-lab", "--allow-root", "--ip=0.0.0.0", "--port=8080", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'"]

Additional context

I need to call shuffle_by_keys because I apply a Groupby op afterwards; see the sketch below.
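For illustration, a minimal sketch of that downstream step (the exact aggregations here are my assumptions, simplified from the real pipeline). NVTabular's Groupby op aggregates within partitions, so every session must live entirely in a single partition, which is what shuffle_by_keys guarantees:

# Sketch of the intended pipeline: Categorify the items, then aggregate per session.
cat_items = ["item_id"] >> nvt.ops.Categorify(freq_threshold=5, num_buckets=100)
session_feats = (cat_items + ["session_id"]) >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    aggs={"item_id": ["list", "count"]},  # illustrative aggregations only
)
workflow = nvt.Workflow(session_feats)

Without shuffle_by_keys, a session split across partitions would yield multiple partial rows per session, which is why I cannot simply drop that call.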