NVTabular is a feature engineering and preprocessing library for tabular data designed to quickly and easily manipulate terabyte scale datasets used to train deep learning based recommender systems.
Apache License 2.0
1.05k stars, 143 forks
[BUG] ops.Categorify frequency hashing raises RuntimeError when the dataset is shuffled by keys #1864
Describe the bug
ops.Categorify raises ValueError: Column must have no nulls. when num_buckets > 1 and the dataset is shuffled by keys. EDIT: the full error message is here: https://pastebin.com/GJRQhxAi
Steps/Code to reproduce bug
import gc
import dask.dataframe as dd
import numpy as np
import pandas as pd
import nvtabular as nvt
# Generate synthetic data
N_ROWS = 100_000_000
CHUNK_SIZE = 10_000_000
N = N_ROWS // CHUNK_SIZE
dataframes = []
for i in range(N):
    print(f"{i+1}/{N}")
    chunk_data = np.random.lognormal(3., 10., int(CHUNK_SIZE)).astype(np.int32)
    chunk_ddf = dd.from_pandas(pd.DataFrame({'session_id': (chunk_data // 45), 'item_id': chunk_data}), npartitions=1)
    dataframes.append(chunk_ddf)
ddf = dd.concat(dataframes, axis=0)
del dataframes
gc.collect()
# !!! When `shuffle_by_keys` is commented out, the code finishes successfully
dataset = nvt.Dataset(ddf).shuffle_by_keys(keys=["session_id"])
_categorical_feats = [
    "item_id",
] >> nvt.ops.Categorify(
    freq_threshold=5,
    # !!! When `num_buckets=None`, the code finishes successfully
    num_buckets=100,
)
workflow = nvt.Workflow(_categorical_feats)
workflow.fit(dataset)
workflow.output_schema
Expected behavior
ops.Categorify should fit successfully when num_buckets > 1 and the dataset is shuffled by keys.
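As a possible interim workaround (an untested sketch, assuming the fitted category statistics do not depend on the partition layout), the workflow could be fitted on the unshuffled dataset and the shuffle applied only for the transform step:

# Untested workaround sketch: fit Categorify on the unshuffled dataset,
# then shuffle by keys only before transforming.
workflow = nvt.Workflow(_categorical_feats)
workflow.fit(nvt.Dataset(ddf))  # no shuffle_by_keys during fit
shuffled = nvt.Dataset(ddf).shuffle_by_keys(keys=["session_id"])
transformed = workflow.transform(shuffled)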
Environment details (please complete the following information):
Environment location: JupyterLab in Docker on GCP
Method of NVTabular install: Docker
My Dockerfile:
# Adapted from https://github.com/GoogleCloudPlatform/nvidia-merlin-on-vertex-ai
FROM nvcr.io/nvidia/merlin/merlin-pytorch:23.08
# Install Google Cloud SDK
RUN echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] http://packages.cloud.google.com/apt cloud-sdk main" | tee -a /etc/apt/sources.list.d/google-cloud-sdk.list && curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | apt-key --keyring /usr/share/keyrings/cloud.google.gpg add - && apt-get update -y && apt-get install google-cloud-sdk -y
# Copy your project to the Docker image
COPY . /project
WORKDIR /project
# Install Python dependencies
RUN pip install -U pip
RUN pip install -r requirements/base.txt
# Run Jupyter Lab by default, with no authentication, on port 8080
EXPOSE 8080
CMD ["jupyter-lab", "--allow-root", "--ip=0.0.0.0", "--port=8080", "--no-browser", "--NotebookApp.token=''", "--NotebookApp.allow_origin='*'"]
Additional context
I need to call shuffle_by_keys because I apply a Groupby operation afterwards; a sketch of that stage follows.
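For reference, a minimal sketch of the downstream Groupby stage that requires the shuffle (the aggregation choices here are illustrative, not the exact ones from my pipeline):

# Illustrative Groupby stage; shuffle_by_keys ensures all rows of a
# given session land in the same partition before grouping.
groupby_feats = ["session_id", "item_id"] >> nvt.ops.Groupby(
    groupby_cols=["session_id"],
    aggs={"item_id": ["list", "count"]},
)
groupby_workflow = nvt.Workflow(groupby_feats)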