NVIDIA-Merlin / models

Merlin Models is a collection of deep learning recommender system model reference implementations
https://nvidia-merlin.github.io/models/main/index.html
Apache License 2.0
262 stars 50 forks

[BUG] get_aliccp() failed while processing Ali CCP dataset #507

Open future-xy opened 2 years ago

future-xy commented 2 years ago

Bug description

I'm using get_aliccp() to process the Ali CCP dataset, following the multi-stage recsys example, but I ran into a problem:

Reading common features...: 730600it [09:54, 1229.12it/s]
Processing data...: 100000it [00:04, 24994.22it/s]
Traceback (most recent call last):
  File "dataprocess.py", line 30, in <module>
    train, valid = get_aliccp("/workspace/data/aliccp/")
  File "/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 101, in get_aliccp
    prepare_alliccp(path, output_dir=raw_path, file_size=file_size, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 155, in prepare_alliccp
    _convert_data(
  File "/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 442, in _convert_data
    df.to_parquet(
  File "/usr/local/lib/python3.8/dist-packages/pandas/util/_decorators.py", line 207, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/frame.py", line 2677, in to_parquet
    return to_parquet(
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parquet.py", line 416, in to_parquet
    impl.write(
  File "/usr/local/lib/python3.8/dist-packages/pandas/io/parquet.py", line 194, in write
    self.api.parquet.write_table(
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 2017, in write_table
    with ParquetWriter(
  File "/usr/local/lib/python3.8/dist-packages/pyarrow/parquet.py", line 663, in __init__
    self.writer = _parquet.ParquetWriter(
  File "pyarrow/_parquet.pyx", line 1377, in pyarrow._parquet.ParquetWriter.__cinit__
TypeError: __cinit__() got an unexpected keyword argument 'overwrite'

Steps/Code to reproduce bug

My data processing code is:

import logging
import os
import time

from merlin.datasets.ecommerce import get_aliccp, transform_aliccp
from merlin.schema.tags import Tags
from nvtabular.ops import (
    AddMetadata,
    Categorify,
    TagAsItemFeatures,
    TagAsItemID,
    TagAsUserFeatures,
    TagAsUserID,
)

# disable INFO and DEBUG logging everywhere
logging.disable(logging.WARNING)

start = time.time()

DATA_FOLDER = os.environ.get("DATA_FOLDER", "/workspace/data/aliccp")
output_path = os.path.join(DATA_FOLDER, 'processed/ranking')

train, valid = get_aliccp("/workspace/data/aliccp/")

user_id = ["user_id"] >> Categorify(dtype='int32') >> TagAsUserID()
item_id = ["item_id"] >> Categorify(dtype='int32') >> TagAsItemID()

item_features = ["item_category", "item_shop", "item_brand"] >> Categorify(dtype='int32') >> TagAsItemFeatures()

user_features = ['user_shops', 'user_profile', 'user_group',
                 'user_gender', 'user_age', 'user_consumption_2', 'user_is_occupied',
                 'user_geography', 'user_intentions', 'user_brands', 'user_categories'] \
    >> Categorify(dtype='int32') >> TagAsUserFeatures()

targets = ["click"] >> AddMetadata(tags=[Tags.BINARY_CLASSIFICATION, "target"])

outputs = user_id + item_id + item_features + user_features + targets

transform_aliccp((train, valid), output_path, nvt_workflow=outputs, workflow_name='workflow_ranking')
print(f"Finished in {time.time()-start} seconds")

My data directory is like:

aliccp/
├─test/
│ ├─common_features_test.csv
│ └─sample_skeleton_test.csv
└─train/
  ├─sample_skeleton_train.csv
  └─common_features_train.csv

Expected behavior

Train and valid datasets should be returned.

Environment details

future-xy commented 2 years ago

It seems the parameter overwrite=True in https://github.com/NVIDIA-Merlin/models/blob/6eb034e61108f7231d66a37c2672426e69a19f8d/merlin/datasets/ecommerce/aliccp/dataset.py#L442-L445 is not recognised by the lower-level components. I commented out this line to bypass the problem, but I'm not sure whether this will cause other bugs.

future-xy commented 2 years ago

After bypassing this bug, I ran into another problem:

Traceback (most recent call last):
  File "dataprocess.py", line 19, in <module>
    train, valid = get_aliccp("/workspace/data/aliccp/")
  File "/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 101, in get_aliccp
    prepare_alliccp(path, output_dir=raw_path, file_size=file_size, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 155, in prepare_alliccp
    _convert_data(
  File "/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 450, in _convert_data
    merlin.io.Dataset(tmp_files, dtypes=dtypes).to_parquet(out_dir)
  File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 347, in __init__
    self.infer_schema()
  File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 1128, in infer_schema
    dtypes = self.sample_dtypes(n=n, annotate_lists=True)
  File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 1150, in sample_dtypes
    _real_meta = _set_dtypes(_real_meta, self.dtypes)
  File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 1188, in _set_dtypes
    chunk[col] = chunk[col].astype(dtype)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 5815, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py", line 418, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py", line 327, in apply
    applied = getattr(b, f)(**kwargs)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/blocks.py", line 591, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py", line 1309, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py", line 1257, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
  File "/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py", line 1174, in astype_nansafe
    return lib.astype_intsafe(arr, dtype)
  File "pandas/_libs/lib.pyx", line 679, in pandas._libs.lib.astype_intsafe
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
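This second failure is pandas behaviour rather than anything Merlin-specific: casting an object-dtype column that still contains None to a plain integer dtype raises exactly this TypeError. A minimal reproduction with two common workarounds (the values here are illustrative):

```python
import pandas as pd

s = pd.Series([1, None, 3], dtype="object")

# Casting object data containing None to a plain integer dtype fails:
#   TypeError: int() argument must be a string, a bytes-like object
#   or a number, not 'NoneType'
try:
    s.astype("int32")
    raised = False
except TypeError:
    raised = True

# Two common workarounds:
filled = s.fillna(0).astype("int32")  # fill missing values first
nullable = s.astype("Int32")          # pandas nullable integer, keeps <NA>
```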
future-xy commented 2 years ago

BTW, the get_aliccp function currently checks whether the "/raw" path exists to decide whether to prepare the dataset, as shown here: https://github.com/NVIDIA-Merlin/models/blob/6eb034e61108f7231d66a37c2672426e69a19f8d/merlin/datasets/ecommerce/aliccp/dataset.py#L97-L101 However, this is not robust if the prepare_alliccp function does not actually finish its job. In the cases mentioned in this issue, I had to rm -rf raw, and https://github.com/NVIDIA-Merlin/models/blob/6eb034e61108f7231d66a37c2672426e69a19f8d/merlin/datasets/ecommerce/aliccp/dataset.py#L379-L380 had to be executed again each time.
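A more robust pattern (a hypothetical sketch, not the library's actual API) is to write a sentinel file only after preparation completes, and gate on that instead of on the directory's existence:

```python
import os


def prepare_if_needed(raw_path, prepare_fn):
    """Run prepare_fn unless a completion marker is present.

    Hypothetical sketch: the '_SUCCESS' sentinel is written only after
    the final step, so a partially finished run is retried instead of
    being mistaken for a prepared dataset.
    """
    marker = os.path.join(raw_path, "_SUCCESS")
    if os.path.exists(marker):
        return False  # already fully prepared
    os.makedirs(raw_path, exist_ok=True)
    prepare_fn(raw_path)       # may raise; marker is then not written
    open(marker, "w").close()  # mark completion last
    return True
```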

viswa-nvidia commented 2 years ago

Setting P1 as there is a workaround available.