NVIDIA-Merlin / models

Merlin Models is a collection of deep learning recommender system model reference implementations
https://nvidia-merlin.github.io/models/main/index.html
Apache License 2.0

[BUG] Error with prepare_aliccp() #1226

Open ZhanqiuHu opened 8 months ago

ZhanqiuHu commented 8 months ago

I ran into this error when running prepare_aliccp() on downloaded Ali-CCP datasets.

Traceback (most recent call last):
  File "/share/suh-scrap/zh338/aliccp/preprocess.py", line 13, in <module>
    prepare_aliccp(DATA_DIR, convert_train=False, convert_test=True)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 164, in prepare_aliccp
    _convert_data(
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 449, in _convert_data
    merlin.io.Dataset(tmp_files, dtypes=dtypes).to_parquet(out_dir)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 380, in __init__
    self.infer_schema()
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 1240, in infer_schema
    dtypes = self.sample_dtypes(n=n, annotate_lists=True)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 1264, in sample_dtypes
    _real_meta = _set_dtypes(_real_meta, self.dtypes)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/merlin/io/dataset.py", line 1301, in _set_dtypes
    chunk[col] = chunk[col].astype(dtype)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/generic.py", line 6240, in astype
    new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 448, in astype
    return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/internals/managers.py", line 352, in apply
    applied = getattr(b, f)(**kwargs)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/internals/blocks.py", line 526, in astype
    new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 299, in astype_array_safe
    new_values = astype_array(values, dtype, copy=copy)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 230, in astype_array
    values = astype_nansafe(values, dtype, copy=copy)
  File "/home/zh338/.conda/envs/merlin-env/lib/python3.10/site-packages/pandas/core/dtypes/astype.py", line 170, in astype_nansafe
    return arr.astype(dtype, copy=True)
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

I saw another issue (#507) describing a similar problem, but it didn't mention a solution or workaround. Is there a workaround to avoid this error?

Thanks!

ibraheemalayan commented 2 months ago

Same issue here. Any updates?

ibraheemalayan commented 2 months ago

The dataset contains None values, as you can see by displaying the head of the dataset:

[Screenshot: head of the dataset, showing None values]

I solved it by changing

https://github.com/NVIDIA-Merlin/models/blob/eb1e54196a64a70950b2a7e7744d2150e052d53e/merlin/datasets/ecommerce/aliccp/dataset.py#L448

to

dtypes = {f.name: "Int32" for f in _Features().features}

(Int32 with a capital "I" is the pandas nullable integer dtype, which can hold missing values, unlike the lowercase NumPy int32.)

With the new dtypes:

[Screenshot: head of the dataset with the new dtypes]
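
For anyone who wants to see the difference outside of Merlin, here is a minimal sketch in plain pandas. The column values are made up for illustration and are not taken from Ali-CCP; the only assumption is that a feature column arrives as Python objects with a None in it, which is what the traceback above suggests.

```python
import pandas as pd

# Hypothetical stand-in for one feature column that contains a missing value.
col = pd.Series([101, None, 205], dtype="object")

# Casting to the NumPy-backed int32 dtype fails, because None cannot be
# converted to an integer.
try:
    col.astype("int32")
    numpy_cast_failed = False
except (TypeError, ValueError) as exc:
    numpy_cast_failed = True
    print(f"int32 cast failed: {exc}")

# Casting to the nullable Int32 extension dtype succeeds; None becomes pd.NA.
nullable = col.astype("Int32")
print(nullable.dtype)
print(int(nullable.isna().sum()))
```

Note that the missing values are kept as pd.NA rather than filled in, so any later step that needs plain integers would still have to decide how to handle them (for example with fillna).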