Open future-xy opened 2 years ago
It seems the parameter overwrite=True in
https://github.com/NVIDIA-Merlin/models/blob/6eb034e61108f7231d66a37c2672426e69a19f8d/merlin/datasets/ecommerce/aliccp/dataset.py#L442-L445
is not recognised by the lower-level components. I commented out this line to bypass the problem, but I'm not sure whether this will cause other bugs.
After bypassing this bug, I ran into another problem:
Traceback (most recent call last):
File "dataprocess.py", line 19, in <module>
train, valid = get_aliccp("/workspace/data/aliccp/")
File "/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 101, in get_aliccp
prepare_alliccp(path, output_dir=raw_path, file_size=file_size, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 155, in prepare_alliccp
_convert_data(
File "/usr/local/lib/python3.8/dist-packages/merlin/datasets/ecommerce/aliccp/dataset.py", line 450, in _convert_data
merlin.io.Dataset(tmp_files, dtypes=dtypes).to_parquet(out_dir)
File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 347, in __init__
self.infer_schema()
File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 1128, in infer_schema
dtypes = self.sample_dtypes(n=n, annotate_lists=True)
File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 1150, in sample_dtypes
_real_meta = _set_dtypes(_real_meta, self.dtypes)
File "/usr/local/lib/python3.8/dist-packages/merlin/io/dataset.py", line 1188, in _set_dtypes
chunk[col] = chunk[col].astype(dtype)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/generic.py", line 5815, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py", line 418, in astype
return self.apply("astype", dtype=dtype, copy=copy, errors=errors)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/managers.py", line 327, in apply
applied = getattr(b, f)(**kwargs)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/internals/blocks.py", line 591, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py", line 1309, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py", line 1257, in astype_array
values = astype_nansafe(values, dtype, copy=copy)
File "/usr/local/lib/python3.8/dist-packages/pandas/core/dtypes/cast.py", line 1174, in astype_nansafe
return lib.astype_intsafe(arr, dtype)
File "pandas/_libs/lib.pyx", line 679, in pandas._libs.lib.astype_intsafe
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
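For anyone hitting the same traceback: the final frames show pandas casting a column to an integer dtype while the column holds a None. Below is a minimal sketch that reproduces this class of error and shows two possible ways around it. The column contents are made up for illustration, and whether the Ali-CCP failure has exactly this cause is an assumption, not something confirmed in this thread.

```python
import pandas as pd

# Hypothetical column with one missing entry, standing in for a parsed field.
col = pd.Series(["3", None, "7"], dtype=object)

# Casting straight to an integer dtype fails because None cannot become an int.
try:
    col.astype("int32")
    cast_failed = False
except TypeError as exc:
    cast_failed = True
    print("cast failed:", exc)

# Counting missing values per column helps locate the offending field.
n_missing = int(col.isna().sum())

# Option 1: convert to numbers, then fill the gaps before casting.
# The fill value 0 is an arbitrary placeholder, not a recommendation.
filled = pd.to_numeric(col).fillna(0).astype("int32")

# Option 2: use a nullable integer dtype that can hold missing values.
nullable = pd.to_numeric(col).astype("Int32")
```

Which option is appropriate depends on what a missing value means for the field in question, so this is a debugging aid rather than a fix for the library.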
BTW, the get_aliccp function currently checks whether the "/raw" path exists to decide whether to prepare the dataset, as shown below.
https://github.com/NVIDIA-Merlin/models/blob/6eb034e61108f7231d66a37c2672426e69a19f8d/merlin/datasets/ecommerce/aliccp/dataset.py#L97-L101
However, this does not seem robust enough if the prepare_alliccp function fails before it finishes its job.
In the cases mentioned in this issue, I have to run rm raw -rf, and
https://github.com/NVIDIA-Merlin/models/blob/6eb034e61108f7231d66a37c2672426e69a19f8d/merlin/datasets/ecommerce/aliccp/dataset.py#L379-L380
has to be executed each time.
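One way to make the "has the dataset been prepared?" check more robust is to write a marker file only after preparation completes, and to treat a directory without the marker as a leftover from a failed run. The sketch below is hypothetical: ensure_prepared, the _SUCCESS file name, and the prepare_fn callback are illustrative names and not part of merlin.

```python
import shutil
from pathlib import Path

MARKER = "_SUCCESS"  # hypothetical marker file name, not used by merlin


def ensure_prepared(raw_dir, prepare_fn):
    """Run prepare_fn(raw_dir) unless an earlier run finished completely.

    Returns True if preparation ran, False if it was skipped.
    """
    raw_dir = Path(raw_dir)
    marker = raw_dir / MARKER
    if marker.exists():
        return False  # an earlier run completed; nothing to do
    if raw_dir.exists():
        # Directory exists but has no marker: assume an interrupted run
        # and start from a clean state.
        shutil.rmtree(raw_dir)
    raw_dir.mkdir(parents=True)
    prepare_fn(raw_dir)  # if this raises, the marker is never written
    marker.touch()
    return True
```

With a check like this, a crash inside the preparation step would no longer require removing the raw directory by hand, because the next call would detect the missing marker and redo the work.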
Setting P1 as there is a workaround available.
Bug description
I'm using get_aliccp() to process the Ali-CCP dataset following the multi-stage recsys example, but ran into a problem:
Steps/Code to reproduce bug
My data processing code is:
My data directory is like:
Expected behavior
The train and valid datasets should be returned.
Environment details