RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] config val_interval fails for yelp dataset (but works for ml100k) #1986

Closed jimmy-academia closed 5 months ago

jimmy-academia commented 5 months ago

Describe the bug

With the yelp dataset, setting 'val_interval': {'rating': "[3, inf)"} fails with ValueError: Field [rating] not defined in dataset. The same setting works for ml100k.

I use the following configuration:

from recbole.config import Config
from recbole.data.dataset import Dataset
from recbole.data import create_dataset

# Filtering options applied on top of the default config
additional_config = {
    'rm_dup_inter': 'first',
    'val_interval': {'rating': "[3, inf)"},
    'user_inter_num_interval': '[15, inf)',
    'item_inter_num_interval': '[15, inf)',
}

# dataset_name and split_order are defined earlier in my script
config = Config(model='LightGCN', dataset=dataset_name, config_file_list=[])
config['eval_args']['order'] = split_order
config['data_path'] = f'cache_data/{dataset_name}/raw'
for key, value in additional_config.items():
    config[key] = value

# Both Dataset(config) and create_dataset(config) raise the same error
dataset = Dataset(config)

where dataset_name='yelp' or dataset_name='ml100k'

Expected behavior

The val_interval filter should be applied to the yelp dataset the same way it is for ml100k. Instead, I get this traceback:

Traceback (most recent call last):
  File "data.py", line 121, in <module>
    train_data, valid_data, test_data = prepare_recbole_data('yelp')
  File "data.py", line 59, in prepare_recbole_data
    dataset = create_dataset(config)
  File "anaconda3/envs/recbull/lib/python3.10/site-packages/recbole/data/utils.py", line 72, in create_dataset
    dataset = dataset_class(config)
  File "anaconda3/envs/recbull/lib/python3.10/site-packages/recbole/data/dataset/dataset.py", line 108, in __init__
    self._from_scratch()
  File "anaconda3/envs/recbull/lib/python3.10/site-packages/recbole/data/dataset/dataset.py", line 120, in _from_scratch
    self._data_processing()
  File "anaconda3/envs/recbull/lib/python3.10/site-packages/recbole/data/dataset/dataset.py", line 162, in _data_processing
    self._data_filtering()
  File "anaconda3/envs/recbull/lib/python3.10/site-packages/recbole/data/dataset/dataset.py", line 187, in _data_filtering
    self._filter_by_field_value()
  File "anaconda3/envs/recbull/lib/python3.10/site-packages/recbole/data/dataset/dataset.py", line 1041, in _filter_by_field_value
    raise ValueError(f"Field [{field}] not defined in dataset.")
ValueError: Field [rating] not defined in dataset.

Printing self.field2type at line 1041 shows the following for dataset_name='yelp':

{'user_id': <FeatureType.TOKEN: 'token'>, 'item_id': <FeatureType.TOKEN: 'token'>}


Side Note

I also get the following error:

RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

It is raised by torch.distributed.barrier() in recbole/data/dataset/dataset.py, line 252, in _download, when I run the code for the first time and the datasets are downloaded. What is the minimal fix to prevent this error?
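
One possible workaround, sketched here under the assumption that the barrier is only needed for multi-process runs, is to skip it when no default process group exists. The helper name barrier_if_initialized is hypothetical and is not part of RecBole; it takes the torch.distributed module as an argument and relies only on its is_available(), is_initialized() and barrier() functions.

```python
def barrier_if_initialized(dist):
    """Run dist.barrier() only when a default process group exists.

    `dist` is expected to be the `torch.distributed` module. It is passed in
    as an argument so this helper has no hard dependency on torch.
    Returns True if the barrier was executed, False if it was skipped.
    """
    if dist.is_available() and dist.is_initialized():
        dist.barrier()
        return True
    # Single-process run: there is nothing to synchronize with.
    return False


# Intended usage (assumes torch is installed):
#   import torch.distributed as dist
#   barrier_if_initialized(dist)
```

In a single-process run the call returns False instead of raising the RuntimeError. Whether patching the library call this way is acceptable depends on how the download step is meant to be coordinated across ranks, so treat it as a local workaround rather than an upstream fix.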

jimmy-academia commented 5 months ago

I found that under properties/ there is an ml-100k.yaml but no YAML file for yelp.
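
This would explain the field2type output above: without a dataset YAML, only user_id and item_id appear to be loaded, so rating never reaches the filter. A minimal sketch for checking which fields the yelp interaction file really defines, and for building a load_col entry from them, is below. It assumes the file follows RecBole's atomic-file layout (a tab-separated header of name:type columns). The helper names, the file path and the field name rating in the usage comment are illustrative assumptions; the actual yelp header may use different names.

```python
def read_atomic_fields(inter_path):
    """Return {field_name: field_type} parsed from an atomic file header."""
    with open(inter_path, encoding="utf-8") as f:
        header = f.readline().rstrip("\n")
    fields = {}
    for column in header.split("\t"):
        # Each header column looks like "user_id:token" or "rating:float".
        name, _, ftype = column.partition(":")
        fields[name] = ftype
    return fields


def build_load_col(fields, wanted):
    """Build a load_col dict, failing early if a wanted field is absent."""
    missing = [name for name in wanted if name not in fields]
    if missing:
        raise KeyError(f"Fields missing from the .inter header: {missing}")
    return {"inter": list(wanted)}


# Intended usage (hypothetical path and field names, adjust to the header):
#   fields = read_atomic_fields("cache_data/yelp/raw/yelp.inter")
#   load_col = build_load_col(fields, ["user_id", "item_id", "rating"])
#   config = Config(model="LightGCN", dataset="yelp",
#                   config_dict={"load_col": load_col,
#                                "val_interval": {"rating": "[3, inf)"}})
```

If the header shows a different name for the score column, that name would have to be used in both load_col and val_interval.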