Describe the bug
Training the Random, ADMMSLIM, and SLIMElastic models on MovieLens-100K crashes with the exception "RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!" The log below is from the Random run.
CUDA available: True
command line args [--data_set_name MovieLens-100K --model_name Random] will not be used in RecBole
24 Jan 15:52 INFO
General Hyper Parameters:
gpu_id = 0
use_gpu = True
seed = 42
state = INFO
reproducibility = True
data_path = ./data_sets/MovieLens-100K
checkpoint_dir = ./data_sets/MovieLens-100K/recbole_checkpoints/
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False
Training Hyper Parameters:
epochs = 50
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}
eval_step = 5
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4
Evaluation Hyper Parameters:
eval_args = {'split': {'LS': 'valid_and_test'}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'uni100', 'test': 'uni100'}}
repeatable = False
metrics = ['Recall', 'MRR', 'NDCG', 'Hit', 'MAP', 'Precision', 'GAUC', 'ItemCoverage', 'AveragePopularity', 'GiniIndex', 'ShannonEntropy', 'TailPercentage']
topk = [1, 3, 5, 10, 20]
valid_metric = NDCG@10
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4
Dataset Hyper Parameters:
field_separator =
seq_separator =
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = {}
LABEL_FIELD = label
threshold = None
NEG_PREFIX = neg_
load_col = {'inter': ['user_id', 'item_id', 'rating']}
unload_col = {}
unused_col = {}
additional_feat_suffix = []
rm_dup_inter = None
val_interval = {}
filter_inter_by_user_or_item = True
user_inter_num_interval = [0, inf)
item_inter_num_interval = [0, inf)
alias_of_user_id = None
alias_of_item_id = None
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = {}
normalize_field = []
normalize_all = False
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id
benchmark_filename = None
Other Hyper Parameters:
worker = 0
wandb_project = recbole
shuffle = True
require_pow = False
enable_amp = False
enable_scaler = False
transform = None
numerical_features = []
discretization = None
kg_reverse_r = False
entity_kg_num_interval = [0, inf)
relation_kg_num_interval = [0, inf)
MODEL_TYPE = ModelType.GENERAL
encoding = utf-8
training_neg_sample_args = {'distribution': 'uniform', 'sample_num': 1, 'dynamic': False, 'candidate_num': 0}
MODEL_INPUT_TYPE = InputType.POINTWISE
eval_type = EvaluatorType.RANKING
single_spec = True
local_rank = 0
device = cuda
valid_neg_sample_args = {'distribution': 'uniform', 'sample_num': 100}
test_neg_sample_args = {'distribution': 'uniform', 'sample_num': 100}
24 Jan 15:52 INFO MovieLens-100K
The number of users: 944
Average actions of users: 106.04453870625663
The number of items: 1683
Average actions of items: 59.45303210463734
The number of inters: 100000
The sparsity of the dataset: 93.70575143257098%
Remain Fields: ['user_id', 'item_id', 'rating']
24 Jan 15:52 INFO [Training]: train_batch_size = [2048] train_neg_sample_args: [{'distribution': 'uniform', 'sample_num': 1, 'alpha': 1.0, 'dynamic': False, 'candidate_num': 0}]
24 Jan 15:52 INFO [Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'LS': 'valid_and_test'}, 'order': 'RO', 'group_by': 'user', 'mode': {'valid': 'uni100', 'test': 'uni100'}}]
24 Jan 15:52 INFO Random()
Trainable parameters: 1
24 Jan 15:52 INFO epoch 0 training [time: 0.22s, train loss: 0.0000]
24 Jan 15:52 INFO epoch 1 training [time: 0.19s, train loss: 0.0000]
24 Jan 15:52 INFO epoch 2 training [time: 0.19s, train loss: 0.0000]
24 Jan 15:52 INFO epoch 3 training [time: 0.19s, train loss: 0.0000]
24 Jan 15:52 INFO epoch 4 training [time: 0.19s, train loss: 0.0000]
Traceback (most recent call last):
  File "/mnt/./run_recbole_test.py", line 158, in <module>
    best_valid_score, best_valid_result = trainer.fit(train_data, valid_data)
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 464, in fit
    valid_score, valid_result = self._valid_epoch(
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 283, in _valid_epoch
    valid_result = self.evaluate(
  File "/usr/local/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 616, in evaluate
    interaction, scores, positive_u, positive_i = eval_func(batched_data)
  File "/usr/local/lib/python3.10/site-packages/recbole/trainer/trainer.py", line 558, in _neg_sample_batch_eval
    scores[row_idx, col_idx] = origin_scores
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
Expected behavior
The Random, ADMMSLIM, and SLIMElastic models should train and evaluate on the MovieLens-100K data set without crashing.
Desktop (please complete the following information):
OS: Linux
RecBole Version: 1.2.0
Python Version: 3.10
PyTorch Version: 2.1.1
cudatoolkit Version: 12.1
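I believe this happens during validation: the crash comes right after epoch 4, when the first validation pass runs (eval_step = 5), and the traceback ends in _neg_sample_batch_eval. I also believe the same bug was fixed for different models in #1873.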
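To illustrate what I think is going wrong, here is a minimal sketch of the failing assignment and one possible workaround. It is not RecBole's code: the function name fill_scores, the tensor sizes, and the sample indices are made up for illustration, and it assumes that the affected models return their prediction scores as a CPU tensor while the trainer builds the score matrix on the GPU.

```python
import torch


def fill_scores(origin_scores, row_idx, col_idx, n_users, n_items, device):
    """Scatter per-pair scores into a dense (n_users, n_items) matrix on `device`.

    Hypothetical helper: it mirrors the shape of the failing line in the
    traceback, but it is not taken from RecBole.
    """
    scores = torch.full((n_users, n_items), float("-inf"), device=device)
    # Move the values (and, defensively, the indices) to the target device
    # before the indexed assignment, so all tensors live on the same device.
    scores[row_idx.to(device), col_idx.to(device)] = origin_scores.to(device)
    return scores


# Use the GPU when one is present, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Assumption: the model creates its scores on the CPU, ignoring the device.
origin_scores = torch.rand(4)
row_idx = torch.tensor([0, 0, 1, 1])  # user row for each (user, item) pair
col_idx = torch.tensor([1, 2, 0, 3])  # item column for each pair

scores = fill_scores(origin_scores, row_idx, col_idx, 2, 4, device)
print(scores.device, tuple(scores.shape))
```

On a CUDA machine, dropping the .to(device) call on origin_scores should reproduce the "Expected all tensors to be on the same device" error from the traceback. A fix inside each model (returning scores on the configured device) would presumably be cleaner than patching the trainer, but I have not checked which approach #1873 took.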