Describe the bug
I found a bug that, oddly, triggers when training a matrix factorization model (it may well affect other models too, but MF is the one where I first detected it). While training some matrix factorization algorithms, I suddenly hit the error:
RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)
The error was raised in eval_engine.py, at line 253 (I copy the function's code below and mark the offending line with a comment):
```python
def predict(self, data_df, model, batch_eval=False):
    """Make prediction for a trained model.

    Args:
        data_df (DataFrame): A dataset to be evaluated.
        model: A trained model.
        batch_eval (Boolean): A signal to indicate if the model is evaluated in batches.

    Returns:
        array: predicted scores.
    """
    user_ids = data_df[DEFAULT_USER_COL].to_numpy()
    item_ids = data_df[DEFAULT_ITEM_COL].to_numpy()
    if batch_eval:
        n_batch = len(data_df) // self.batch_size + 1
        predictions = np.array([])
        for idx in range(n_batch):
            start_idx = idx * self.batch_size
            end_idx = min((idx + 1) * self.batch_size, len(data_df))
            sub_user_ids = user_ids[start_idx:end_idx]
            sub_item_ids = item_ids[start_idx:end_idx]
            sub_predictions = np.array(
                model.predict(sub_user_ids, sub_item_ids)  # <-- line 253: the error appears here
                .flatten()
                .to(torch.device("cpu"))
                .detach()
                .numpy()
            )
            predictions = np.append(predictions, sub_predictions)
    else:
        predictions = np.array(
            model.predict(user_ids, item_ids)
            .flatten()
            .to(torch.device("cpu"))
            .detach()
            .numpy()
        )
    return predictions
```
After some digging, I discovered that the error occurs for batches containing a single (user, item) pair. In that case, the lists sub_user_ids and sub_item_ids each hold just one element, and the sub_predictions computation fails (most likely because, somewhere along the chain, a scalar is returned instead of a vector of values).
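For illustration, here is a minimal sketch of how a dot-product scorer can raise exactly this error when the batch dimension gets lost on a single-pair batch. The shapes and the squeeze step are my assumptions about the model internals, not the actual beta-recsys code:

```python
import torch

# Hypothetical single-pair batch: one user and one item embedding.
user_emb = torch.randn(1, 8)
item_emb = torch.randn(1, 8)

scores = (user_emb * item_emb).sum(dim=1)  # shape (1,): works fine

# If an intermediate step squeezes away the batch dimension
# (an assumed failure mode), the tensor becomes 1-D ...
flat = (user_emb * item_emb).squeeze()     # shape (8,)

# ... and reducing over dim=1 then raises:
# dimension out of range (expected to be in range of [-1, 0], but got 1)
flat.sum(dim=1)
```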
To Reproduce
To reproduce, given a dataset, set the batch size in MF so that num_test_interactions % batch_size == 1. That way, the last batch will contain exactly one (user, item) pair, and the error will trigger.
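As a concrete example of the arithmetic (the numbers below are made up purely for illustration):

```python
# Illustrative numbers only: any combination with remainder 1 works.
num_test_interactions = 1001
batch_size = 100
assert num_test_interactions % batch_size == 1  # last batch holds exactly 1 pair
```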
Describe your attempts
[X] I checked the documentation and found no answer
[X] I checked to make sure that this is not a duplicate issue
I have solved the issue by modifying the predict function of the EvalEngine class in eval_engine.py as follows:
```python
def predict(self, data_df, model, batch_eval=False):
    """Make prediction for a trained model.

    Args:
        data_df (DataFrame): A dataset to be evaluated.
        model: A trained model.
        batch_eval (Boolean): A signal to indicate if the model is evaluated in batches.

    Returns:
        array: predicted scores.
    """
    user_ids = data_df[DEFAULT_USER_COL].to_numpy()
    item_ids = data_df[DEFAULT_ITEM_COL].to_numpy()
    if batch_eval:
        n_batch = len(data_df) // self.batch_size + 1
        predictions = np.array([])
        stop_batch = False  # If we need to set a smaller number of batches
        for idx in range(n_batch):
            if not stop_batch:
                start_idx = idx * self.batch_size
                end_idx = min((idx + 1) * self.batch_size, len(data_df))
                if len(data_df) == end_idx + 1:
                    # Exactly one example would be left over after this
                    # batch; absorb it here and skip remaining iterations.
                    end_idx = len(data_df)
                    stop_batch = True
                sub_user_ids = user_ids[start_idx:end_idx]
                sub_item_ids = item_ids[start_idx:end_idx]
                sub_predictions = np.array(
                    model.predict(sub_user_ids, sub_item_ids)
                    .flatten()
                    .to(torch.device("cpu"))
                    .detach()
                    .numpy()
                )
                predictions = np.append(predictions, sub_predictions)
    else:
        predictions = np.array(
            model.predict(user_ids, item_ids)
            .flatten()
            .to(torch.device("cpu"))
            .detach()
            .numpy()
        )
    return predictions
```
Essentially, if the last batch would contain a single example, the patch discards that batch and folds the orphan example into the second-to-last batch. That batch then predicts one extra element, but the code no longer crashes.
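To make the new batching behavior easy to check in isolation, here is a small standalone sketch of the patched boundary logic (batch_bounds is a hypothetical helper written just for this illustration, not part of beta-recsys):

```python
# Standalone sketch of the patched batch boundaries (illustrative only).
def batch_bounds(n, batch_size):
    bounds, idx = [], 0
    while idx * batch_size < n:
        start = idx * batch_size
        end = min((idx + 1) * batch_size, n)
        if n == end + 1:   # a lone trailing example would remain
            end = n        # fold it into the current batch
        bounds.append((start, end))
        if end == n:
            break
        idx += 1
    return bounds

print(batch_bounds(11, 5))  # [(0, 5), (5, 11)] - last batch absorbs the orphan
```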
Context
OS: Linux
Hardware: RTX Titan
Environment: CUDA 11.3
I used my fork of BetaRecsys, available at https://github.com/JavierSanzCruza/beta-recsys. The main functionality is unchanged; the only modification is in the LightGCN model, where I removed the sigmoid on the output layer.
Additional Information
I will submit a merge request for this issue with the solution described above.