beta-team / beta-recsys

Beta-RecSys: Build, Evaluate and Tune Automated Recommender Systems
https://beta-recsys.readthedocs.io/en/latest/
MIT License

Problem with batch evaluation #431

Open JavierSanzCruza opened 1 year ago

JavierSanzCruza commented 1 year ago

Describe the bug

I found a bug that, oddly, triggers when training a matrix factorization model (I suppose this might also happen with other models, but MF is where I first detected it). I was training some matrix factorization algorithms when I suddenly hit the following error:

RuntimeError: dimension out of range (expected to be in range of [-1, 0], but got 1)

The error was raised in eval_engine.py, on line 253 (I copy the code of the function below and mark the offending line with a comment):

    def predict(self, data_df, model, batch_eval=False):
        """Make prediction for a trained model.
        Args:
            data_df (DataFrame): A dataset to be evaluated.
            model: A trained model.
            batch_eval (Boolean): A signal to indicate if the model is evaluated in batches.
        Returns:
            array: predicted scores.
        """
        user_ids = data_df[DEFAULT_USER_COL].to_numpy()
        item_ids = data_df[DEFAULT_ITEM_COL].to_numpy()
        if batch_eval:
            n_batch = len(data_df) // self.batch_size + 1
            predictions = np.array([])
            for idx in range(n_batch):
                start_idx = idx * self.batch_size
                end_idx = min((idx + 1) * self.batch_size, len(data_df))
                sub_user_ids = user_ids[start_idx:end_idx]
                sub_item_ids = item_ids[start_idx:end_idx]
                sub_predictions = np.array(
                    model.predict(sub_user_ids, sub_item_ids)  # <-- THE ERROR APPEARED HERE
                    .flatten()
                    .to(torch.device("cpu"))
                    .detach()
                    .numpy()
                )
                predictions = np.append(predictions, sub_predictions)
        else:
            predictions = np.array(
                model.predict(user_ids, item_ids)
                .flatten()
                .to(torch.device("cpu"))
                .detach()
                .numpy()
            )
        return predictions

After some research, I discovered that the error is caused by batches containing a single (user, item) pair. In that case, the arrays sub_user_ids and sub_item_ids have just one element each, and the sub_predictions computation fails (likely because, at some point, a scalar value is returned instead of a vector/list of values).
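
For illustration, here is a minimal sketch of the failure mode I suspect (the shapes and the squeeze() call are my assumptions, not taken from the actual MF code): once the batch dimension has collapsed, reducing along dim 1 raises exactly this kind of error.

    import torch

    # Hypothetical: with a length-1 batch, a squeeze() somewhere inside the
    # model can collapse the batch dimension, leaving a 1-D tensor.
    emb = torch.randn(1, 8)  # scores for a single (user, item) pair
    vec = emb.squeeze()      # shape (8,): the batch dimension is gone
    vec.sum(dim=1)           # dimension out of range (expected to be in range
                             # of [-1, 0], but got 1); raised as RuntimeError
                             # or IndexError depending on the torch version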

To Reproduce

To reproduce, given a dataset, set the batch size in MF so that num_test_interactions % batch_size == 1. That way, the last batch contains exactly one (user, item) pair, and the error triggers.
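
For example (the concrete numbers here are hypothetical; any combination with a remainder of 1 works):

    # hypothetical sizes: 101 test interactions with batch_size = 10
    n_test = 101
    batch_size = 10
    assert n_test % batch_size == 1  # the last batch holds exactly one pair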

Describe your attempts

I have solved the issue by modifying the predict function of the EvalEngine class in eval_engine.py as follows:

    def predict(self, data_df, model, batch_eval=False):
        """Make prediction for a trained model.
        Args:
            data_df (DataFrame): A dataset to be evaluated.
            model: A trained model.
            batch_eval (Boolean): A signal to indicate if the model is evaluated in batches.
        Returns:
            array: predicted scores.
        """
        user_ids = data_df[DEFAULT_USER_COL].to_numpy()
        item_ids = data_df[DEFAULT_ITEM_COL].to_numpy()
        if batch_eval:
            n_batch = len(data_df) // self.batch_size + 1
            predictions = np.array([])
            stop_batch = False # If we need to set a smaller number of batches

            for idx in range(n_batch):
                if not stop_batch:
                    start_idx = idx * self.batch_size
                    end_idx = min((idx + 1) * self.batch_size, len(data_df))

                    if len(data_df) == end_idx + 1:
                        end_idx = len(data_df)
                        stop_batch = True

                    sub_user_ids = user_ids[start_idx:end_idx]
                    sub_item_ids = item_ids[start_idx:end_idx]

                    sub_predictions = np.array(
                        model.predict(sub_user_ids, sub_item_ids)
                        .flatten()
                        .to(torch.device("cpu"))
                        .detach()
                        .numpy()
                    )
                    predictions = np.append(predictions, sub_predictions)
        else:
            predictions = np.array(
                model.predict(user_ids, item_ids)
                .flatten()
                .to(torch.device("cpu"))
                .detach()
                .numpy()
            )
        return predictions

Essentially, if the last batch would contain a single (user, item) pair, the code discards that batch and appends the orphan example to the second-to-last batch. That second-to-last batch then has one additional element in its prediction, but the code no longer crashes.
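
The boundary logic can be tested in isolation with a small sketch (batch_bounds is a hypothetical helper, not part of Beta-RecSys) that mirrors the fix by folding a would-be single-element last batch into the previous one:

    import math

    def batch_bounds(n_rows, batch_size):
        """Yield (start, end) slice bounds, absorbing a would-be
        single-element last batch into the previous one."""
        n_batch = math.ceil(n_rows / batch_size)
        for idx in range(n_batch):
            start = idx * batch_size
            end = min((idx + 1) * batch_size, n_rows)
            if n_rows - end == 1:  # the next batch would hold one orphan pair
                end = n_rows       # absorb it into this batch
            yield start, end
            if end == n_rows:
                break

    # 101 rows with batch_size 10 -> the last slice is (90, 101), size 11
    print(list(batch_bounds(101, 10)))

Using math.ceil here also avoids the empty trailing iteration that n_batch = len(data_df) // self.batch_size + 1 produces when the dataset length is an exact multiple of the batch size.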

Context

I used the version of Beta-RecSys I forked at https://github.com/JavierSanzCruza/beta-recsys. The main functionality remains the same; the only modification is in the LightGCN model, where I removed the sigmoid on the output layer.

Additional Information

I will submit a pull request for this issue with the solution I found.