RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License
3.34k stars 604 forks

[🐛BUG] Negative sampling does not make use of explicit-feedback samples #2066

Open HowardZJU opened 1 month ago

HowardZJU commented 1 month ago

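Describe the bug

Take the ML-1M dataset as an example, where ratings range from 1 to 5.

The generated sparse inter matrix stores only the user-item pairs whose rating is at or above the threshold. User-item pairs rated below the threshold are set to 0, together with the unobserved user-item pairs.

This approach does not make effective use of explicit-feedback negative samples: explicit negatives and unobserved samples are both treated as negatives, with no way to tell them apart.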
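To make the conflation concrete, here is a minimal sketch in plain Python. It does not use RecBole: the dictionary-based matrix, the `build_sparse_matrix` and `lookup` helpers, and the threshold value of 3 are all assumptions for illustration.

```python
# Toy interactions: (user, item, rating) on a 1-5 scale, as in ML-1M.
interactions = [
    ("u1", "i1", 5),
    ("u1", "i2", 2),  # explicit negative: rated, but below the threshold
    ("u2", "i1", 4),
]
THRESHOLD = 3  # hypothetical value, for illustration only


def build_sparse_matrix(rows, threshold):
    """Keep only the entries with rating >= threshold, stored as 1."""
    return {(u, i): 1 for u, i, r in rows if r >= threshold}


def lookup(matrix, user, item):
    """Missing keys read as 0, like the implicit zeros of a sparse matrix."""
    return matrix.get((user, item), 0)


matrix = build_sparse_matrix(interactions, THRESHOLD)

explicit_negative = lookup(matrix, "u1", "i2")  # rated 2 by u1
unobserved = lookup(matrix, "u2", "i2")         # never rated by u2
print(explicit_negative, unobserved)  # the two cases are indistinguishable
```

Once the matrix is built, nothing in it records that `u1` actually rated `i2` poorly, which is the information this issue asks to keep.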

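Questions and requests

  1. Is it possible to obtain the explicit-feedback negative samples, i.e. the samples with rating < threshold, during training?
  2. Is it possible to obtain both the explicit-feedback negative samples and the unobserved samples drawn by negative sampling during training, and to distinguish between the two?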
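The second request could look roughly like the sketch below, written in plain Python rather than against RecBole's sampler classes. The function name `sample_negatives`, its parameters, and the `"explicit"` / `"sampled"` source tags are hypothetical and only illustrate the idea of keeping the two kinds of negatives apart.

```python
import random


def sample_negatives(rated, all_items, threshold, n_unobserved, rng):
    """Return (item, source) pairs for a single user.

    rated:        dict mapping item -> rating given by this user
    all_items:    iterable of every item id in the dataset
    n_unobserved: how many unobserved items to draw at random
    source is "explicit" for rated items below the threshold and
    "sampled" for items the user never rated.
    """
    explicit = [
        (item, "explicit")
        for item, rating in sorted(rated.items())
        if rating < threshold
    ]
    # Only items without any rating are candidates for random sampling.
    unobserved_pool = sorted(set(all_items) - set(rated))
    k = min(n_unobserved, len(unobserved_pool))
    sampled = [(item, "sampled") for item in rng.sample(unobserved_pool, k)]
    return explicit + sampled


rng = random.Random(0)  # seeded only to make the sketch repeatable
negatives = sample_negatives(
    rated={"i1": 5, "i2": 2},
    all_items=["i1", "i2", "i3", "i4"],
    threshold=3,
    n_unobserved=1,
    rng=rng,
)
print(negatives)
```

With the source tag attached, a training loop could, for instance, weight explicit negatives differently from sampled ones.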

How to reproduce

Steps to reproduce the bug: in the quick start, set a breakpoint at the following line and inspect the resulting data.

  train_data, valid_data, test_data = data_preparation(config, dataset)

Environment:

HowardZJU commented 1 month ago

For example, to address the problems raised above, would it be feasible to change the _set_label_by_threshold(self) function so that negative labels are set to -1?

  def _set_label_by_threshold(self):
      """Generate 0/1 labels according to value of features.

      According to ``config['threshold']``, those rows with value lower than threshold will
      be given negative label, while the other will be given positive label.
      See :doc:`../user_guide/data/data_args` for detail arg setting.

      Note:
          Key of ``config['threshold']`` is a field name.
          This field will be dropped after label generation.
      """
      threshold = self.config["threshold"]
      if threshold is None:
          return

      self.logger.debug(f"Set label by {threshold}.")

      if len(threshold) != 1:
          raise ValueError("Threshold length should be 1.")

      self.set_field_property(
          self.label_field, FeatureType.FLOAT, FeatureSource.INTERACTION, 1
      )
      for field, value in threshold.items():
          if field in self.inter_feat:
              self.inter_feat[self.label_field] = (
                  self.inter_feat[field] >= value
              ).astype(int)
          else:
              raise ValueError(f"Field [{field}] not in inter_feat.")
          if field != self.label_field:
              self._del_col(self.inter_feat, field)
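A minimal sketch of the proposed labelling, in plain Python and independent of RecBole. The helper `label_by_threshold` and the constant `SAMPLED_NEGATIVE_LABEL` are hypothetical names. The sketch assumes that 0 would then be reserved for unobserved pairs produced by negative sampling.

```python
def label_by_threshold(ratings, threshold):
    """Map each observed rating to +1 (>= threshold) or -1 (< threshold)."""
    return [1 if rating >= threshold else -1 for rating in ratings]


# Assumed label for unobserved pairs drawn later by negative sampling.
SAMPLED_NEGATIVE_LABEL = 0

observed_labels = label_by_threshold([5, 2, 4, 1], threshold=3)
print(observed_labels)  # → [1, -1, 1, -1]
```

In the pandas code above, the analogous change would be to replace `.astype(int)` with `.astype(int) * 2 - 1`, which maps True to 1 and False to -1. Whether the rest of the library (loss functions, sparse matrix construction, evaluation) tolerates a -1 label is not verified here, so any code that expects 0/1 labels would presumably need adjusting as well.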