RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] The number of negative items are not changed. #1646

Open JimLiu96 opened 1 year ago

JimLiu96 commented 1 year ago

Describe the bug When I set the number of training negative items to 5 and ran the BPR model, printing the number of negative items retrieved for training, I found that the number of sampled negative items is still the same as the number of positive items. I am not sure whether this is a bug or a misuse of the API on my part. Could you please help me resolve this issue?

To Reproduce Steps to reproduce the behavior:

  1. extra yaml file:

    data_path: ./dataset
    embedding_size: 64
    dropout_prob: 0.2
    epochs: 500
    eval_step: 20
    learning_rate: 1e-3
    train_batch_size: 1024
    training_neg_sample_num: 5
    train_neg_sample_args:
      num: 5
      distribution: uniform
      sample_num: 5
      alpha: 1.0
      dynamic: False
      candidate_num: 5
  2. I just run `python run_recbole.py --model=BPR --dataset=Beauty --config_files "bpr_config.yaml"`. Within the `calculate_loss` function of the BPR model, I print the shape of both the positive items and the negative items:

    def calculate_loss(self, interaction):
        user = interaction[self.USER_ID]
        pos_item = interaction[self.ITEM_ID]
        neg_item = interaction[self.NEG_ITEM_ID]
        # print('keys:', interaction.interaction.keys())
        print('pos:', pos_item.shape)
        print('neg:', neg_item.shape)
        # print('neg id:', self.NEG_ITEM_ID)
        ...
  3. The output, however, shows that

    ...
    pos: torch.Size([1020])
    neg: torch.Size([1020])
    pos: torch.Size([1020])
    neg: torch.Size([1020]) 
    ...

which indicates that the number of positive and negative items is the same.
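Incidentally, the reported shape of 1020 (rather than the configured `train_batch_size` of 1024) is itself consistent with 5 negatives per positive. A minimal sketch of the arithmetic, assuming the dataloader packs `batch_size // 5` positives per step and repeats each once per negative (this packing rule is an assumption, not stated in the log):

```python
# Assumption: with pairwise negative sampling, the dataloader trims each
# batch to a whole number of (positive, 5-negative) groups.
train_batch_size = 1024
neg_per_pos = 5

positives_per_step = train_batch_size // neg_per_pos  # 204 positives
rows_per_step = positives_per_step * neg_per_pos      # 204 * 5 = 1020
print(rows_per_step)  # 1020, matching torch.Size([1020]) above
```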

  4. The output config is as follows:
    
    General Hyper Parameters:
    gpu_id = 0
    use_gpu = True
    seed = 0
    state = INFO
    reproducibility = True
    data_path = ./dataset/Beauty
    show_progress = True

    Training Hyper Parameters:
    checkpoint_dir = saved
    epochs = 500
    train_batch_size = 1024
    learner = adam
    learning_rate = 0.001
    training_neg_sample_num = 5
    training_neg_sample_distribution = uniform
    eval_step = 20
    stopping_step = 10
    clip_grad_norm = None
    weight_decay = 1e-06
    draw_loss_pic = False
    loss_decimal_place = 4

    Evaluation Hyper Parameters:
    eval_setting = RO_RS,full
    group_by_user = True
    split_ratio = [0.8, 0.1, 0.1]
    leave_one_num = 2
    real_time_process = False
    metrics = ['Recall', 'NDCG']
    topk = [10, 20, 50]
    valid_metric = NDCG@20
    eval_batch_size = 4096
    metric_decimal_place = 4

    Dataset Hyper Parameters:
    field_separator =
    seq_separator =
    USER_ID_FIELD = user_id
    ITEM_ID_FIELD = item_id
    RATING_FIELD = rating
    TIME_FIELD = timestamp
    seq_len = None
    LABEL_FIELD = label
    threshold = None
    NEG_PREFIX = neg
    load_col = {'inter': ['user_id', 'item_id', 'timestamp']}
    unload_col = None
    unused_col = None
    additional_feat_suffix = None
    lowest_val = None
    highest_val = None
    equal_val = None
    not_equal_val = None
    max_user_inter_num = None
    min_user_inter_num = 5
    max_item_inter_num = None
    min_item_inter_num = 5
    fields_in_same_space = None
    preload_weight = None
    normalize_field = None
    normalize_all = None
    ITEM_LIST_LENGTH_FIELD = item_length
    LIST_SUFFIX = _list
    MAX_ITEM_LIST_LENGTH = 50
    POSITION_FIELD = position_id
    HEAD_ENTITY_ID_FIELD = head_id
    TAIL_ENTITY_ID_FIELD = tail_id
    RELATION_ID_FIELD = relation_id
    ENTITY_ID_FIELD = entity_id

    Other Hyper Parameters:
    valid_metric_bigger = True
    rm_dup_inter = None
    filter_inter_by_user_or_item = True
    SOURCE_ID_FIELD = source_id
    TARGET_ID_FIELD = target_id
    benchmark_filename = None
    MODEL_TYPE = ModelType.GENERAL
    embedding_size = 64
    dropout_prob = 0.2
    train_neg_sample_args = {'strategy': 'by', 'by': 5, 'distribution': 'uniform'}
    MODEL_INPUT_TYPE = InputType.PAIRWISE
    eval_type = EvaluatorType.RANKING
    device = cuda

Ethan-TZ commented 1 year ago
@JimLiu96 Hello, thanks for your attention to RecBole! This is normal: the size of ITEM_ID and NEG_ITEM_ID in the interaction is always the same, so the tensor shapes alone cannot reflect the negative sampling ratio. For example, suppose we have only two positive items, i.e. 1 and 2, and we set the number of training negative items to 3. The total training set could be denoted as <1, 3 4 5> and <2, 7 8 9>. The interaction is in fact formulated as:

    ITEM_ID    NEG_ITEM_ID
    1          3
    1          4
    1          5
    2          7
    2          8
    2          9

Therefore, you can determine the number of negative items by counting the occurrences of each positive item.
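As a sketch of that check (the tensors here are toy values mirroring the example above, not actual RecBole output), counting how often each positive item ID occurs in a batch recovers the negative-sampling ratio:

```python
import torch

# Toy pairwise batch: each positive item ID is repeated once per sampled
# negative, so the pos and neg tensors always have identical shapes.
pos_item = torch.tensor([1, 1, 1, 2, 2, 2])
neg_item = torch.tensor([3, 4, 5, 7, 8, 9])
assert pos_item.shape == neg_item.shape

# Occurrences of each positive item = number of negatives drawn for it.
ids, counts = torch.unique(pos_item, return_counts=True)
print(dict(zip(ids.tolist(), counts.tolist())))  # {1: 3, 2: 3}
```

Printing such a count inside `calculate_loss` would show every positive item appearing 5 times with the reporter's configuration.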