yin214 commented 6 months ago

**Describe the bug**
The FEARec model gets stuck in an infinite loop when training on the sports dataset.

**To reproduce**
Steps to reproduce the bug:

Extra yaml file you introduced

hidden_dropout_prob: 0.5 # (float) The probability of an element to be zeroed.
attn_dropout_prob: 0.5 # (float) The probability of an attention score to be zeroed.

global_ratio: 0.6 # (float) The ratio of frequency components
dual_domain: False # (bool) Frequency domain processing or not
std: False # (bool) Use the specific time index or not
spatial_ratio: 0.1 # (float) The ratio of the spatial domain and frequency domain
fredom: True # (bool) Regularization in the frequency domain or not
fredom_type: None # (str) The type of loss in different scenarios
topk_factor: 5 # (int) To aggregate time delayed sequences with high autocorrelation

epochs: 100 # maximum number of training epochs
train_batch_size: 8192
eval_batch_size: 8192

learning_rate: 0.001

training_neg_sample_num: 1 # number of negative samples

eval_step: 1 # number of training epochs between evaluations
stopping_step: 10
valid_metric: recall@20

topk: [1,5,10,20]

neg_sampling: ~

eval_args: {'split':{'RS': [0.8,0.1,0.1]}, 'order': 'TO', 'mode': 'full'}

4. Your run script
python run_recbole.py --model=FEARec --dataset=sports --config_files=./config_files/fearec.yaml --checkpoint_dir='./saved/FEARec/sports' 
**Expected behavior**
I ran several other datasets and this did not happen with any of them.

**Screenshots**
Training hangs in this state and makes no further progress:
![Screenshot 2024-03-18 215932](https://github.com/RUCAIBox/RecBole/assets/75625475/3e33aff2-0ca9-4f41-ad86-84616a7f9ad9)
The infinite loop appears to be at lines 213 to 223 of the model code:

        while True:
            sample_index = random.choice(targets_index)
            cur_item_list = interaction[self.ITEM_SEQ][i].to("cpu")
            sample_item_list = dataset.inter_feat[self.ITEM_SEQ][sample_index]
            are_equal = torch.equal(cur_item_list, sample_item_list)
            sample_item_length = dataset.inter_feat[self.ITEM_SEQ_LEN][sample_index]
            if not are_equal or lens == 1:
                #print("helllo")
                sem_pos_lengths.append(sample_item_length)
                sem_pos_seqs.append(sample_item_list)
                break
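
The loop above exits only when the sampled sequence differs from the current one or when `lens == 1`. If `lens` is the number of entries in `targets_index` (an assumption, since its definition is not shown in the excerpt) and every candidate holds a sequence identical to the current one, neither condition can become true and `while True` never terminates. A minimal sketch of a sampling step that always terminates is given below. It uses plain Python lists instead of tensors, and `sample_semantic_positive` and its parameters are hypothetical names for illustration; it is not necessarily the change made upstream.

```python
import random


def sample_semantic_positive(cur_seq, candidate_indices, all_seqs, rng=random):
    """Pick an index from candidate_indices whose sequence differs from cur_seq.

    Hypothetical helper: if every candidate sequence is identical to cur_seq,
    fall back to any candidate instead of retrying forever.
    """
    # Keep only candidates whose sequence is not a copy of the current one.
    differing = [idx for idx in candidate_indices if all_seqs[idx] != cur_seq]
    # Fall back to the full candidate list when no differing sequence exists.
    pool = differing if differing else list(candidate_indices)
    return rng.choice(pool)


# Toy data: rows 0 and 1 are duplicates, row 2 differs.
all_seqs = [[1, 2, 0], [1, 2, 0], [3, 2, 0]]
print(sample_semantic_positive([1, 2, 0], [0, 1, 2], all_seqs))
print(sample_semantic_positive([1, 2, 0], [0, 1], all_seqs))
```

Filtering the candidates once replaces the unbounded retry, so the call returns even when a target item only appears with duplicate sequences.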



**Links**
Add a link to code that reproduces the bug, such as Colab or another online Jupyter platform. (Optional)

**Environment (please complete the following information):**
The bug occurs on both of the machines I tried.

yin214 commented 6 months ago

# Basic Information
USER_ID_FIELD: user_id          # (str) Field name of user ID feature.
ITEM_ID_FIELD: item_id          # (str) Field name of item ID feature.
RATING_FIELD: rating            # (str) Field name of rating feature.
TIME_FIELD: timestamp           # (str) Field name of timestamp feature.
seq_len: ~                      # (dict) Field name of sequence feature: maximum length of each sequence
LABEL_FIELD: label              # (str) Expected field name of the generated labels for point-wise dataLoaders. 
threshold: ~                    # (dict) 0/1 labels will be generated according to the pairs.
NEG_PREFIX: neg_                # (str) Negative sampling prefix for pair-wise dataLoaders.

# Sequential Model Needed
ITEM_LIST_LENGTH_FIELD: item_length   # (str) Field name of the feature representing item sequences' length. 
LIST_SUFFIX: _list              # (str) Suffix of field names which are generated as sequences.
MAX_ITEM_LIST_LENGTH: 50       # (int) Maximum length of each generated sequence.
POSITION_FIELD: position_id     # (str) Field name of the generated position sequence.

user_inter_num_interval: "[10,inf)"
item_inter_num_interval: "[10,inf)"

load_col:                       # (dict) The suffix of atomic files: (list) field names to be loaded.
    inter: [user_id, item_id, rating, timestamp]
    item: [item_id, categories]
selected_features: [categories]
item_attribute: categories

TayTroye commented 6 months ago

@yin214 Hello! Thanks for your careful check! We have fixed this bug in #2024

RUCAIBox / RecBole

[🐛BUG] FEARec model gets stuck in an infinite loop during training #2020
