Closed: RuihongQiu closed this issue 3 years ago.
Hi @RuihongQiu, a relatively small train batch size may cause this issue. With `train_batch_size: 1024`, everything is fine.
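Concretely, that means overriding the value in your config file, e.g.:

```yaml
# workaround: a larger batch makes an all-unmasked batch far less likely
train_batch_size: 1024
```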
When the batch size is small and the item sequences in the batch are all short, only a few positions are candidates for masking, so there is a small but real probability that no item in the entire batch gets masked. Currently, a batch with no masked item produces a NaN loss, so a larger batch size reduces the probability of a crash.
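To see this concretely, here is a minimal standalone sketch (illustrative code, not RecBole's implementation) of a cloze-style loss averaged over masked positions only. With an empty mask, the mean is taken over zero elements, i.e. 0/0 = NaN:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: per-position logits over the item vocabulary,
# and a boolean mask marking which positions were masked for the cloze task.
logits = torch.randn(4, 10, 100)               # (batch, seq_len, vocab)
targets = torch.randint(0, 100, (4, 10))       # true item ids
masked = torch.zeros(4, 10, dtype=torch.bool)  # the rare case: nothing masked

# The loss is averaged over masked positions only; selecting with an
# all-False mask leaves zero elements, and the mean of zero elements is NaN.
loss = F.cross_entropy(logits[masked], targets[masked])
print(loss)  # tensor(nan)
```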
Thanks for pointing out this issue; we are considering fixing it in the next version. For now, you can simply increase your train batch size.
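For anyone who wants to patch this locally in the meantime, one possible shape of such a fix (a sketch under my own assumptions, not the actual upcoming patch) is to force at least one masked position whenever the random masking selects none:

```python
import torch

def mask_items(seq, mask_prob=0.2, mask_token=0):
    """Hypothetical cloze masking with a guard (names are illustrative).

    If the Bernoulli masking happens to select no position at all,
    force-mask one random position so the loss is never a mean over
    zero elements (the 0/0 = NaN case described above).
    """
    mask = torch.rand(seq.shape, device=seq.device) < mask_prob
    if not mask.any():  # the rare all-unmasked batch
        idx = torch.randint(0, seq.numel(), (1,))
        mask.view(-1)[idx] = True
    masked_seq = seq.clone()
    masked_seq[mask] = mask_token
    return masked_seq, mask
```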
Yeah, it looks good now. Thank you for the support :)
Describe the bug
I tried to run BERT4Rec, with and without a yaml file (the one from the blog), on the Amazon_Clothing_Shoes_and_Jewelry dataset downloaded from Google Drive, and I get a NaN loss during the first few epochs.

To Reproduce
Steps to reproduce the behavior:
test.yaml is from the blog indicated above:

```yaml
max_user_inter_num: 100
min_user_inter_num: 5
max_item_inter_num: 100
min_item_inter_num: 5
loss_type: "CE"

# Specify which columns to read from which file: here, read the four columns
# user_id, item_id, rating, timestamp from ml-1m.inter, and so on for the rest
load_col:
  inter: [user_id, item_id, rating, timestamp]

# training settings
epochs: 500                 # maximum number of training epochs
train_batch_size: 256       # training batch size
learner: adam               # built-in PyTorch optimizer to use
learning_rate: 0.001        # learning rate
training_neg_sample_num: 0  # number of negative samples
eval_step: 1                # number of evaluations after each training epoch
stopping_step: 10           # early-stopping patience: stop early if the chosen metric has not improved within this many steps

# evaluation settings
eval_setting: TO_LS,full    # sort by time, leave-one-out split, full ranking
metrics: ["Recall", "MRR", "NDCG", "Hit", "Precision"]  # evaluation metrics
valid_metric: MRR@10        # metric used as the early-stopping criterion
eval_batch_size: 256        # evaluation batch size
```
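The config is then passed to RecBole's quick-start entry point; a minimal invocation (assuming the standard run_recbole API) looks like:

```python
from recbole.quick_start import run_recbole

# Train BERT4Rec on the dataset above using the settings in test.yaml.
run_recbole(
    model='BERT4Rec',
    dataset='Amazon_Clothing_Shoes_and_Jewelry',
    config_file_list=['test.yaml'],
)
```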