RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License

[🐛BUG] BERT4Rec Couldn't Run with Loss NAN. #882

Closed · RuihongQiu closed this issue 3 years ago

RuihongQiu commented 3 years ago

**Describe the bug**
I tried to run BERT4Rec both with and without the extra yaml file (the one from the blog) on the Amazon_Clothing_Shoes_and_Jewelry dataset downloaded from Google Drive, and the loss becomes NaN during the first few epochs.

**To Reproduce**
Steps to reproduce the behavior:

  1. extra yaml file: the `test.yaml` below is the one from the blog indicated above.
    
    # model config
    embedding_size: 32

    # dataset config
    field_separator: "\t"           # separator between fields in the dataset files
    seq_separator: " "              # separator inside token_seq or float_seq fields
    USER_ID_FIELD: user_id          # user id field
    ITEM_ID_FIELD: item_id          # item id field
    RATING_FIELD: rating            # rating field
    TIME_FIELD: timestamp           # timestamp field
    NEG_PREFIX: neg_                # prefix for negative-sample fields
    LABEL_FIELD: label              # label field
    ITEM_LIST_LENGTH_FIELD: item_length  # sequence length field
    LIST_SUFFIX: _list              # suffix appended to sequence fields
    MAX_ITEM_LIST_LENGTH: 50        # maximum sequence length
    POSITION_FIELD: position_id     # field for the generated position ids
    max_user_inter_num: 100
    min_user_inter_num: 5
    max_item_inter_num: 100
    min_item_inter_num: 5
    loss_type: "CE"

    # which columns to load from which file; here the four columns user_id,
    # item_id, rating and timestamp are read from ml-1m.inter, and so on
    # for the other files
    load_col:
        inter: [user_id, item_id, rating, timestamp]

    # training settings
    epochs: 500                     # maximum number of training epochs
    train_batch_size: 256           # training batch size
    learner: adam                   # built-in PyTorch optimizer to use
    learning_rate: 0.001            # learning rate
    training_neg_sample_num: 0      # number of negative samples
    eval_step: 1                    # number of evaluations after each training epoch
    stopping_step: 10               # early stopping: stop if the chosen metric does not improve within this many steps

    # evaluation settings
    eval_setting: TO_LS,full        # sort by time, leave-one-out split, full ranking
    metrics: ["Recall", "MRR", "NDCG", "Hit", "Precision"]  # evaluation metrics
    valid_metric: MRR@10            # metric used as the early-stopping criterion
    eval_batch_size: 256            # evaluation batch size


2. your code
   No extra code.
3. script for running

   ```bash
   python run_recbole.py --model=BERT4Rec --config_files='test.yaml'
   ```

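For reference, the same run can also be launched from Python; a minimal sketch, assuming the `recbole.quick_start.run_recbole` helper shipped with RecBole 0.2.x:

```python
# Minimal sketch: launch the same experiment from Python instead of the CLI
# script (assumes recbole.quick_start.run_recbole, available in RecBole 0.2.x).
from recbole.quick_start import run_recbole

# Equivalent to: python run_recbole.py --model=BERT4Rec --config_files='test.yaml'
run_recbole(model='BERT4Rec', config_file_list=['test.yaml'])
```
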
**Expected behavior**
Training should proceed without the loss becoming NaN.

**Screenshots**
![image](https://user-images.githubusercontent.com/17882988/125295997-2c587d00-e369-11eb-98f9-765600c958a9.png)


**Desktop (please complete the following information):**
 - OS: Ubuntu 18
- RecBole Version: 0.2.1
 - Python Version: 3.6.9
- PyTorch Version: 1.7.1
- cudatoolkit Version: 10.1
hyp1231 commented 3 years ago

Hi @RuihongQiu, a relatively small train batch size may cause this issue. When `train_batch_size: 1024` is set, everything is fine.

When the batch size is small, the item sequences in a batch may all be short, and since the mask probability is small, there is a chance (small, but indeed possible) that none of the items in the batch gets masked. Currently, a batch with no masked item produces a NaN loss, so a larger batch size reduces the probability of this crash.
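
To illustrate the failure mode, here is a small sketch (not RecBole's exact code) of a Cloze-style cross-entropy loss averaged only over masked positions; if the random masking happens to select no position at all, the denominator is zero and the loss becomes NaN:

```python
import torch
import torch.nn as nn

# Toy shapes; the variable names are illustrative, not RecBole internals.
n_items, seq_len, batch_size = 100, 50, 4
logits = torch.randn(batch_size * seq_len, n_items)           # scores for every position
targets = torch.randint(0, n_items, (batch_size * seq_len,))  # ground-truth item ids

# mask_flags marks the positions replaced by [mask]; in this (unlucky) batch
# the random Cloze sampling selected no position at all.
mask_flags = torch.zeros(batch_size * seq_len)

loss_fct = nn.CrossEntropyLoss(reduction='none')
per_position_loss = loss_fct(logits, targets)

# Average only over the masked positions: 0 / 0 = NaN.
loss = torch.sum(per_position_loss * mask_flags) / torch.sum(mask_flags)
print(loss)  # tensor(nan)
```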

Thanks for pointing out this issue; we are considering fixing it in the next version. For now, you can just increase your train batch size.
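
For example, with the value mentioned above, the override in `test.yaml` would simply be:

```yaml
# training settings
train_batch_size: 1024   # a larger batch makes an all-unmasked batch far less likely
```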

RuihongQiu commented 3 years ago

Yea, it looks good now. Thank you for the support :)