RUCAIBox / RecBole

A unified, comprehensive and efficient recommendation library
https://recbole.io/
MIT License
3.27k stars 590 forks

[🐛BUG] When running the model code, the training data turns out to be far smaller than the validation data. #1980

Open HoupingY opened 5 months ago

HoupingY commented 5 months ago

Screenshot: bug

Experiment environment:

Parameter configuration:
gpu_id: '0'
worker: 8
pin_memory: True
use_gpu: True
seed: 2020

model config

embedding_size: 64

dataset config

load_col:
    inter: [user_id, item_id, timestamp]

training settings

epochs: 100
train_batch_size: 2048
learner: adam
learning_rate: 0.001
eval_step: 1
stopping_step: 10

evaluation settings

eval_setting: TO_RS,full

group_by_user: True

split_ratio: [0.8,0.1,0.1]
metrics: ["Recall", "MRR","NDCG","Hit"]
topk: [5, 10, 20]
valid_metric: MRR@10
eval_batch_size: 4096
neg_sampling: ~
transform: ~

zhengbw0324 commented 4 months ago

@HoupingY Hello! Could you provide your detailed configuration information and dataset information?

HoupingY commented 4 months ago

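Hello, sure. I am sending my detailed configuration and dataset information as attachments. test.txt contains part of the data from the dataset. The relevant configuration files are in the archive of RecBole configuration files, which includes overall.yaml and finetune.yaml.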

user_id:token item_id:token rating:float timestamp:float
15870291 581 2 1481633167
15870291 581 5 993
2575641 63 3 1475668272
2575641 63 2 121
19117462 80 5 1483579025
19117462 80 3 31
15874940 4 1 1478945475
15874940 4 4 226
3011722 40 2 1484607859
3011722 40 4 12124
15716021 692 5 1484617309
15716021 692 1 29
14832672 425 3 1483867177
14832672 425 5 255
16519751 424 2 1480105810
16519751 424 2 24
5131351 548 4 1484705253
5131351 548 4 2034
14137330 296 2 1482740367
14137330 296 2 413
15674502 17 5 1477096227
15674502 17 5 115
643662 714 1 1484734684
643662 714 2 3876
1013562 9110 1 1478435680
1013562 9110 3 1403
17335982 3 4 1480215325
17335982 3 2 3999
15518110 215 4 1483771656
15518110 215 5 1075
2874582 464 4 1475842004
2874582 464 3 6
613221 59 3 1484648443
613221 59 3 8
16140570 00 2 1477750570
16140570 00 5 7
6384122 7 3 1482559962
6384122 7 2 496
17005091 79 5 1484621288
17005091 79 5 2532
18728431 51 4 1478865820
18728431 51 2 40
10483710 732 4 1484660891
10483710 732 1 5309
18647020 7 5 1480592567
18647020 7 4 3385
2562420 50 2 1480762078
2562420 50 4 104
18185341 53 5 1476186001
18185341 53 1 266
5636261 346 4 1483751311
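The sample above starts with a header of `name:type` fields followed by one interaction per row. A minimal sketch of reading such a file with the Python standard library is given here for anyone who wants to inspect the data before handing it to RecBole. The tab separator is an assumption (the separator is not visible in the pasted sample), and `read_atomic` is a hypothetical helper, not a RecBole function. Only the first four rows of the sample are used.

```python
import csv
import io

# First rows of the sample, re-typed with tabs; the tab separator is an assumption.
SAMPLE = (
    "user_id:token\titem_id:token\trating:float\ttimestamp:float\n"
    "15870291\t581\t2\t1481633167\n"
    "15870291\t581\t5\t993\n"
    "2575641\t63\t3\t1475668272\n"
    "2575641\t63\t2\t121\n"
)


def read_atomic(text, sep="\t"):
    """Parse a `name:type` header and cast `float` columns to float.

    Hypothetical helper for illustration; columns of any other type stay strings.
    """
    reader = csv.reader(io.StringIO(text), delimiter=sep)
    # Split each header cell into (field name, field type).
    fields = [tuple(col.split(":", 1)) for col in next(reader)]
    rows = []
    for raw in reader:
        row = {}
        for (name, ftype), value in zip(fields, raw):
            row[name] = float(value) if ftype == "float" else value
        rows.append(row)
    return fields, rows


fields, rows = read_atomic(SAMPLE)
```

Printing `rows` makes it easy to see that, in this sample, each user and item pair appears twice with very different timestamp values.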

zhengbw0324 commented 4 months ago

@HoupingY I meant the configuration information printed at runtime and the dataset statistics (number of users, number of interactions, and so on).

HoupingY commented 4 months ago

Sorry, I misunderstood. Below are the configuration information printed at runtime and the dataset statistics:

General Hyper Parameters:
gpu_id = 1
use_gpu = True
seed = 2024
state = INFO
reproducibility = True
data_path = dataset/hamazon
checkpoint_dir = saved
show_progress = True
save_dataset = False
dataset_save_path = None
save_dataloaders = False
dataloaders_save_path = None
log_wandb = False

Training Hyper Parameters:
epochs = 100
train_batch_size = 2048
learner = adam
learning_rate = 0.001
train_neg_sample_args = {'distribution': 'none', 'sample_num': 'none', 'alpha': 'none', 'dynamic': False, 'candidate_num': 0}
eval_step = 1
stopping_step = 10
clip_grad_norm = None
weight_decay = 0.0
loss_decimal_place = 4

Evaluation Hyper Parameters:
eval_args = {'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'mode': 'full', 'group_by': 'user'}
repeatable = True
metrics = ['Recall', 'MRR', 'NDCG', 'Hit']
topk = [5, 10, 20]
valid_metric = @.***
valid_metric_bigger = True
eval_batch_size = 4096
metric_decimal_place = 4
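Note that the printed eval_args use a leave-one-out split ('LS'), not the 0.8/0.1/0.1 ratio from the posted YAML, and that eval_setting = TO_RS,full is listed only under Other Hyper Parameters further down in the same log. This suggests, though the thread does not confirm it, that the eval_setting and split_ratio keys were not applied. A minimal sketch, assuming a RecBole 1.x install where the split is configured through `eval_args`, of a config dict that requests the ratio split explicitly. `launch` is a hypothetical wrapper and the dict is illustrative, not the reporter's actual configuration.

```python
# Illustrative config dict; values are copied from the posted YAML and the log.
config_dict = {
    "eval_args": {
        "split": {"RS": [0.8, 0.1, 0.1]},  # ratio split instead of leave-one-out
        "order": "TO",                     # order interactions by timestamp
        "group_by": "user",
        "mode": "full",                    # full ranking over all items
    },
    "metrics": ["Recall", "MRR", "NDCG", "Hit"],
    "topk": [5, 10, 20],
    "eval_batch_size": 4096,
}


def launch(config):
    """Hypothetical wrapper: starts a run only if RecBole is installed."""
    from recbole.quick_start import run_recbole

    return run_recbole(model="BERT4Rec", dataset="hamazon", config_dict=config)
```

After starting a run, the printed eval_args line shows which split was actually applied.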

Dataset Hyper Parameters:
load_col = {'inter': ['user_id', 'item_id', 'timestamp']}
field_separator =
seq_separator =
USER_ID_FIELD = user_id
ITEM_ID_FIELD = item_id
RATING_FIELD = rating
TIME_FIELD = timestamp
seq_len = None
LABEL_FIELD = label
threshold = None
NEGPREFIX = neg
unload_col = None
unused_col = None
additional_feat_suffix = None
rm_dup_inter = None
val_interval = None
filter_inter_by_user_or_item = True
user_inter_num_interval = [0,inf)
item_inter_num_interval = [0,inf)
alias_of_user_id = None
alias_of_item_id = None
alias_of_entity_id = None
alias_of_relation_id = None
preload_weight = None
normalize_field = None
normalize_all = None
ITEM_LIST_LENGTH_FIELD = item_length
LIST_SUFFIX = _list
MAX_ITEM_LIST_LENGTH = 50
POSITION_FIELD = position_id
HEAD_ENTITY_ID_FIELD = head_id
TAIL_ENTITY_ID_FIELD = tail_id
RELATION_ID_FIELD = relation_id
ENTITY_ID_FIELD = entity_id
benchmark_filename = None

Other Hyper Parameters:
worker = 0
wandb_project = recbole
shuffle = True
require_pow = False
enable_amp = False
enable_scaler = False
transform = mask_itemseq
n_layers = 2
n_heads = 2
hidden_size = 64
inner_size = 256
hidden_dropout_prob = 0.5
attn_dropout_prob = 0.5
hidden_act = gelu
layer_norm_eps = 1e-12
initializer_range = 0.02
mask_ratio = 0.2
loss_type = CE
numerical_features = []
discretization = None
kg_reverse_r = False
entity_kg_num_interval = [0,inf)
relation_kg_num_interval = [0,inf)
MODEL_TYPE = ModelType.SEQUENTIAL
embedding_size = 64
training_neg_sample_num = 0
eval_setting = TO_RS,full
MODEL_INPUT_TYPE = InputType.POINTWISE
eval_type = EvaluatorType.RANKING
single_spec = True
local_rank = 0
device = cuda
eval_neg_sample_args = {'distribution': 'uniform', 'sample_num': 'none'}

01 Mar 15:12 INFO hamazon
The number of users: 13725
Average actions of users: 10.48418828329933
The number of items: 70886
Average actions of items: 2.029837060026804
The number of inters: 143885
The sparsity of the dataset: 99.98521086757891%
Remain Fields: ['user_id', 'item_id', 'timestamp']
01 Mar 15:12 INFO [Training]: train_batch_size = [2048] train_neg_sample_args: [{'distribution': 'none', 'sample_num': 'none', 'alpha': 'none', 'dynamic': False, 'candidate_num': 0}]
01 Mar 15:12 INFO [Evaluation]: eval_batch_size = [4096] eval_args: [{'split': {'LS': 'valid_and_test'}, 'order': 'TO', 'mode': 'full', 'group_by': 'user'}]
01 Mar 15:12 INFO BERT4Rec(
  (item_embedding): Embedding(70887, 64, padding_idx=0)
  (position_embedding): Embedding(51, 64)
  (trm_encoder): TransformerEncoder(
    (layer): ModuleList(
      (0): TransformerLayer(
        (multi_head_attention): MultiHeadAttention(
          (query): Linear(in_features=64, out_features=64, bias=True)
          (key): Linear(in_features=64, out_features=64, bias=True)
          (value): Linear(in_features=64, out_features=64, bias=True)
          (softmax): Softmax(dim=-1)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (dense): Linear(in_features=64, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (out_dropout): Dropout(p=0.5, inplace=False)
        )
        (feed_forward): FeedForward(
          (dense_1): Linear(in_features=64, out_features=256, bias=True)
          (dense_2): Linear(in_features=256, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.5, inplace=False)
        )
      )
      (1): TransformerLayer(
        (multi_head_attention): MultiHeadAttention(
          (query): Linear(in_features=64, out_features=64, bias=True)
          (key): Linear(in_features=64, out_features=64, bias=True)
          (value): Linear(in_features=64, out_features=64, bias=True)
          (softmax): Softmax(dim=-1)
          (attn_dropout): Dropout(p=0.5, inplace=False)
          (dense): Linear(in_features=64, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (out_dropout): Dropout(p=0.5, inplace=False)
        )
        (feed_forward): FeedForward(
          (dense_1): Linear(in_features=64, out_features=256, bias=True)
          (dense_2): Linear(in_features=256, out_features=64, bias=True)
          (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
          (dropout): Dropout(p=0.5, inplace=False)
        )
      )
    )
  )
  (LayerNorm): LayerNorm((64,), eps=1e-12, elementwise_affine=True)
  (dropout): Dropout(p=0.5, inplace=False)
)
Trainable parameters: 4640128
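The dataset statistics in the log can be cross-checked with a few lines of arithmetic. The sketch below is inferred from the printed numbers, not taken from RecBole's source. It assumes that one ID per field is reserved (for example as a padding slot), so the averages divide by the count minus one, while the sparsity uses the full counts.

```python
# Counts copied from the log above.
n_users = 13725    # "The number of users"
n_items = 70886    # "The number of items"
n_inters = 143885  # "The number of inters"

# Assumption inferred from the numbers: averages exclude one reserved ID,
# sparsity is computed over the full user-by-item grid.
avg_user_actions = n_inters / (n_users - 1)
avg_item_actions = n_inters / (n_items - 1)
sparsity_pct = (1 - n_inters / (n_users * n_items)) * 100

print(avg_user_actions, avg_item_actions, sparsity_pct)
```

Under these assumptions the three values agree with the logged averages and sparsity: about 10.48 actions per user, about 2.03 actions per item, and roughly 99.985% sparsity.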

zhengbw0324 commented 4 months ago

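@HoupingY Hello! The evaluation takes a long time because your eval_batch_size is set too small. With full-ranking evaluation, the actual evaluation batch_size = eval_batch_size / item_num. For details, see our answer in issue #1890.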
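The rule above can be turned into a rough estimate of how many evaluation batches a run needs. This is a minimal sketch, not RecBole's actual dataloader code: the floor division and the minimum of one user per batch are assumptions, both function names are hypothetical, and all 13725 users from the log are treated as evaluated users, which is an approximation.

```python
import math


def users_per_eval_batch(eval_batch_size, item_num):
    # Rule from the comment above: eval_batch_size / item_num.
    # Flooring and the minimum of one user per batch are assumptions.
    return max(eval_batch_size // item_num, 1)


def eval_batches(n_users, eval_batch_size, item_num):
    # Number of batches needed to score every user against all items.
    return math.ceil(n_users / users_per_eval_batch(eval_batch_size, item_num))


ITEM_NUM = 70886  # "The number of items" from the log
N_USERS = 13725   # "The number of users" from the log

small = eval_batches(N_USERS, 4096, ITEM_NUM)            # eval_batch_size from the config
large = eval_batches(N_USERS, 256 * ITEM_NUM, ITEM_NUM)  # room for 256 users per batch
```

Under these assumptions, eval_batch_size = 4096 is smaller than the item count, so each batch holds a single user and `small` is 13725. Sizing the batch for 256 users brings `large` down to 54.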

HoupingY commented 4 months ago

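Hello, after I increased eval_batch_size, the training and evaluation times did not change at all. (One odd thing: after I adjusted the dataset, evaluation for sequential recommendation models such as SASRec became very fast (a few seconds), but for other models, especially graph-based ones such as LightGCN and NGCF, evaluation still takes a very long time.)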

NEUYuYang commented 3 months ago

Has this problem been solved? I ran into the same problem with the LightGCN model.

HoupingY commented 3 months ago

Has this problem been solved? I ran into the same problem with the LightGCN model.

Not yet. It feels very strange.