HICAI-ZJU / KANO

Code and data for the Nature Machine Intelligence paper "Knowledge graph-enhanced molecular contrastive learning with functional prompt".
MIT License

Does `seed` affect the training when we set customized train/val/test sets? #12

Closed WeiJunZhou23 closed 1 year ago

WeiJunZhou23 commented 1 year ago

When we set up a customized train/val/test set and don't set a value for `seed`, does the `seed` argument still affect how the model is trained? In my case, I didn't set a value for `seed`, but I can still see `Splitting data with seed 1` in `train.log`, and model performance differs slightly between runs. Here is my command:

  python train.py --data_path './dataset_train.csv' --separate_val_path './dataset_val.csv' --separate_test_path './dataset_test.csv' --dataset_type classification --epochs 100 --num_runs 3 --gpu 0 --batch_size 256 --init_lr '1e-4' --ensemble_size 1 --step functional_prompt --exp_name finetune --exp_id activity --checkpoint_path './dumped/pretrained_graph_encoder/original_CMPN_0623_1350_14000th_epoch.pkl' --exp_id "activity"

ZJU-Fangyin commented 1 year ago

Hi, when you specify your own separate train/val/test sets, the model will be trained directly on the data you provided, and the data splitting process is not involved. Therefore, the seed argument does not affect the data the model sees during training in this case.

Please refer to the following code:

  # Split data
  debug(f'Splitting data with seed {args.seed}')
  if args.separate_test_path:
      test_data = get_data(path=args.separate_test_path, args=args, features_path=args.separate_test_features_path, logger=logger)
  if args.separate_val_path:
      val_data = get_data(path=args.separate_val_path, args=args, features_path=args.separate_val_features_path, logger=logger)

  if args.separate_val_path and args.separate_test_path:
      train_data = data
  elif args.separate_val_path:
      train_data, _, test_data = split_data(data=data, split_type=args.split_type, sizes=(0.8, 0.2, 0.0), seed=args.seed, args=args, logger=logger)
  elif args.separate_test_path:
      train_data, val_data, _ = split_data(data=data, split_type=args.split_type, sizes=(0.8, 0.2, 0.0), seed=args.seed, args=args, logger=logger)
  else:
      print('='*100)
      train_data, val_data, test_data = split_data(data=data, split_type=args.split_type, sizes=args.split_sizes, seed=args.seed, args=args, logger=logger)
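
To make the branching above easier to check, here is a minimal, self-contained sketch that mirrors only its control flow. It is not the repository's code: the function name `split_uses_seed` is hypothetical and exists purely for illustration. It reports whether `split_data`, and therefore `args.seed`, would be involved for a given combination of arguments.

```python
from typing import Optional


def split_uses_seed(separate_val_path: Optional[str],
                    separate_test_path: Optional[str]) -> bool:
    """Hypothetical helper mirroring the control flow quoted above.

    Returns True when the quoted snippet would call split_data (which
    receives seed=args.seed), and False when it would not.
    """
    if separate_val_path and separate_test_path:
        # Corresponds to `train_data = data`: the training file is used
        # as-is and no split is performed.
        return False
    # Every other branch calls split_data(..., seed=args.seed, ...).
    return True


# The case from this issue: both separate files are supplied.
both_given = split_uses_seed('./dataset_val.csv', './dataset_test.csv')
# Only a validation file, or no separate files at all.
only_val = split_uses_seed('./dataset_val.csv', None)
neither = split_uses_seed(None, None)

print(both_given, only_val, neither)
```

Only the first call corresponds to the command in this issue, and it is the one case where the seed plays no part in assigning molecules to the train, validation, and test sets.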

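The `Splitting data with seed 1` message is logged unconditionally, before the code checks whether separate files were provided, so it appears in `train.log` even when no split takes place. When both `--separate_val_path` and `--separate_test_path` are given, `split_data` is never called, and the seed has no influence on which molecules end up in each set. The slight variation in model performance across runs could be due to other sources of randomness, such as the initialization of the model parameters and stochasticity in the optimization process.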
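
As a rough illustration of how runs can differ even when the data is fixed, the following toy sketch uses only Python's standard `random` module. It is not KANO's training code: the function `toy_run` and the example molecule names are hypothetical, and the "weights" and "batch order" are stand-ins for parameter initialization and mini-batch shuffling.

```python
import random


def toy_run(seed, data):
    """Hypothetical stand-in for one training run on a fixed dataset."""
    rng = random.Random(seed)  # independent, explicitly seeded generator
    # Stand-in for random parameter initialization.
    weights = [rng.gauss(0.0, 0.1) for _ in range(4)]
    # Stand-in for the order in which examples are visited.
    order = list(data)
    rng.shuffle(order)
    return weights, order


# A fixed, user-provided training set: no splitting is involved.
train_data = ["mol_a", "mol_b", "mol_c", "mol_d", "mol_e"]

run_a = toy_run(seed=1, data=train_data)
run_a_repeat = toy_run(seed=1, data=train_data)
run_b = toy_run(seed=2, data=train_data)

# Same seed and same data reproduce the run exactly.
print(run_a == run_a_repeat)
# A different seed changes the initial weights ...
print(run_a[0] != run_b[0])
# ... while the set of training examples stays the same.
print(sorted(run_a[1]) == sorted(run_b[1]))
```

In this sketch the examples seen are identical in every run; only the random draws differ, which is the kind of effect that can produce small differences in final metrics.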

WeiJunZhou23 commented 1 year ago

Thank you for your clarification, I really appreciate your work.