Closed WeiJunZhou23 closed 1 year ago
Hi, when you specify your own separate train/val/test sets, the model is trained directly on the data you provide, and no data splitting takes place. Therefore, the seed argument does not affect the data the model sees during training in this case.
Please refer to the following code:
# Split data
debug(f'Splitting data with seed {args.seed}')
if args.separate_test_path:
    test_data = get_data(path=args.separate_test_path, args=args, features_path=args.separate_test_features_path, logger=logger)
if args.separate_val_path:
    val_data = get_data(path=args.separate_val_path, args=args, features_path=args.separate_val_features_path, logger=logger)

if args.separate_val_path and args.separate_test_path:
    train_data = data
elif args.separate_val_path:
    train_data, _, test_data = split_data(data=data, split_type=args.split_type, sizes=(0.8, 0.2, 0.0), seed=args.seed, args=args, logger=logger)
elif args.separate_test_path:
    train_data, val_data, _ = split_data(data=data, split_type=args.split_type, sizes=(0.8, 0.2, 0.0), seed=args.seed, args=args, logger=logger)
else:
    print('='*100)
    train_data, val_data, test_data = split_data(data=data, split_type=args.split_type, sizes=args.split_sizes, seed=args.seed, args=args, logger=logger)
The "Splitting data with seed 1" line is a default message that is always written to the log, but the seed is not used for splitting when you provide your own separate validation and test sets. The slight variation in model performance across runs could be due to other factors, such as stochasticity in the optimization process or the initialization of the model parameters.
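To make the branching concrete, here is a minimal, self-contained sketch of the same idea. It is a simplified illustration, not the repository's code: choose_splits, random_split, and the 80/10/10 sizes are hypothetical names and values introduced only for this example. It shows that when both a validation and a test set are supplied, the splitting function is never called, so changing the seed cannot change the training data.

```python
import random
from typing import Callable, List, Optional, Tuple

Split = Tuple[List[int], List[int], List[int]]

# Records every seed that actually reaches the splitting function.
calls: List[int] = []


def random_split(data: List[int], seed: int) -> Split:
    """Hypothetical stand-in for a seeded random split (80/10/10)."""
    calls.append(seed)
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    train_end = int(0.8 * len(shuffled))
    val_end = int(0.9 * len(shuffled))
    return shuffled[:train_end], shuffled[train_end:val_end], shuffled[val_end:]


def choose_splits(
    data: List[int],
    separate_val: Optional[List[int]],
    separate_test: Optional[List[int]],
    seed: int,
    split_fn: Callable[[List[int], int], Split],
) -> Split:
    """Simplified mirror of the branching quoted above (an assumption, not the real API)."""
    if separate_val is not None and separate_test is not None:
        # Both sets supplied: training data is used as-is and the seed is ignored.
        return data, separate_val, separate_test
    # Otherwise the seed drives the split; supplied sets override the split result.
    train, val, test = split_fn(data, seed)
    if separate_val is not None:
        val = separate_val
    if separate_test is not None:
        test = separate_test
    return train, val, test


# Two different seeds, both separate sets supplied: identical training data.
train_a, _, _ = choose_splits(list(range(10)), [100], [200], seed=1, split_fn=random_split)
train_b, _, _ = choose_splits(list(range(10)), [100], [200], seed=2, split_fn=random_split)
print(train_a == train_b)  # the two training sets are compared
print(calls)               # seeds that reached the split function
```

In this sketch the seed only matters once at least one of the separate sets is missing, which is the situation the remaining branches of the quoted code handle.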
Thank you for your clarification, I really appreciate your work.
When we set up a customized train/val/test set and don't set a value for seed, does the seed argument still affect how the model is trained? In my case, I didn't set a value for seed, but I can still see "Splitting data with seed 1" in train.log, and model performance differs slightly between runs. Here is my command:
python train.py --data_path './dataset_train.csv' --separate_val_path './dataset_val.csv' --separate_test_path './dataset_test.csv' --dataset_type classification --epochs 100 --num_runs 3 --gpu 0 --batch_size 256 --init_lr '1e-4' --ensemble_size 1 --step functional_prompt --exp_name finetune --exp_id activity --checkpoint_path './dumped/pretrained_graph_encoder/original_CMPN_0623_1350_14000th_epoch.pkl' --exp_id "activity"