Hi! Sorry that some parts of our code are unclear.
Our trained models / classifiers save the parameters used for training, and you can retrieve them with:
classifier = TabPFNClassifier(device='cpu', N_ensemble_configurations=32, base_path=tabpfn_path)
classifier.c
The effective batch size is determined by batch_size (8), aggregate_k_gradients (8), and the number of parallel devices used (8, which is not saved). Multiplying these gives the correct batch size of 512.
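As a minimal sketch (not part of the repo), this is the computation, using the config stored on the classifier created above; the device count is an assumption you have to supply yourself, since it is not saved:

```python
# Sketch: reconstruct the effective training batch size from the saved config.
config = classifier.c

num_devices = 8  # assumption: devices used for the published model; NOT stored in the config
effective_batch_size = config['batch_size'] * config['aggregate_k_gradients'] * num_devices
print(effective_batch_size)  # 8 * 8 * 8 = 512
```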
I printed the full config of our published classifier below. Note that functions are converted to strings, so this config can't be used directly to initialize a new configuration.
{'lr': 0.0001,
'dropout': 0.0,
'emsize': 512,
'batch_size': 8,
'nlayers': 12,
'num_features': 100,
'nhead': 4,
'nhid_factor': 2,
'bptt': 1024,
'eval_positions': [972],
'seq_len_used': 50,
'sampling': 'mixed',
'epochs': 400,
'num_steps': 1024,
'verbose': False,
'mix_activations': True,
'nan_prob_unknown_reason_reason_prior': 1.0,
'categorical_feature_p': 0.2,
'nan_prob_no_reason': 0.0,
'nan_prob_unknown_reason': 0.0,
'nan_prob_a_reason': 0.0,
'max_num_classes': 10,
'num_classes': '<function <lambda>.<locals>.<lambda> at 0x7fc575dfb550>',
'noise_type': 'Gaussian',
'balanced': False,
'normalize_to_ranking': False,
'set_value_to_nan': 0.1,
'normalize_by_used_features': True,
'num_features_used': {'uniform_int_sampler_f(3,max_features)': '<function <lambda>.<locals>.<lambda> at 0x7fc575dfb5e0>'},
'num_categorical_features_sampler_a': -1.0,
'differentiable_hyperparameters': {'prior_bag_exp_weights_1': {'distribution': 'uniform',
'min': 1000000.0,
'max': 1000001.0},
'num_layers': {'distribution': 'meta_trunc_norm_log_scaled',
'max_mean': 6,
'min_mean': 1,
'round': True,
'lower_bound': 2},
'prior_mlp_hidden_dim': {'distribution': 'meta_trunc_norm_log_scaled',
'max_mean': 130,
'min_mean': 5,
'round': True,
'lower_bound': 4},
'prior_mlp_dropout_prob': {'distribution': 'meta_beta',
'scale': 0.9,
'min': 0.1,
'max': 5.0},
'noise_std': {'distribution': 'meta_trunc_norm_log_scaled',
'max_mean': 0.3,
'min_mean': 0.0001,
'round': False,
'lower_bound': 0.0},
'init_std': {'distribution': 'meta_trunc_norm_log_scaled',
'max_mean': 10.0,
'min_mean': 0.01,
'round': False,
'lower_bound': 0.0},
'num_causes': {'distribution': 'meta_trunc_norm_log_scaled',
'max_mean': 12,
'min_mean': 1,
'round': True,
'lower_bound': 1},
'is_causal': {'distribution': 'meta_choice', 'choice_values': [True, False]},
'pre_sample_weights': {'distribution': 'meta_choice',
'choice_values': [True, False]},
'y_is_effect': {'distribution': 'meta_choice',
'choice_values': [True, False]},
'prior_mlp_activations': {'distribution': 'meta_choice_mixed',
'choice_values': ["<class 'torch.nn.modules.activation.Tanh'>",
"<class 'torch.nn.modules.linear.Identity'>",
'<function get_diff_causal.<locals>.<lambda> at 0x7fc575dfb670>',
"<class 'torch.nn.modules.activation.ELU'>"]},
'block_wise_dropout': {'distribution': 'meta_choice',
'choice_values': [True, False]},
'sort_features': {'distribution': 'meta_choice',
'choice_values': [True, False]},
'in_clique': {'distribution': 'meta_choice', 'choice_values': [True, False]},
'sampling': {'distribution': 'meta_choice',
'choice_values': ['normal', 'mixed']},
'pre_sample_causes': {'distribution': 'meta_choice',
'choice_values': [True, False]},
'outputscale': {'distribution': 'meta_trunc_norm_log_scaled',
'max_mean': 10.0,
'min_mean': 1e-05,
'round': False,
'lower_bound': 0},
'lengthscale': {'distribution': 'meta_trunc_norm_log_scaled',
'max_mean': 10.0,
'min_mean': 1e-05,
'round': False,
'lower_bound': 0},
'noise': {'distribution': 'meta_choice',
'choice_values': [1e-05, 0.0001, 0.01]},
'multiclass_type': {'distribution': 'meta_choice',
'choice_values': ['value', 'rank']}},
'prior_type': 'prior_bag',
'differentiable': True,
'flexible': True,
'aggregate_k_gradients': 8,
'recompute_attn': True,
'bptt_extra_samples': None,
'dynamic_batch_size': False,
'multiclass_loss_type': 'nono',
'output_multiclass_ordered_p': 0.0,
'normalize_with_sqrt': False,
'new_mlp_per_example': True,
'prior_mlp_scale_weights_sqrt': True,
'batch_size_per_gp_sample': None,
'normalize_ignore_label_too': True,
'differentiable_hps_as_style': False,
'max_eval_pos': 1000,
'random_feature_rotation': True,
'rotate_normalized_labels': True,
'canonical_y_encoder': False,
'total_available_time_in_s': None,
'train_mixed_precision': True,
'efficient_eval_masking': True,
'multiclass_type': 'rank',
'done_part_in_training': 0.8425}
I am also confused about the parameter "steps": in the paper you say there are "18 000 steps", but the config has 'epochs': 400 and 'num_steps': 1024. Where does 18 000 come from? Thanks!
In the paper, we write "steps" for optimization steps, while in the code a step is a computational step that computes the gradients for one batch; however, we aggregate 8 batches here (aggregate_k_gradients). This yields 400 * 1024 / 8 = 51,200 optimization steps in total. We did not finish the training for the final run, but only did 84% of it, so we should have 0.84 * 400 * 1024 / 8 = 43,008 (optimization) steps. This still does not add up to the 18 000 steps though... we likely put in the number from an older model and forgot to change it, sorry for the confusion and thank you for asking. We are changing that in the next paper version as well.
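For clarity, the same accounting as a small sketch (the 0.84 fraction is the rounded value of 'done_part_in_training' from the config above):

```python
epochs = 400
num_steps = 1024                 # computational steps (batches) per epoch
aggregate_k_gradients = 8        # batches accumulated per optimizer update
done_part = 0.84                 # rounded from 'done_part_in_training': 0.8425

optimization_steps = epochs * num_steps / aggregate_k_gradients
print(optimization_steps)              # 51200.0
print(done_part * optimization_steps)  # 43008.0
```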
I'm still a bit confused by the batch_size. The notebook says config['batch_size'] = 8*config['aggregate_k_gradients'], but above you list batch_size=8. Are these the same batch_size? I guess I'm confused by whether the variable stores the value with or without the multiplication by aggregate_k_gradients. For the training it would make sense to store it without, so I'm a bit confused by the notebook code.

If I have one GPU, should that say config['batch_size'] = 64 or config['batch_size'] = 64 * config['aggregate_k_gradients'], assuming that I have aggregate_k_gradients=8?
@noahho Thanks for answering! Could you also explain what bptt means? Thanks!
We use it interchangeably with seq_len :)
@SamuelGabriel could you maybe clarify the use of batch_size that I asked about above and how it interacts with aggregate_k_gradients?
Hi Andreas, You are calling the get_model function afterwards, which divides the passed batch_size by aggregate_k_gradients. Thus you should pass a batch size of 64 (the effective batch size for training, since gradients are only updated after 64 samples), but only GPU RAM for a batch_size of 8 will be used. If your machine has sufficient GPU RAM to hold more samples, you can reduce aggregate_k_gradients. This has no effect on the actual gradient updates, but might execute faster.
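To make the accumulation concrete, here is an illustrative, self-contained sketch (a toy model, not the TabPFN training code) of what aggregate_k_gradients does:

```python
import torch

# Toy stand-ins; the real training uses the TabPFN transformer and prior-sampled data.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = torch.nn.CrossEntropyLoss()

effective_batch_size = 64        # what you pass via the config
aggregate_k_gradients = 8
chunk_size = effective_batch_size // aggregate_k_gradients  # 8 samples in GPU RAM at a time

for step in range(10):           # toy training loop
    optimizer.zero_grad()
    for _ in range(aggregate_k_gradients):
        x = torch.randn(chunk_size, 10)
        y = torch.randint(0, 2, (chunk_size,))
        loss = criterion(model(x), y) / aggregate_k_gradients  # scale so the sum matches one big batch
        loss.backward()           # gradients accumulate across the k chunks
    optimizer.step()              # one optimizer update per effective batch of 64
```

Up to numerics, the update is the same as one step on a full batch of 64; only the peak memory changes.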
Great, thank you for the clarification. I was a bit confused by when the multiplication and division happen, that clarifies it!
Ok so I might still be missing something.

So the batch_size that's passed in via the config is the mathematical batch size per GPU, right? Within get_model this is divided by aggregate_k_gradients, and then we do a loop over aggregate_k_gradients with a tensor size of batch_size//aggregate_k_gradients per group and sum them up.

I have plenty of GPU memory I think (I have 4 A100s). I launch on 4 GPUs with config['batch_size'] = 64 and config['aggregate_k_gradients'] = 8. That should be equivalent to your setup according to what you said above.

Now if I understand you correctly, whatever I specify for aggregate_k_gradients only changes the memory requirements / makes a bigger/smaller tensor and a longer or shorter for-loop. Then, with a lot of RAM, aggregate_k_gradients=1 should be fastest, right? But if I set aggregate_k_gradients=8 an epoch takes ~190s, with aggregate_k_gradients=4 it takes ~380s, and with aggregate_k_gradients=16 an epoch takes ~90s.

I'm leaving config['num_steps'] = 1024//config['aggregate_k_gradients'] for all setups. Is that the issue?
@noahho not sure if you saw this, I'd love some clarification here.
So we pass the config to model_builder.get_model (https://github.com/automl/TabPFN/blob/main/tabpfn/scripts/model_builder.py#L191), which modifies the config and then uses it for the train function (https://github.com/automl/TabPFN/blob/36331227a00cfa016631af1d358bfce2330c9540/tabpfn/train.py#L33).
You can see that in get_model we modify the batch_size using aggregate_k_gradients. The batch size that the train function receives is the actual batch size allocated per GPU.
"Now if I understand you correctly, whatever I specify for aggregate_k_gradients only changes the memory requirements / makes a bigger/smaller tensor and a longer or shorter for-loop." - exactly, this parameter sets into how many chunks that batch size should be divided. If you have more memory available, set aggregate_k_gradients to 1.
"Then, with lot of ram, aggregate_k_gradients=1 should be fasted, right? But if I set aggregate_k_gradients=8 an epoch takes ~190s and with aggregate_k_gradients=4 it takes ~380s and aggregate_k_gradients=16 an epoch takes ~90s.
I'm leaving config['num_steps'] = 1024//config['aggregate_k_gradients'] for all setups. Is that the issue?" I think this explains the speed quite well. You are reducing the number of steps done, with larger aggregate_k_gradients, which makes it faster. I'm not sure where we added this line: "config['num_steps'] = 1024//config['aggregate_k_gradients']", but you should remove that to get more logical scaling. Also the parameter configuration, that I posted above refers to the configuration after whatever division/adaption of parameters we did in the notebook or get_model function. It is the actual config passed to the train function.
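To spell out the accounting, here is a rough sketch, assuming get_model divides the passed batch_size by aggregate_k_gradients as described above:

```python
passed_batch_size = 64
for agg_k in (4, 8, 16):
    num_steps = 1024 // agg_k                    # the notebook line in question
    per_gpu_chunk = passed_batch_size // agg_k   # what each forward/backward pass sees
    print(agg_k, num_steps, per_gpu_chunk)
# agg_k=4  -> 256 batches of 16 per epoch (~380s observed)
# agg_k=8  -> 128 batches of  8 per epoch (~190s observed)
# agg_k=16 ->  64 batches of  4 per epoch (~90s observed)
```

The epoch time tracks the number of batches, so halving num_steps roughly halves the epoch, which is why larger aggregate_k_gradients looked faster here.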
Ok thanks, that confirms my current understanding. I was really thrown off by the fact that PriorFittingCustomPrior.ipynb parametrizes num_steps and batch_size as a function of aggregate_k_gradients, maybe you want to change that?

So my last question is whether it's possible to determine the effective batch size from the config of the model alone? You need to know the number of parallel devices, which is not stored, right?

ps: the speed improvement of decreasing aggregate_k_gradients is minimal :)
Great, I'm happy this clarifies things :) It is a bit confusing, but I'll keep the code like this for now since our code is rather research-y and we couldn't clean up all those places, unfortunately. Your GPU is probably already quite well utilized, so increasing the batch size per step doesn't help much.
@noahho Was the previous TransformersCanDoBayesianInference model also trained for only a proportion of the optimization steps (like the 84% you mentioned above for TabPFN)? And what set of hyperparameters was used for that model? Thanks!
Hey Yicheng, We have a repository with details at https://github.com/automl/TransformersCanDoBayesianInference . Best Noah
Yes, I know. I have similar questions about that project, regarding the hyperparameters used. Could you answer them here, or should I post them there? I already posted a question there before but didn't get a reply.
Hey! So in PriorFittingCustomPrior.ipynb you first set a batch size of 64 and then overwrite it with 4, but the paper specifies it as 512, right? I assume overwriting the 64 with 4 was accidental when adding the plotting code, but I'm not sure if I misunderstood the meaning of batch here or if the notebook just runs a different config than what's reported in the paper.