furiosa-ai / ssm-peft


question: hyperparameters #1

Open puigde opened 2 weeks ago

puigde commented 2 weeks ago

Hi,

Thanks for providing a public implementation for the experimental results of your paper.

I am trying to reproduce the results. Regarding the hyperparameters, the paper states (quote):

For each dataset, we choose the model size of Mamba depending on how challenging the dataset is
and perform a small grid search for one epoch on a subset of the data (1k-2k instances) with learning
rates {4 × 10−1, 2 × 10−1, 1 × 10−1, ..., 1 × 10−5} to find the optimal learning rate of each PEFT
method. Afterward, we train the best setting for each PEFT method on the full data for several epochs
(Table 6) using an NVIDIA RTX 3090 GPU for the 130M model and an NVIDIA A100 for the larger
1.4B and 2.8B models in mixed precision (BF16). We only report the validation metric of the best
epoch during training (early stopping) in our results. We fine-tune the Mamba models (Gu & Dao,
2024) pretrained from Pile (Gao et al., 2020) with AdamW with a linear learning rate decay schedule.
For LoRA we set rank to 8, alpha to 8, and dropout to 0.1 for all experiments. For evaluating NLG
tasks, we employ beam search with five beams and a maximum beam length of 1024.

I am not finding the piece of code where this search is performed. Are the hyperparameters in ssm-peft/<model>/cfg/exps/~ the final ones? Can we assume they are consistent across parts of the model?
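To make sure I understand the procedure, here is roughly what I imagine that search looks like (just my own sketch with placeholder names, not code from this repo):

def grid_search_lr(train_and_eval, learning_rates):
    """Return the learning rate with the best validation metric.

    `train_and_eval` stands in for one short run (one epoch on a
    1k-2k-instance subset) that returns a validation metric.
    """
    best_lr, best_metric = None, float("-inf")
    for lr in learning_rates:
        metric = train_and_eval(lr)
        if metric > best_metric:
            best_lr, best_metric = lr, metric
    return best_lr, best_metric

# Assuming the ellipsis in the paper means {4, 2, 1} x 10^-k down to 1e-5:
lrs = [c * 10.0 ** -e for e in range(1, 6) for c in (4, 2, 1)]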

Thanks in advance.

kevingalim commented 1 week ago

Hi,

Thank you for your interest in our work and for reaching out with your question. You're correct in noting that our public implementation does not include the specific configuration files for the grid search process. The hyperparameters (learning rates) indicated in the ssm-peft/<model>/cfg/exps/~ directory are indeed the final ones, selected based on our internal grid search.

puigde commented 1 week ago

Hi,

Thanks for the response.

For LoRA, going by the expression *lora_outproj*.yaml in your launch command (under ssm-peft/<model>/cfg/exps/~), I assume the config to use is 006_lora_r8_lora_outproj.yaml, and based on your response it contains the final parameters for each dataset. If I am not wrong, that would account for:

common_params:
  peft: cfg/peft/lora/r8/lora_outproj.json
  prec: bf16
  batch_size: 4
glue:
  learning_rate: 0.001
  num_epochs: 10
  model: state-spaces/mamba-130m
cifar:
  learning_rate: 0.004
  num_epochs: 5
  model: state-spaces/mamba-130m
dart:
  learning_rate: 0.004
  num_epochs: 10
  model: state-spaces/mamba-130m
  eval_gen:
    max_length: 1024
    min_length: 5
    num_beams: 5
samsum:
  learning_rate: 0.002
  model: state-spaces/mamba-1.4b
  num_epochs: 10
  eval_gen:
    max_length: 1024
    min_length: 5
    num_beams: 5
spider:
  learning_rate: 0.002
  model: state-spaces/mamba-1.4b
  num_epochs: 10
spider-larger:
  learning_rate: 0.002
  model: state-spaces/mamba-2.8b
  num_epochs: 10
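For completeness, my assumption (not verified against your config loader) is that each per-dataset block is merged on top of common_params to form the run config, roughly like:

import yaml

# My assumption, not verified against the repo's config loader:
# the per-dataset block overrides/extends common_params.
with open("006_lora_r8_lora_outproj.yaml") as f:
    cfg = yaml.safe_load(f)

run_cfg = {**cfg["common_params"], **cfg["samsum"]}
# e.g. {'peft': 'cfg/peft/lora/r8/lora_outproj.json', 'prec': 'bf16',
#       'batch_size': 4, 'learning_rate': 0.002, 'num_epochs': 10,
#       'model': 'state-spaces/mamba-1.4b', 'eval_gen': {...}}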

cfg/peft/lora/r8/lora_outproj.json

{
    "target_modules": [
      "out_proj"
    ],
    "r": 8,
    "lora_alpha": 8,
    "lora_dropout": 0.1,

    "alpha_pattern": {},
    "auto_mapping": null,
    "base_model_name_or_path": null,
    "bias": "none",
    "fan_in_fan_out": false,
    "inference_mode": false,
    "init_lora_weights": true,
    "layers_pattern": null,
    "layers_to_transform": null,
    "loftq_config": {},
    "megatron_config": null,
    "megatron_core": "megatron.core",
    "modules_to_save": null,
    "peft_type": "LORA",
    "rank_pattern": {},
    "revision": null,
    "task_type": "SEQ_2_SEQ_LM",
    "use_dora": false,
    "use_rslora": false
}
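If I read this correctly, it is the serialized form of a Hugging Face peft LoraConfig; building the equivalent config by hand would be roughly the following (my reading, the repo may construct it differently):

from peft import LoraConfig

# My reading of the JSON above as a Hugging Face peft LoraConfig
# (the repo may build it differently; version-specific keys omitted):
lora_cfg = LoraConfig(
    r=8,
    lora_alpha=8,
    lora_dropout=0.1,
    target_modules=["out_proj"],
    bias="none",
    task_type="SEQ_2_SEQ_LM",
)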

Also, I have two follow-up questions:

  1. For GLUE: for all tasks, did you fine-tune and test from the same HF pretrained checkpoint? I have seen some people fine-tune from a different checkpoint (for example, an MNLI checkpoint for other tasks). From the text in the paper I assume not, but I still wanted to make sure.
  2. For Mamba, is linear learning rate decay actually used? The quoted text says so, but I have not found it in the code. For reference, the trainer init in mamba-peft/train.py (l.143) is shown below, with a note on the scheduler default after it:
    
    print("Dropping last batch")
    trainer = MambaTrainer(
    model=model,
    train_dataset=train_data_module.dataset,
    tokenizer=tokenizer,
    args=MambaTrainingArguments(
        learning_rate=learning_rate,
        max_steps=int(num_epochs * its_per_epoch),
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=1,
        gradient_accumulation_steps=gradient_accumulation_steps,
        optim=optim,
        output_dir=output_dir,
        logging_steps=logging_steps,
        dataloader_num_workers=num_data_workers,
        dataloader_prefetch_factor=2,
        eval_accumulation_steps=128,
        info={
            "trainable_params": get_trainable_parameters_ratio(model),
            "cfg_path": cfg_path
        },
        save_strategy="steps" if not no_save else "no",
        evaluation_strategy="steps" if not skip_eval else "no",
        save_steps=int(eval_epochs * its_per_epoch),
        eval_steps=int(eval_epochs * its_per_epoch),
        dataloader_drop_last=True,
        report_to="wandb",
        seed=seed,
    ),
    compute_metrics=compute_metrics,
    data_collator=train_data_module.data_collator,
    eval_dataset=val_data_module.dataset,
    eval_generator=eval_generator,
    min_eval_metric_after_epoch=min_eval_metric_after_epoch,
    )

trainer.train(resume_from_checkpoint=resume_from_checkpoint)
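If MambaTrainingArguments subclasses transformers.TrainingArguments, my understanding is that the default lr_scheduler_type is already "linear" (decay from the initial learning rate to 0 over max_steps, with warmup_steps=0), so linear decay may apply even without being set explicitly. Is that where it comes from? Making it explicit would look like this (illustrative values only):

from transformers import TrainingArguments

# Illustrative only: making the (default) linear schedule explicit.
args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-3,
    lr_scheduler_type="linear",  # default in transformers' TrainingArguments
    warmup_steps=0,              # default; decay starts immediately
    max_steps=1000,
)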



Thanks in advance.