bigcode-project / selfcodealign

[NeurIPS'24] SelfCodeAlign: Self-Alignment for Code Generation
https://arxiv.org/abs/2410.24198
Apache License 2.0
276 stars · 20 forks

Reproducing StarCoder2-Instruct #6

Open mstallone opened 7 months ago

mstallone commented 7 months ago

I am trying to recreate the StarCoder2-Instruct-v0.1 model; however, the model produced by the command provided in the README (copied below) does not match the evaluation results of the StarCoder2-Instruct-v0.1 model on HF.

I actually see quite a bit of discrepancy between the two models' evaluations: HumanEval on your HF version is 7 points higher than on my reproduced model (both models were evaluated locally by me in the same environment).

MODEL_KEY=bigcode/starcoder2-15b
LR=1e-5
EPOCH=4
SEQ_LEN=1280
WARMUP_RATIO=0.05
OUTPUT_DIR=/path/to/output_model
DATASET_FILE=/path/to/50k-dataset.jsonl
accelerate launch -m star_align.train \
    --model_key $MODEL_KEY \
    --model_name_or_path $MODEL_KEY \
    --use_flash_attention True \
    --datafile_paths $DATASET_FILE \
    --output_dir $OUTPUT_DIR \
    --bf16 True \
    --num_train_epochs $EPOCH \
    --max_training_seq_length $SEQ_LEN \
    --pad_to_max_length False \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 64 \
    --group_by_length False \
    --ddp_find_unused_parameters False \
    --logging_steps 1 \
    --log_level info \
    --optim adafactor \
    --max_grad_norm -1 \
    --warmup_ratio $WARMUP_RATIO \
    --learning_rate $LR \
    --lr_scheduler_type linear
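
For reference, the effective global batch size implied by these flags, assuming standard HF Trainer semantics (the process count below is a placeholder for however many GPUs accelerate launches on):

# Effective batch size per optimizer step under the HF Trainer.
per_device_train_batch_size = 1
gradient_accumulation_steps = 64
num_processes = 8  # placeholder: whatever `accelerate launch` starts

print(per_device_train_batch_size * gradient_accumulation_steps * num_processes)  # 512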

Are the parameters in the README correct for the released model? Are you adding anything in your accelerate config, e.g. any model wrappers or something else?

For the data, I just ran:

>>> from datasets import load_dataset
>>> load_dataset("bigcode/self-oss-instruct-sc2-exec-filter-50k", split="train").to_json("/path/to/50k-dataset.jsonl", lines=True)
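
And a quick sanity check on the exported file (making no assumptions about the schema; the path is the same placeholder as above):

>>> import json
>>> with open("/path/to/50k-dataset.jsonl") as f:
...     rows = [json.loads(line) for line in f]
...
>>> len(rows)             # should match the train split's row count
>>> sorted(rows[0].keys())  # column names, to confirm nothing was dropped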

Do you have any ideas on how I can reproduce your model? Thanks!

UniverseFly commented 6 months ago

Hi @mstallone, thanks for asking. Let me attach the logs for each training step here. To best reproduce the HumanEval score, you can follow the steps outlined in the evaluation folder. The v0.1 model might be sensitive to the prompt, but all the scores should be reproducible if you follow these instructions. We are also actively improving the pipeline. Feel free to let us know if you have any further questions.

trainer_state.json

mstallone commented 6 months ago

Thank you so much for the response! If you wouldn't mind, could you also share the training_args.bin? Comparing the logs, my loss is about 0.3 higher than yours 🤔
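
(A rough way to line the two logs up, since trainer_state.json keeps a log_history list whose training entries carry "step" and "loss" keys in the standard HF Trainer format; the paths below are placeholders:)

import json

def loss_by_step(path):
    # log_history is a list of dicts; entries logged at training steps have a "loss" key
    with open(path) as f:
        state = json.load(f)
    return {e["step"]: e["loss"] for e in state["log_history"] if "loss" in e}

mine = loss_by_step("my-run/trainer_state.json")        # placeholder paths
released = loss_by_step("released/trainer_state.json")
for step in sorted(mine.keys() & released.keys()):
    print(step, mine[step], released[step], round(mine[step] - released[step], 4))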

Also, am I loading the data correctly?

UniverseFly commented 6 months ago

Ah, I think it's due to dropout: starcoder2-15b sets the dropout value to 0.1 by default, and we did not apply dropout during finetuning, which is why your loss is 0.3 higher. Sorry, I forgot to mention this in the README; it now has the updated script. Thank you @mstallone for reporting this!
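
For anyone else reproducing this, one way to zero the dropout when loading the model (a sketch, assuming a recent transformers version where the Starcoder2 config exposes attention_dropout, residual_dropout, and embedding_dropout):

import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("bigcode/starcoder2-15b")
# starcoder2-15b defaults these to 0.1; zero them before finetuning
config.attention_dropout = 0.0
config.residual_dropout = 0.0
config.embedding_dropout = 0.0

model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b", config=config, torch_dtype=torch.bfloat16
)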

Your data loading strategy and hyperparameters should be correct.

mstallone commented 6 months ago

Thank you, I appreciate your help here. The loss now looks a lot more aligned with what you saw. I will confirm when I have the final loss and then close the issue.

Also, are you by chance missing gradient checkpointing, at least in your 1-GPU job example?

UniverseFly commented 6 months ago

I don't think I enabled gradient checkpointing. This set of hyperparameters just barely allows starcoder2-15b to fit on one A100 80G. If it helps, here is the JSON form of training_args.bin:

{
  "output_dir": "sc2-15b-ft",
  "overwrite_output_dir": false,
  "do_train": false,
  "do_eval": false,
  "do_predict": false,
  "evaluation_strategy": "no",
  "prediction_loss_only": false,
  "per_device_train_batch_size": 1,
  "per_device_eval_batch_size": 8,
  "per_gpu_train_batch_size": null,
  "per_gpu_eval_batch_size": null,
  "gradient_accumulation_steps": 64,
  "eval_accumulation_steps": null,
  "eval_delay": 0,
  "learning_rate": 1e-05,
  "weight_decay": 0.0,
  "adam_beta1": 0.9,
  "adam_beta2": 0.999,
  "adam_epsilon": 1e-08,
  "max_grad_norm": -1.0,
  "num_train_epochs": 4.0,
  "max_steps": -1,
  "lr_scheduler_type": "linear",
  "lr_scheduler_kwargs": {},
  "warmup_ratio": 0.05,
  "warmup_steps": 0,
  "log_level": "info",
  "log_level_replica": "warning",
  "log_on_each_node": true,
  "logging_dir": "...",
  "logging_strategy": "steps",
  "logging_first_step": false,
  "logging_steps": 1.0,
  "logging_nan_inf_filter": true,
  "save_strategy": "epoch",
  "save_steps": 0.167,
  "save_total_limit": null,
  "save_safetensors": true,
  "save_on_each_node": false,
  "save_only_model": false,
  "no_cuda": false,
  "use_cpu": false,
  "use_mps_device": false,
  "seed": 42,
  "data_seed": null,
  "jit_mode_eval": false,
  "use_ipex": false,
  "bf16": true,
  "fp16": false,
  "fp16_opt_level": "O1",
  "half_precision_backend": "auto",
  "bf16_full_eval": false,
  "fp16_full_eval": false,
  "tf32": null,
  "local_rank": 0,
  "ddp_backend": null,
  "tpu_num_cores": null,
  "tpu_metrics_debug": false,
  "debug": [],
  "dataloader_drop_last": false,
  "eval_steps": null,
  "dataloader_num_workers": 0,
  "dataloader_prefetch_factor": null,
  "past_index": -1,
  "run_name": "sc2-15b-ft",
  "disable_tqdm": false,
  "remove_unused_columns": true,
  "label_names": null,
  "load_best_model_at_end": false,
  "metric_for_best_model": null,
  "greater_is_better": null,
  "ignore_data_skip": false,
  "fsdp": [],
  "fsdp_min_num_params": 0,
  "fsdp_config": {
    "min_num_params": 0,
    "xla": false,
    "xla_fsdp_v2": false,
    "xla_fsdp_grad_ckpt": false
  },
  "fsdp_transformer_layer_cls_to_wrap": null,
  "accelerator_config": {
    "split_batches": false,
    "dispatch_batches": null,
    "even_batches": true,
    "use_seedable_sampler": true
  },
  "deepspeed": null,
  "label_smoothing_factor": 0.0,
  "optim": "adafactor",
  "optim_args": null,
  "adafactor": false,
  "group_by_length": false,
  "length_column_name": "length",
  "report_to": [
    "tensorboard"
  ],
  "ddp_find_unused_parameters": false,
  "ddp_bucket_cap_mb": null,
  "ddp_broadcast_buffers": null,
  "dataloader_pin_memory": true,
  "dataloader_persistent_workers": false,
  "skip_memory_metrics": true,
  "use_legacy_prediction_loop": false,
  "push_to_hub": false,
  "resume_from_checkpoint": null,
  "hub_model_id": null,
  "hub_strategy": "every_save",
  "hub_token": "<HUB_TOKEN>",
  "hub_private_repo": false,
  "hub_always_push": false,
  "gradient_checkpointing": false,
  "gradient_checkpointing_kwargs": null,
  "include_inputs_for_metrics": false,
  "eval_do_concat_batches": true,
  "fp16_backend": "auto",
  "push_to_hub_model_id": null,
  "push_to_hub_organization": null,
  "push_to_hub_token": "<PUSH_TO_HUB_TOKEN>",
  "mp_parameters": "",
  "auto_find_batch_size": false,
  "full_determinism": false,
  "torchdynamo": null,
  "ray_scope": "last",
  "ddp_timeout": 1800,
  "torch_compile": false,
  "torch_compile_backend": null,
  "torch_compile_mode": null,
  "dispatch_batches": null,
  "split_batches": null,
  "include_tokens_per_second": false,
  "include_num_input_tokens_seen": false,
  "neftune_noise_alpha": null,
  "optim_target_modules": null
}
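
(If you want to diff this against your own run: training_args.bin is a pickled TrainingArguments object, so a dump along these lines should work, assuming compatible transformers/torch versions:)

import torch

# training_args.bin is pickled by the HF Trainer; weights_only=False is needed
# on newer torch versions to unpickle a full Python object.
args = torch.load("training_args.bin", weights_only=False)
print(args.to_json_string())  # JSON dump for an easy diff against the above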