mstallone opened 7 months ago

I am trying to recreate the StarCoder2-Instruct-v0.1 model; however, the model produced by the command provided in the README (copied below) does not match the evaluation of the StarCoder2-Instruct-v0.1 model on HF. I actually see quite a bit of discrepancy between the two models' evaluations: `humaneval` on your HF version is 7 points higher than on my reproduced model (both models were evaluated locally by me in the same environment).

Are the parameters in the README correct for the released model? Are you adding anything in your `accelerate` config, i.e. any model wrappers or something else?

For the data, I just ran:

Do you have any ideas on how I can reproduce your model? Thanks!
Hi @mstallone, thanks for asking. Let me attach the logs for each training step here. To best reproduce the HumanEval score, you can follow the steps outlined in the `evaluation` folder. The v0.1 model might be sensitive to the prompt, but all the scores should be reproducible if you follow these instructions. We are also actively improving the pipeline. Feel free to let us know if you have any further questions.
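For a rough picture of an evaluation-style run, here is a minimal greedy-generation sketch; the scripts in the `evaluation` folder remain the authoritative recipe, and this assumes the released tokenizer ships a chat template:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Minimal sketch of greedy, evaluation-style generation with the released
# checkpoint. Assumes the tokenizer ships a chat template; the evaluation
# folder is the authoritative recipe for scoring HumanEval.
ckpt = "bigcode/starcoder2-15b-instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(
    ckpt, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Write a function that checks if a number is prime."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```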
Thank you so much for the response! If you wouldn't mind, could you also share the `training_args.bin`? Comparing the logs, my loss is about 0.3 higher than what you have 🤔

Also, am I loading the data correctly?
Ah, I think it's due to the dropout: `starcoder2-15b` sets the dropout value to 0.1 by default. We did not apply dropout during finetuning, so that's why your loss is 0.3 higher. Sorry, I forgot to mention this in the README; it now has the updated script. Thank you @mstallone for reporting this!

Your data loading strategy and hyperparameters should be correct.
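For reference, a minimal sketch of zeroing out dropout when loading the base model; the field names assume the Starcoder2 config in recent transformers versions:

```python
from transformers import AutoModelForCausalLM

# Minimal sketch: override the checkpoint's default 0.1 dropout before
# finetuning. Field names assume the transformers Starcoder2 config.
model = AutoModelForCausalLM.from_pretrained(
    "bigcode/starcoder2-15b",
    residual_dropout=0.0,
    embedding_dropout=0.0,
    attention_dropout=0.0,
)
```

With dropout active, the training loss is computed on a perturbed network, which is why it sits noticeably above a no-dropout run.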
Thank you, I appreciate your help here. The loss looks a lot more aligned with what you saw. I will confirm when I have the final loss and then close the issue.

Also, are you by any chance missing gradient checkpointing, at least in your 1-GPU job example?
I don't think I enabled gradient checkpointing. This set of hyperparameters just barely allows starcoder2-15b to fit on one A100 80G. If it helps, here is the JSON form of `training_args.bin`:
```json
{
  "output_dir": "sc2-15b-ft",
  "overwrite_output_dir": false,
  "do_train": false,
  "do_eval": false,
  "do_predict": false,
  "evaluation_strategy": "no",
  "prediction_loss_only": false,
  "per_device_train_batch_size": 1,
  "per_device_eval_batch_size": 8,
  "per_gpu_train_batch_size": null,
  "per_gpu_eval_batch_size": null,
  "gradient_accumulation_steps": 64,
  "eval_accumulation_steps": null,
  "eval_delay": 0,
  "learning_rate": 1e-05,
  "weight_decay": 0.0,
  "adam_beta1": 0.9,
  "adam_beta2": 0.999,
  "adam_epsilon": 1e-08,
  "max_grad_norm": -1.0,
  "num_train_epochs": 4.0,
  "max_steps": -1,
  "lr_scheduler_type": "linear",
  "lr_scheduler_kwargs": {},
  "warmup_ratio": 0.05,
  "warmup_steps": 0,
  "log_level": "info",
  "log_level_replica": "warning",
  "log_on_each_node": true,
  "logging_dir": "...",
  "logging_strategy": "steps",
  "logging_first_step": false,
  "logging_steps": 1.0,
  "logging_nan_inf_filter": true,
  "save_strategy": "epoch",
  "save_steps": 0.167,
  "save_total_limit": null,
  "save_safetensors": true,
  "save_on_each_node": false,
  "save_only_model": false,
  "no_cuda": false,
  "use_cpu": false,
  "use_mps_device": false,
  "seed": 42,
  "data_seed": null,
  "jit_mode_eval": false,
  "use_ipex": false,
  "bf16": true,
  "fp16": false,
  "fp16_opt_level": "O1",
  "half_precision_backend": "auto",
  "bf16_full_eval": false,
  "fp16_full_eval": false,
  "tf32": null,
  "local_rank": 0,
  "ddp_backend": null,
  "tpu_num_cores": null,
  "tpu_metrics_debug": false,
  "debug": [],
  "dataloader_drop_last": false,
  "eval_steps": null,
  "dataloader_num_workers": 0,
  "dataloader_prefetch_factor": null,
  "past_index": -1,
  "run_name": "sc2-15b-ft",
  "disable_tqdm": false,
  "remove_unused_columns": true,
  "label_names": null,
  "load_best_model_at_end": false,
  "metric_for_best_model": null,
  "greater_is_better": null,
  "ignore_data_skip": false,
  "fsdp": [],
  "fsdp_min_num_params": 0,
  "fsdp_config": {
    "min_num_params": 0,
    "xla": false,
    "xla_fsdp_v2": false,
    "xla_fsdp_grad_ckpt": false
  },
  "fsdp_transformer_layer_cls_to_wrap": null,
  "accelerator_config": {
    "split_batches": false,
    "dispatch_batches": null,
    "even_batches": true,
    "use_seedable_sampler": true
  },
  "deepspeed": null,
  "label_smoothing_factor": 0.0,
  "optim": "adafactor",
  "optim_args": null,
  "adafactor": false,
  "group_by_length": false,
  "length_column_name": "length",
  "report_to": [
    "tensorboard"
  ],
  "ddp_find_unused_parameters": false,
  "ddp_bucket_cap_mb": null,
  "ddp_broadcast_buffers": null,
  "dataloader_pin_memory": true,
  "dataloader_persistent_workers": false,
  "skip_memory_metrics": true,
  "use_legacy_prediction_loop": false,
  "push_to_hub": false,
  "resume_from_checkpoint": null,
  "hub_model_id": null,
  "hub_strategy": "every_save",
  "hub_token": "<HUB_TOKEN>",
  "hub_private_repo": false,
  "hub_always_push": false,
  "gradient_checkpointing": false,
  "gradient_checkpointing_kwargs": null,
  "include_inputs_for_metrics": false,
  "eval_do_concat_batches": true,
  "fp16_backend": "auto",
  "push_to_hub_model_id": null,
  "push_to_hub_organization": null,
  "push_to_hub_token": "<PUSH_TO_HUB_TOKEN>",
  "mp_parameters": "",
  "auto_find_batch_size": false,
  "full_determinism": false,
  "torchdynamo": null,
  "ray_scope": "last",
  "ddp_timeout": 1800,
  "torch_compile": false,
  "torch_compile_backend": null,
  "torch_compile_mode": null,
  "dispatch_batches": null,
  "split_batches": null,
  "include_tokens_per_second": false,
  "include_num_input_tokens_seen": false,
  "neftune_noise_alpha": null,
  "optim_target_modules": null
}
```
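If it helps for comparison, `training_args.bin` is just a pickled `TrainingArguments` object, so you can dump your own run's arguments to JSON the same way. A minimal sketch:

```python
import torch

# Minimal sketch: training_args.bin is a pickled TrainingArguments object.
# weights_only=False is needed on newer PyTorch, where torch.load defaults
# to weights-only deserialization.
args = torch.load("sc2-15b-ft/training_args.bin", weights_only=False)
print(args.to_json_string())
```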