axolotl-ai-cloud / axolotl

https://axolotl-ai-cloud.github.io/axolotl/
Apache License 2.0

Mlflow duplicate logging #2063

Open jsh2581 opened 3 days ago

jsh2581 commented 3 days ago

Please check that this issue hasn't been reported before.

Expected Behavior

One log entry per step.

Current behaviour

Duplicated log entries for the same step.

[screenshot: MLflow metrics showing duplicate entries for each step]

Steps to reproduce

  1. Pull the docker image: winglian/axolotl:main-20241030-py3.11-cu124-2.4.1
  2. Set up mlflow (ghcr.io/mlflow/mlflow:v2.17.2)
  3. Run the axolotl docker container
  4. Prepare the dataset, base model, and training config file
  5. Run `accelerate launch -m axolotl.cli.train my_config.yml`
  6. Go to the mlflow logging dir
  7. Check the log file
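For step 7, the duplication can be confirmed programmatically. This is a small sketch, assuming MLflow's file store layout (one metric file per run under `mlruns/<experiment_id>/<run_id>/metrics/`, with one `<timestamp> <value> <step>` line per logged value); the example path is hypothetical:

```python
# Sketch: count duplicate steps in an MLflow file-store metric file.
# Assumes each line has the form "<timestamp> <value> <step>".
from collections import Counter
from pathlib import Path


def duplicated_steps(metric_file: str) -> dict:
    """Return {step: count} for any step logged more than once."""
    steps = []
    for line in Path(metric_file).read_text().splitlines():
        if not line.strip():
            continue
        _timestamp, _value, step = line.split()
        steps.append(int(step))
    counts = Counter(steps)
    return {step: n for step, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical path; substitute your own experiment/run IDs and metric name.
    print(duplicated_steps("mlruns/0/abc123/metrics/train_loss"))
```

In a healthy run the function should return an empty dict; for the behaviour reported here it would return a count of 2 for every step.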

Config yaml

base_model: meta-llama/Llama-3.2-3B
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_swiglu: true
liger_fused_linear_cross_entropy: false

strict: false
chat_template:
output_dir: /workspace/axolotl/3_model/pretraining
skip_prepare_dataset: true
datasets:
  - path: /workspace/axolotl/2_data/dataset-tokenized-8k/train
    split: train
    type:

sequence_len: 8192
sample_packing: false
pad_to_sequence_len: false

# mlflow configuration if you're using it
mlflow_tracking_uri: http://mlflow-server:5000
mlflow_experiment_name: llama-3B
mlflow_run_name: llama-3B

gradient_accumulation_steps: 1
micro_batch_size: 2
# num_epochs: 1
# max_steps: 200000
optimizer: adamw_torch
lr_scheduler: cosine
lr_scheduler_kwargs:
cosine_min_lr_ratio: 1e-3

learning_rate: 1e-5

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
logging_steps: 1
xformers_attention:
#flash_attention: true

warmup_steps: 20000
# evals_per_epoch: 2
eval_table_size:

save_steps: 40000
debug:
deepspeed:
weight_decay: 0.0
fsdp:
#   - full_shard
#   - auto_wrap
# fsdp_config:
#   fsdp_limit_all_gathers: true
#   fsdp_sync_module_states: true
#   fsdp_offload_params: false
#   fsdp_use_orig_params: false
#   fsdp_cpu_ram_efficient_loading: true
#   fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
#   fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
#   fsdp_state_dict_type: FULL_STATE_DICT
#   fsdp_sharding_strategy: FULL_SHARD
#   fsdp_backward_prefetch: BACKWARD_PRE
special_tokens:
  pad_token: <|end_of_text|>
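One plausible failure mode, not confirmed anywhere in this thread, is that two MLflow-targeting callbacks end up registered on the trainer (for example, an auto-detected logging integration plus an explicitly added one), so every log event is written twice. The class names below are hypothetical stand-ins, purely to illustrate the symptom:

```python
# Hypothetical sketch of the suspected failure mode: two callbacks that both
# forward logs to MLflow produce two metric rows for every training step.


class FakeMetricStore:
    """Stand-in for an MLflow backend: records (metric, value, step) rows."""

    def __init__(self):
        self.rows = []

    def log_metric(self, name, value, step):
        self.rows.append((name, value, step))


class MlflowLoggingCallback:
    """Stand-in for a trainer callback that forwards log events to MLflow."""

    def __init__(self, store):
        self.store = store

    def on_log(self, logs, step):
        for name, value in logs.items():
            self.store.log_metric(name, value, step)


store = FakeMetricStore()
# The bug scenario: the same kind of callback registered twice.
callbacks = [MlflowLoggingCallback(store), MlflowLoggingCallback(store)]

for step in (1, 2):
    for cb in callbacks:  # the trainer fires every callback on each log event
        cb.on_log({"train_loss": 2.5 - 0.1 * step}, step)

# Each step now appears twice in the store, matching the reported symptom.
print(store.rows)
```

If this is what is happening, deduplicating the callback registration (or disabling one of the two integrations) would restore one entry per step.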

Possible solution

No response

Which Operating Systems are you using?

Python Version

3.11

axolotl branch-commit

main/8c3a727f9d60ffd3af385f90bcc3fa3a56398fe1

Acknowledgements

NanoCode012 commented 3 days ago

cc @awhazell, have you perhaps seen any duplicate logging to mlflow recently?