Open mfirth-truffle opened 1 month ago
For anyone who stumbles across this in the future: just don't use FSDP.
Did the saved model's total size bloat, or otherwise look much different from the original's size?
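One way to answer the size question without loading the model is to read the safetensors headers directly. This is a minimal sketch using only the standard library. It relies on the documented safetensors layout (an 8-byte little-endian header length followed by a JSON header). The function names `read_safetensors_header` and `checkpoint_summary` are made up for illustration, and it assumes the checkpoint was saved as `*.safetensors` shards (as with `save_safetensors: true`).

```python
import json
import struct
from pathlib import Path


def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file as a dict."""
    with open(path, "rb") as f:
        # The first 8 bytes hold the header length as a little-endian uint64.
        (header_len,) = struct.unpack("<Q", f.read(8))
        return json.loads(f.read(header_len))


def checkpoint_summary(directory):
    """Count tensors and sum their byte sizes across all shards in a directory."""
    n_tensors, n_bytes = 0, 0
    for shard in sorted(Path(directory).glob("*.safetensors")):
        for name, info in read_safetensors_header(shard).items():
            if name == "__metadata__":  # optional free-form metadata entry
                continue
            start, end = info["data_offsets"]  # byte range within the data section
            n_tensors += 1
            n_bytes += end - start
    return n_tensors, n_bytes
```

Running `checkpoint_summary` on both the original model directory and the training output directory and comparing the two results should show whether the saved checkpoint shrank, bloated, or lost tensors.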
Hi @mfirth-truffle
I trained the model with a similar configuration file and was able to run inference without any errors.
I made sure that my FSDP settings are the same as the ones in your yml.
Here is mine:
base_model: meta-llama/Llama-3.1-8B-Instruct
save_safetensors: true
datasets:
  - path: teknium/GPT4-LLM-Cleaned
    type: alpaca
dataset_prepared_path: ./last_run_prepared
output_dir: ./outputs/fft-out
sequence_len: 8192
gradient_accumulation_steps: 1
micro_batch_size: 1
num_epochs: 1
optimizer: adamw_torch
learning_rate: 2e-5
bf16: auto
fp16:
tf32: false
logging_steps: 10
xformers_attention:
flash_attention: true
warmup_steps: 10
evals_per_epoch: 2
save_steps: 2
max_steps: 5
weight_decay: 0.0
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: false
  fsdp_use_orig_params: true
  fsdp_cpu_ram_efficient_loading: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_backward_prefetch: BACKWARD_PRE
special_tokens:
  pad_token: "<|end_of_text|>"
Perhaps your issue has to do with your (presumably) customised tokenizer config? Would you be able to provide me that so I can dig deeper? Thanks!
Expected Behavior
When training completes and I try to run inference with the model, it should load without error.
Current behaviour
The saved model is missing parameters, so it errors out when loading.
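To pin down which parameters are missing, the tensor names in the saved checkpoint can be compared against those of the base model. The sketch below is illustrative only: `tensor_names` and `missing_parameters` are hypothetical helper names, it assumes both directories contain `*.safetensors` shards, and it assumes both checkpoints use the same key naming. A prefix mismatch or tied weights would show up as false positives.

```python
import json
import struct
from pathlib import Path


def tensor_names(directory):
    """Collect tensor names from the headers of all .safetensors shards."""
    names = set()
    for shard in Path(directory).glob("*.safetensors"):
        with open(shard, "rb") as f:
            # 8-byte little-endian header length, then the JSON header itself.
            (header_len,) = struct.unpack("<Q", f.read(8))
            header = json.loads(f.read(header_len))
        names.update(k for k in header if k != "__metadata__")
    return names


def missing_parameters(reference_dir, trained_dir):
    """Names present in the reference checkpoint but absent from the trained one."""
    return sorted(tensor_names(reference_dir) - tensor_names(trained_dir))
```

For example, `missing_parameters("path/to/base-model", "path/to/output_dir")` (both paths are placeholders) returns an empty list when every base-model tensor name is also present in the trained checkpoint.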
Steps to reproduce
Train a model with my config and any pre-tokenized dataset, then try to run inference with it.
Config yaml
Possible solution
No response
Which Operating Systems are you using?
Python Version
3.10
axolotl branch-commit
main