foundation-model-stack / fms-hf-tuning

🚀 Collection of tuning recipes with HuggingFace SFTTrainer and PyTorch FSDP.

bug: docker build uses accelerate 0.34.0 which causes crash #332

Closed HarikrishnanBalagopal closed 3 weeks ago

HarikrishnanBalagopal commented 1 month ago

Overview

The accelerate version constraint in `pyproject.toml` allows 0.34.0 to be installed when the Docker image is built, and that version crashes during FSDP training:

https://github.com/foundation-model-stack/fms-hf-tuning/blob/5c09dbc9d38e9479a7f720e9d6b316243a128343/pyproject.toml#L30
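For quick verification, a minimal sketch (a hypothetical helper, not part of fms-hf-tuning) that checks the installed accelerate version inside the container and fails fast when the problematic release is resolved:

# check_accelerate.py -- hypothetical helper, not part of the repo
from importlib.metadata import version

installed = version("accelerate")
print(f"accelerate=={installed}")
if installed == "0.34.0":
    # 0.34.0 is the version reported in this issue's traceback below
    raise SystemExit("accelerate 0.34.0 crashes FSDP auto-wrap; use 0.33.0 instead")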

Steps to reproduce

  1. Build the Docker image.
  2. Run `pip freeze` inside the container; it reports accelerate==0.34.0.
  3. Start a training job; an example command is given below:
accelerate launch \
  --use_fsdp \
  --fsdp_auto_wrap_policy=TRANSFORMER_BASED_WRAP \
  --fsdp_forward_prefetch=false \
  --fsdp_offload_params=false \
  --fsdp_sharding_strategy=FULL_SHARD \
  --fsdp_state_dict_type=FULL_STATE_DICT \
  --fsdp_cpu_ram_efficient_loading=true \
  --fsdp_sync_module_states=true \
  --rdzv_backend=static \
  --same_network \
  --num_processes=8 \
  --num_machines=${WORLD_SIZE} \
  --mixed_precision=no \
  --dynamo_backend=no \
  --machine_rank=${RANK} \
  --main_process_ip=${MASTER_ADDR} \
  --main_process_port=${MASTER_PORT} \
  -m tuning.sft_trainer \
  --adam_beta1="0.9" \
  --adam_beta2="0.998" \
  --adam_epsilon="1e-10" \
  --aim_repo="${AIMSTACK_DB}" \
  --data_config_path="dataset_config.yaml" \
  --dataset_text_field="contents" \
  --evaluation_strategy="no" \
  --experiment="experiment1" \
  --gradient_accumulation_steps="2" \
  --gradient_checkpointing="true" \
  --learning_rate="1e-06" \
  --logging_steps="5" \
  --logging_strategy="steps" \
  --lr_scheduler_type="cosine" \
  --max_grad_norm="1" \
  --max_seq_len="8192" \
  --max_steps="12000" \
  --model_name_or_path="ibm-granite/granite-20b-code-base-r1.1" \
  --optim="adamw_torch" \
  --output_dir="/modeling/checkpoints/foo/experiment1" \
  --packing="true" \
  --per_device_train_batch_size="8" \
  --save_steps="500" \
  --save_strategy="steps" \
  --split_batches="true" \
  --torch_dtype="bfloat16" \
  --tracker="aim" \
  --use_flash_attn="true" \
  --use_reentrant="true" \
  --warmup_ratio="0.1" \
  --warmup_steps="500" \
  --weight_decay="0.01" \
  --log_level="debug"

Actual behaviour

INFO:aimstack_tracker.py:Aimstack tracker run hash id dumped to /modeling/checkpoints/foo/experiment1/aimstack_tracker.json
/home/tuning/.local/lib/python3.11/site-packages/trl/trainer/sft_trainer.py:431: UserWarning: You passed `packing=True` to the SFTTrainer/SFTConfig, and you are training your model with `max_steps` strategy. The dataset will be iterated until the `max_steps` are reached.
  warnings.warn(
Currently training with a batch size of: 8
ERROR:sft_trainer.py:Traceback (most recent call last):
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/sft_trainer.py", line 582, in main
    trainer = train(
              ^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/tuning/sft_trainer.py", line 381, in train
    trainer.train()
  File "/home/tuning/.local/lib/python3.11/site-packages/trl/trainer/sft_trainer.py", line 450, in train
    output = super().train(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 1938, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/transformers/trainer.py", line 2085, in _inner_training_loop
    self.model = self.accelerator.prepare(self.model)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/accelerator.py", line 1326, in prepare
    result = tuple(
             ^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/accelerator.py", line 1327, in <genexpr>
    self._prepare_one(obj, first_pass=True, device_placement=d) for obj, d in zip(args, device_placement)
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/accelerator.py", line 1200, in _prepare_one
    return self.prepare_model(obj, device_placement=device_placement)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/accelerator.py", line 1468, in prepare_model
    self.state.fsdp_plugin.set_auto_wrap_policy(model)
  File "/home/tuning/.local/lib/python3.11/site-packages/accelerate/utils/dataclasses.py", line 1551, in set_auto_wrap_policy
    raise ValueError(f"Could not find the transformer layer class {layer_class} in the model.")
ValueError: Could not find the transformer layer class G in the model.
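The single-letter class name in the error suggests the configured transformer layer class string is being iterated character by character instead of as a list of class names. A minimal illustration of that failure mode (illustrative only, not accelerate's actual source; the layer class name GPTBigCodeBlock is an assumption for this Granite code model):

# Iterating a bare string yields individual characters, so the first lookup
# would be for a class named "G", which does not exist in the model.
layer_cls_to_wrap = "GPTBigCodeBlock"      # assumed layer class name, for illustration
for layer_class in layer_cls_to_wrap:      # yields 'G', 'P', 'T', ...
    print(layer_class)

# The intended behaviour iterates a list of class names instead:
for layer_class in ["GPTBigCodeBlock"]:    # yields "GPTBigCodeBlock"
    print(layer_class)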

Expected behaviour

There should not be any crash during training.

HarikrishnanBalagopal commented 1 month ago

Fixed in https://github.com/foundation-model-stack/fms-hf-tuning/pull/329 by downgrading accelerate to 0.33.0

HarikrishnanBalagopal commented 1 month ago

> Fixed in https://github.com/foundation-model-stack/fms-hf-tuning/pull/329 by downgrading accelerate to 0.33.0

The fix was only made to the release branch. The main branch still has the issue:

https://github.com/foundation-model-stack/fms-hf-tuning/blob/5c09dbc9d38e9479a7f720e9d6b316243a128343/pyproject.toml#L30
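One way such a constraint could exclude the broken release is a specifier like accelerate>=0.20.3,!=0.34.0 (illustrative only; the actual dependency lines are in the linked PRs). A quick check with the packaging library:

# Illustrative check that a hypothetical constraint excludes the broken release.
from packaging.specifiers import SpecifierSet
from packaging.version import Version

spec = SpecifierSet(">=0.20.3,!=0.34.0")   # hypothetical constraint, not the repo's actual line
print(Version("0.33.0") in spec)           # True  -> allowed
print(Version("0.34.0") in spec)           # False -> excluded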

willmj commented 3 weeks ago

Fixed by PR #355