Open randy-ac opened 2 months ago
Hey Randy, I think some variation is expected, as there can be slight differences in some numerical operations which build up along the way. It might be interesting to investigate at various steps at the beginning of training to see which operations are most impactful. It could also be interesting to check whether significant differences in output values occur at inference as well.
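For the inference check, something as simple as the following could already help (a minimal sketch; the dump file names are hypothetical, and it assumes each setup saves the logits of the same input batch with torch.save):

```python
import torch

# Hypothetical dumps produced by each setup for the same checkpoint and input,
# e.g. torch.save(logits.float().cpu(), "logits_1gpu.pt") in the 1-GPU run
# and torch.save(logits.float().cpu(), "logits_tp.pt") in the tensor_parallel run.
a = torch.load("logits_1gpu.pt")
b = torch.load("logits_tp.pt")

diff = (a - b).abs()
print("max abs diff :", diff.max().item())
print("mean abs diff:", diff.mean().item())
print("allclose(rtol=1e-3, atol=1e-5):", torch.allclose(a, b, rtol=1e-3, atol=1e-5))
```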
Probably unrelated to your issue, but bear in mind that Llama 3 uses RoPE scaling, which is not implemented in Eole yet.
Hello both,
thanks for your replies. I will check your suggestions as soon as possible and will keep you posted.
Hello, thanks for your answers. @randy-ac will run and compare the single-GPU and tensor-parallel modes over a longer run, without quantization or dropout, to avoid “spurious” differences.
Hello everyone, I'm finally back to this topic with the results of some experiments we carried out. We trained two models for 1000 steps: one on 2 GPUs in tensor_parallel mode and another on a single GPU. The task is domain classification. The two models had exactly the same configs. We removed quantization and dropout to avoid introducing other variables into the experiment. Please see the configs attached. We still found that the two models diverge in validation accuracy, output values for the same checkpoint, the LM decoder forward, and checkpoint sizes. In general, the model trained with tensor_parallel seems to achieve worse performance.
TensorBoard logs (the red curve is the run with tensor_parallel)
Output values: We tested checkpoint 400 of both models. There is a ~18% difference between the two accuracy values (i.e. whether the predicted label matches the gold label). Please find attached a TSV with the outputs of each model.
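For reference, the accuracy figure is a plain label match between prediction and gold. A minimal sketch of how it can be computed (the file and column names here are assumptions; the attached file may be laid out differently):

```python
import csv

def accuracy(path, pred_col="prediction", gold_col="gold"):
    """Fraction of rows where the predicted label equals the gold label."""
    hits = total = 0
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            hits += row[pred_col].strip() == row[gold_col].strip()
            total += 1
    return hits / total

print("1 GPU          :", accuracy("outputs_1gpu.tsv"))
print("tensor_parallel:", accuracy("outputs_tensor_parallel.tsv"))
```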
Different values in decoder forward: For each layer, we printed out the norm of the layer input and of the attention output. Some differences start to build up at layer 3 in the first step (i.e. before the first backward pass); a sketch of one way to obtain such printouts follows the dumps below:
1 GPU
Layer nr 3
Layer_in norm: 389.75
norm_layer_in Euclidean Distance to zero: 691.5
attn_output Euclidean Distance to zero: 18.78125
Layer nr 4
Layer_in norm: 391.25
norm_layer_in Euclidean Distance to zero: 620.0

Tensor parallel
Layer nr 3
Layer_in norm: 389.75
norm_layer_in Euclidean Distance to zero: 691.5
attn_output Euclidean Distance to zero: 18.796875
Layer nr 4
Layer_in norm: 391.5
norm_layer_in Euclidean Distance to zero: 620.0
attn_output Euclidean Distance to zero: 24.359375
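A minimal sketch of how such printouts can be produced with forward hooks; the attribute names (model.decoder.transformer_layers, layer.self_attn) are assumptions and need to be adapted to the actual Eole module structure:

```python
import torch

def register_norm_probes(model):
    """Print the L2 norm of each decoder layer's input and of its
    self-attention output on every forward pass."""
    def layer_probe(idx):
        def hook(module, args, output):
            # args[0] is assumed to be the layer input (hidden states)
            print(f"Layer nr {idx} | layer_in norm: {args[0].norm().item():.4f}")
        return hook

    def attn_probe(idx):
        def hook(module, args, output):
            out = output[0] if isinstance(output, tuple) else output
            print(f"Layer nr {idx} | attn_output norm: {out.norm().item():.4f}")
        return hook

    # Assumed attribute names; adapt to the real model.
    for idx, layer in enumerate(model.decoder.transformer_layers):
        layer.register_forward_hook(layer_probe(idx))
        layer.self_attn.register_forward_hook(attn_probe(idx))
```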
Checkpoint size: The sizes (KB) of the 400th checkpoint for the parallel_mode model are:
5         llama3-8b-instruct-parallel-eole-test-long/step_400/config.json
15700311  llama3-8b-instruct-parallel-eole-test-long/step_400/merged
15397     llama3-8b-instruct-parallel-eole-test-long/step_400/model.00.safetensors
64753     llama3-8b-instruct-parallel-eole-test-long/step_400/optimizer.pt
2069      llama3-8b-instruct-parallel-eole-test-long/step_400/vocab.json

The sizes (KB) of the 400th checkpoint for the 1 GPU model are:
5         llama3-8b-instruct-1gpu-eole-test-long/step_400/config.json
15700324  llama3-8b-instruct-1gpu-eole-test-long/step_400/merged
13349     llama3-8b-instruct-1gpu-eole-test-long/step_400/model.00.safetensors
80137     llama3-8b-instruct-1gpu-eole-test-long/step_400/optimizer.pt
2069      llama3-8b-instruct-1gpu-eole-test-long/step_400/vocab.json
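If it helps narrow this down, the two model.00.safetensors files listed above can be diffed at the tensor level (a minimal sketch; it assumes the checkpoints are plain safetensors files readable with the safetensors library):

```python
from safetensors import safe_open

def summary(path):
    """Map tensor name -> (shape, dtype) for one safetensors file."""
    out = {}
    with safe_open(path, framework="pt", device="cpu") as f:
        for k in f.keys():
            t = f.get_tensor(k)
            out[k] = (tuple(t.shape), str(t.dtype))
    return out

a = summary("llama3-8b-instruct-1gpu-eole-test-long/step_400/model.00.safetensors")
b = summary("llama3-8b-instruct-parallel-eole-test-long/step_400/model.00.safetensors")

print("only in 1 GPU          :", sorted(a.keys() - b.keys()))
print("only in tensor_parallel:", sorted(b.keys() - a.keys()))
for k in sorted(a.keys() & b.keys()):
    if a[k] != b[k]:
        print(f"{k}: 1gpu={a[k]} vs tp={b[k]}")
```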
Could you please advise? Thanks!
Attachments: output.csv, tensor_parallel_model_configs.json, 1gpu_model_configs.json
@randy-ac, are you seeing this in your log while training on 2 GPUs with tensor_parallel?
/home/vincent/miniconda3/envs/pt2.3/lib/python3.11/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905969073/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
/home/vincent/miniconda3/envs/pt2.3/lib/python3.11/site-packages/torch/autograd/graph.py:744: UserWarning: c10d::allreduce_: an autograd kernel was not registered to the Autograd key(s) but we are trying to backprop through it. This may lead to silently incorrect behavior. This behavior is deprecated and will be removed in a future version of PyTorch. If your operator is differentiable, please ensure you have registered an autograd kernel to the correct Autograd key (e.g. DispatchKey::Autograd, DispatchKey::CompositeImplicitAutograd). If your operator is not differentiable, or to squash this warning and use the previous behavior, please register torch::CppFunction::makeFallthrough() to DispatchKey::Autograd. (Triggered internally at /opt/conda/conda-bld/pytorch_1716905969073/work/torch/csrc/autograd/autograd_not_implemented_fallback.cpp:63.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
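For context, that warning says an all_reduce issued in the forward pass has no autograd kernel registered, so gradients may silently not flow through it. A common way to make the collective differentiable (a generic Megatron-style sketch, not necessarily what the actual fix does) is to wrap it in a custom autograd.Function:

```python
import torch
import torch.distributed as dist

class ReduceFromModelParallel(torch.autograd.Function):
    """All-reduce the activations in the forward pass; pass the gradient
    through unchanged in the backward pass (it is already complete on
    every rank for this kind of tensor-parallel layer output)."""

    @staticmethod
    def forward(ctx, x):
        dist.all_reduce(x)  # sum the partial outputs from all ranks
        return x

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # identity: no collective needed in backward

def reduce_from_model_parallel(x):
    return ReduceFromModelParallel.apply(x)
```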
Can you git pull and try again?
Hello Vincent. Thanks for the feedback! I confirm I had that UserWarning in my logs when fine-tuning in tensor_parallel mode. I will re-run my tests on #116 and keep you posted.
Hello,
I re-tested using commit #116 and here are the results.
I confirm that the UserWarning about c10d::allreduce_ no longer appears in the logs.
In TensorBoard (the red curve is the run fine-tuned in tensor_parallel mode), the two lines are closer to each other than in the previous tests.
However, we still observe the following differences between the model trained on 1 GPU and the one trained on 2 GPUs:
Accuracy: There is still a difference between the accuracy reached by the two models (i.e. whether the predicted label matches the gold label). The model trained on 1 GPU reaches an accuracy ~14% higher than the tensor_parallel model.
Checkpoint size: The sizes (in KB) of the 400th checkpoints of the two models are still quite different:
1 GPU
15700315  /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-test-fix/step_400/merged
13349     /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-test-fix/step_400/model.00.safetensors
80137     /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-test-fix/step_400/optimizer.pt
2069      /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-1gpu-eole-test-fix/step_400/vocab.json

Tensor parallel
15700315  /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-test-fix/step_400/merged
15397     /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-test-fix/step_400/model.00.safetensors
64761     /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-test-fix/step_400/optimizer.pt
2069      /nas-labs/LM/randy_LLM_exp/domain_classification_eole/llama3_eole/llama3-8b-instruct-parallel-eole-test-fix/step_400/vocab.json
Different values in decoder forward: For each layer, we printed out the norm of the layer input and the attention output. There is still a small difference in layer 3.

Parallel mode
Layer nr 3
Layer_in norm: 390.0
norm_layer_in Euclidean Distance to zero: 719.5
attn_output Euclidean Distance to zero: 19.40625
Layer nr 3
Layer_in norm: 390.0
norm_layer_in Euclidean Distance to zero: 719.5
attn_output Euclidean Distance to zero: 19.40625

1 GPU
Layer nr 3
Layer_in norm: 389.75
norm_layer_in Euclidean Distance to zero: 719.5
attn_output Euclidean Distance to zero: 19.40625
The gap closes at layer 20; from then on, the printed values are the same for the two modes until the end of training.
I am attaching the two fine-tuning configs.
Thanks!
Attachments: tensor_parallel_config_latest.json, 1gpu_config_latest.json
There might still be a bug somewhere, but it is not easy to track down without a step-by-step comparison of each operation (maybe using a single example and some printouts).
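One way to make that step-by-step comparison less painful (a minimal sketch; the trace file names and the single-example setup are assumptions) is to record the output norm of every module with forward hooks and diff the two traces offline:

```python
import torch

def trace_module_norms(model, record):
    """Register hooks that append (module_name, output_norm) to `record`
    on every forward call."""
    def make_hook(name):
        def hook(module, args, output):
            out = output[0] if isinstance(output, (tuple, list)) else output
            if torch.is_tensor(out):
                record.append((name, out.float().norm().item()))
        return hook
    for name, module in model.named_modules():
        module.register_forward_hook(make_hook(name))

# In each setup: run the same single example once, save the trace, e.g.
#   torch.save(record, "trace_1gpu.pt")  or  torch.save(record, "trace_tp.pt"),
# then diff the two traces entry by entry:
ref = torch.load("trace_1gpu.pt")
tp = torch.load("trace_tp.pt")
for (name, a), (_, b) in zip(ref, tp):
    if abs(a - b) > 1e-3:
        print(f"{name}: {a:.6f} vs {b:.6f}")
```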
Hello,
We have noticed some unexpected behavior when fine-tuning a Llama 3 model on 1 GPU compared to fine-tuning the same model on the same dataset with 2 GPUs in parallel mode. See the attached TensorBoard graphs (red = run with parallel mode). The minimum validation perplexity differs between the two runs.
As you can see from the configs I am pasting below, the only parameters that differ between the runs are: world_size, gpu_rank and parallel_mode.
Could you please advise?
Configs for run with 1 GPU
Configs for run with parallel_mode