JingXuTHU / Random-Masking-Finds-Winning-Tickets-for-Parameter-Efficient-Fine-tuning


Question #2

Open Bing-a-ling7 opened 1 week ago

Bing-a-ling7 commented 1 week ago

Thanks for your excellent work! I encountered a bug when running MODEL=facebook/opt-1.3b TASK=RTE EPOCH=5 MODE=random_masking LR=1e-2 MASKING_PROB=0.9999 LOCAL_HOST=0 SEED=0 bash run.sh:

Traceback (most recent call last):
  File "/mnt/workspace/code/random-mask/Random-Masking-Finds-Winning-Tickets-for-Parameter-Efficient-Fine-tuning/run.py", line 483, in <module>
    main()
  File "/mnt/workspace/code/random-mask/Random-Masking-Finds-Winning-Tickets-for-Parameter-Efficient-Fine-tuning/run.py", line 470, in main
    framework.train(train_samples, dev_samples if dev_samples is not None else eval_samples)
  File "/mnt/workspace/code/random-mask/Random-Masking-Finds-Winning-Tickets-for-Parameter-Efficient-Fine-tuning/run.py", line 409, in train
    trainer.train(resume_from_checkpoint=last_checkpoint)
  File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 2123, in train
    return inner_training_loop(
           ^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/transformers/trainer.py", line 2516, in _inner_training_loop
    _grad_norm = self.accelerator.clip_grad_norm_(
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/accelerate/accelerator.py", line 2396, in clip_grad_norm_
    self.unscale_gradients()
  File "/usr/local/lib/python3.11/site-packages/accelerate/accelerator.py", line 2340, in unscale_gradients
    self.scaler.unscale_(opt)
  File "/usr/local/lib/python3.11/site-packages/torch/amp/grad_scaler.py", line 338, in unscale_
    optimizer_state["found_inf_per_device"] = self._unscale_grads_(
                                              ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.11/site-packages/torch/amp/grad_scaler.py", line 260, in _unscale_grads_
    raise ValueError("Attempting to unscale FP16 gradients.")
ValueError: Attempting to unscale FP16 gradients.

Could you help me with this problem?
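
For reference, this error usually means that the parameters being optimized are themselves stored in fp16 while the Trainer's AMP grad scaler (fp16=True) is active, so the scaler refuses to unscale them. A minimal sketch of the usual workaround, assuming the standard transformers Trainer API (the names below are illustrative, not the repository's actual configuration):

    import torch
    from transformers import AutoModelForCausalLM, TrainingArguments

    # Keep the master weights in fp32 and let mixed precision (autocast +
    # GradScaler) handle the fp16 casting; fp16 master weights are what trigger
    # "Attempting to unscale FP16 gradients".
    model = AutoModelForCausalLM.from_pretrained(
        "facebook/opt-1.3b",
        torch_dtype=torch.float32,
    )

    args = TrainingArguments(
        output_dir="out",
        fp16=True,  # AMP mixed precision; bf16=True is another option on supported GPUs
    )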

JingXuTHU commented 1 week ago

Hi, thanks for your interest in our work. Can you provide your detailed setup, i.e., the system and installed packages, so that I can reproduce the problem?

Bing-a-ling7 commented 1 week ago

> Hi, thanks for your interest in our work. Can you provide your detailed setup, i.e., the system and installed packages, so that I can reproduce the problem?

Thank you, I fixed it. It was caused by the transformers version; I changed it to 4.44.2. But there is another problem. When I run MODEL=facebook/opt-1.3b TASK=RTE EPOCH=5 MODE=random_masking LR=1e-2 MASKING_PROB=0.9999 LOCAL_HOST=0 SEED=0 bash run.sh, I get the following log:

2024-11-18 17:38:06,929 - INFO - true masking prob: 0.9993464558919272
2024-11-18 17:38:15,456 - INFO - Train set 0 has 1000 training samples, 500 dev samples, and 277 eval samples
2024-11-18 17:38:15,457 - INFO - Tokenizing training samples...
2024-11-18 17:38:18,603 - INFO - Done with 3.15s
2024-11-18 17:38:19,102 - WARNING - Detected kernel version 4.19.24, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
2024-11-18 17:38:19,310 - INFO - There are 0 training samples and 277 validation samples

It seems there is no training at all. Could you explain why?

Bing-a-ling7 commented 1 week ago

And another issue: why does using your method to mask out 99.9999% of the weight matrix result in memory usage almost the same as that of full fine-tuning (FFT)? I would appreciate your response!

Thank you!

JingXuTHU commented 1 week ago

Hi, thanks for the questions.

> Thank you, I fixed it. It was caused by the transformers version; I changed it to 4.44.2. But there is another problem. When I run the command, I get the following log:
>
> 2024-11-18 17:38:19,310 - INFO - There are 0 training samples and 277 validation samples
>
> It seems there is no training at all. Could you explain why?

Can you try a different dataset and see if this message appears again? You can also check the training log and the test accuracy to see whether training actually happens. A possible explanation is that this log line refers to the test data rather than the training data; I see a similar line when evaluating the model.

> And another issue: why does using your method to mask out 99.9999% of the weight matrix result in memory usage almost the same as that of full fine-tuning (FFT)? I would appreciate your response! Thank you!

In my experiments, the memory cost of Random Masking is far less than that of full fine-tuning. Can you provide your detailed setup and memory log so that I can reproduce it? Also, can you try different datasets and models to see whether this happens for all setups? Thanks~
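
For producing such a memory log, a minimal sketch using PyTorch's built-in CUDA memory statistics could look like the following (illustrative only; trainer stands in for whatever object run.py constructs):

    import torch

    # Reset the peak-memory counter right before training, then read it afterwards.
    torch.cuda.reset_peak_memory_stats()

    trainer.train()

    peak_gib = torch.cuda.max_memory_allocated() / 1024 ** 3
    print(f"peak allocated GPU memory: {peak_gib:.2f} GiB")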

Bing-a-ling7 commented 1 week ago

> In my experiments, the memory cost of Random Masking is far less than that of full fine-tuning. Can you provide your detailed setup and memory log so that I can reproduce it? Also, can you try different datasets and models to see whether this happens for all setups? Thanks~

Thank you for your response. I fixed it by changing my model to Llama and upgrading my transformers version to the latest release.

Bing-a-ling7 commented 1 week ago

I have another question. In your work, you train the masked model and evaluate it immediately after training, but I want to save the model after each epoch. When I load from the checkpoint, it shows:

Some weights of the model checkpoint at /mnt/workspace/code/MyExp/output/finetune/best were not used when initializing LlamaForCausalLM: ['model.layers.10.mlp.down_proj.base_Linear.weight', 'model.layers.10.mlp.down_proj.col_indices', 'model.layers.10.mlp.down_proj.row_indices', 'model.layers.10.mlp.down_proj.row_offsets', 'model.layers.10.mlp.down_proj.tunable_weights', 'model.layers.10.mlp.gate_proj.base_Linear.weight', 'model.layers.10.mlp.gate_proj.col_indices', 'model.layers.10.mlp.gate_proj.row_indices', 'model.layers.10.mlp.gate_proj.row_offsets', 'model.layers.10.mlp.gate_proj.tunable_weights', 'model.layers.10.mlp.up_proj.base_Linear.weight', 'model.layers.10.mlp.up_proj.col_indices', 'model.layers.10.mlp.up_proj.row_indices', 'model.layers.10.mlp.up_proj.row_offsets', 'model.layers.10.mlp.up_proj.tunable_weights', 'model.layers.10.self_attn.k_proj.base_Linear.weight', 'model.layers.10.self_attn.k_proj.col_indices', 'model.layers.10.self_attn.k_proj.row_indices', 'model.layers.10.self_attn.k_proj.row_offsets', 'model.layers.10.self_attn.k_proj.tunable_weights', 'model.layers.10.self_attn.o_proj.base_Linear.weight', 'model.layers.10.self_attn.o_proj.col_indices', 'model.layers.10.self_attn.o_proj.row_indices', 'model.layers.10.self_attn.o_proj.row_offsets', 'model.layers.10.self_attn.o_proj.tunable_weights', 'model.layers.10.self_attn.q_proj.base_Linear.weight', 'model.layers.10.self_attn.q_proj.col_indices', 'model.layers.10.self_attn.q_proj.row_indices', 'model.layers.10.self_attn.q_proj.row_offsets', 'model.layers.10.self_attn.q_proj.tunable_weights', 'model.layers.10.self_attn.v_proj.base_Linear.weight', 'model.layers.10.self_attn.v_proj.col_indices', 'model.layers.10.self_attn.v_proj.row_indices', 'model.layers.10.self_attn.v_proj.row_offsets', 'model.layers.10.self_attn.v_proj.tunable_weights', 'model.layers.11.mlp.down_proj.base_Linear.weight', 'model.layers.11.mlp.down_proj.col_indices', 'model.layers.11.mlp.down_proj.row_indices', 'model.layers.11.mlp.down_proj.row_offsets', 'model.layers.11.mlp.down_proj.tunable_weights', 'model.layers.11.mlp.gate_proj.base_Linear.weight', 'model.layers.11.mlp.gate_proj.col_indices', 'model.layers.11.mlp.gate_proj.row_indices', 'model.layers.11.mlp.gate_proj.row_offsets', 'model.layers.11.mlp.gate_proj.tunable_weights', 'model.layers.11.mlp.up_proj.base_Linear.weight', 'model.layers.11.mlp.up_proj.col_indices', 'model.layers.11.mlp.up_proj.row_indices', 'model.layers.11.mlp.up_proj.row_offsets', 'model.layers.11.mlp.up_proj.tunable_weights', 'model.layers.11.self_attn.k_proj.base_Linear.weight', 'model.layers.11.self_attn.k_proj.col_indices', 'model.layers.11.self_attn.k_proj.row_indices', 'model.layers.11.self_attn.k_proj.row_offsets', 'model.layers.11.self_attn.k_proj.tunable_weights', 'model.layers.11.self_attn.o_proj.base_Linear.weight', 'model.layers.11.self_attn.o_proj.col_indices', 'model.layers.11.self_attn.o_proj.row_indices', 'model.layers.11.self_attn.o_proj.row_offsets', 'model.layers.11.self_attn.o_proj.tunable_weights', 
...]
- This IS expected if you are initializing LlamaForCausalLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing LlamaForCausalLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).

It seems the architecture of the masked model is not being saved. My saving code is here:

    unwrapped_model = accelerator.unwrap_model(model)
    unwrapped_model.save_pretrained(save_path, is_main_process=accelerator.is_main_process, save_function=accelerator.save, state_dict=accelerator.get_state_dict(model))
    tokenizer.save_pretrained(save_path)

Can you help me with this?
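
As an aside, one thing worth experimenting with (not part of the repository's pipeline) is saving only the trainable parameters, since they are a tiny fraction of the model and this avoids relying on from_pretrained to understand the custom module names. A rough sketch in the same context as the snippet above:

    import torch  # if not already imported

    # Collect only the parameters that were actually trained, reusing the same
    # `unwrapped_model` and `save_path` as in the snippet above.
    # Note: any index buffers (row_indices, col_indices, row_offsets, ...) would
    # still need to be reproducible (e.g. from the seed) or saved alongside.
    trainable_state = {
        name: param.detach().cpu()
        for name, param in unwrapped_model.named_parameters()
        if param.requires_grad
    }
    torch.save(trainable_state, f"{save_path}/trainable_params.pt")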

JingXuTHU commented 2 days ago

Hi, how do you load the saved model? Also, are you using a customized training pipeline?
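
For context, the warning means the checkpoint's state dict contains parameters from the custom masked-linear modules (base_Linear.weight, tunable_weights, row_indices, col_indices, row_offsets) that a vanilla LlamaForCausalLM does not define, so from_pretrained silently drops them. One way to reload such a checkpoint, sketched below, is to rebuild the masked architecture first and then load the weights into it; apply_random_masking is a hypothetical stand-in for whatever function in this repository wraps the Linear layers, and the paths and names are illustrative:

    import torch
    from transformers import AutoModelForCausalLM

    BASE_MODEL = "path-or-hub-id-of-the-base-llama-model"  # the model that was fine-tuned
    CKPT_DIR = "/mnt/workspace/code/MyExp/output/finetune/best"

    # 1) Rebuild the trained architecture: start from the base model and re-apply
    #    the masking wrappers (same masking_prob and seed as during training) so
    #    the custom parameter names exist on the module tree.
    model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)
    model = apply_random_masking(model, masking_prob=0.9999, seed=0)  # hypothetical helper

    # 2) Load the saved weights directly, instead of calling from_pretrained on the
    #    checkpoint directory (which drops keys it does not recognize). The file may
    #    be pytorch_model.bin or model.safetensors depending on the transformers version.
    state_dict = torch.load(f"{CKPT_DIR}/pytorch_model.bin", map_location="cpu")
    missing, unexpected = model.load_state_dict(state_dict, strict=False)
    print("missing keys:", len(missing), "unexpected keys:", len(unexpected))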