Closed coder4nlp closed 9 months ago
Hi @coder4nlp,
We have re-cloned this repository and re-run the experiment using the default parameters. The results were satisfactory. In particular, the figure below displays the ontonotesv5 result of "unllama" after epoch 1, with eval_f1=90.19. As the number of epochs increases, the results continue to improve. The performance at epoch 1 already significantly surpasses the results you reported.
To reproduce our results, please ensure that you run the correct file for "unllama" and use our default parameters.
Hi @coder4nlp,
We have also tested the code of unllama for ontonotesv5 on another computer. The results also show good performance after one epoch. The F1 can reach 90%. See if you could provide more details about your implementation so that we can identify the cause. Thanks!
Thank you very much for your replies. But my log is different. From your log, it seems that the learning rate is set differently? Is your learning rate 10e-5?
20%|█▉ | 14981/74910 [28:43<1:48:42, 9.19it/s]
{'eval_loss': 0.08873207122087479, 'eval_precision': 0.8363836824696803, 'eval_recall': 0.6870132222423474, 'eval_f1': 0.7543754972155926, 'eval_accuracy': 0.9729901217581801, 'eval_runtime': 48.7048, 'eval_samples_per_second': 169.634, 'eval_steps_per_second': 21.209, 'epoch': 1.0}
{'loss': 0.087, 'learning_rate': 8.998798558269924e-05, 'epoch': 1.0}
{'loss': 0.0763, 'learning_rate': 8.932051795487919e-05, 'epoch': 1.07}
{'loss': 0.0779, 'learning_rate': 8.865305032705914e-05, 'epoch': 1.13}
{'loss': 0.0718, 'learning_rate': 8.798558269923909e-05, 'epoch': 1.2}
{'loss': 0.0791, 'learning_rate': 8.731811507141903e-05, 'epoch': 1.27}
{'loss': 0.0783, 'learning_rate': 8.665064744359898e-05, 'epoch': 1.33}
{'loss': 0.0786, 'learning_rate': 8.598317981577893e-05, 'epoch': 1.4}
{'loss': 0.0751, 'learning_rate': 8.531571218795888e-05, 'epoch': 1.47}
{'loss': 0.0745, 'learning_rate': 8.464824456013883e-05, 'epoch': 1.54}
{'loss': 0.074, 'learning_rate': 8.398077693231878e-05, 'epoch': 1.6}
{'loss': 0.0768, 'learning_rate': 8.331330930449873e-05, 'epoch': 1.67}
{'loss': 0.0764, 'learning_rate': 8.264584167667868e-05, 'epoch': 1.74}
{'loss': 0.0748, 'learning_rate': 8.197837404885863e-05, 'epoch': 1.8}
{'loss': 0.0766, 'learning_rate': 8.131090642103857e-05, 'epoch': 1.87}
{'loss': 0.0755, 'learning_rate': 8.064343879321852e-05, 'epoch': 1.94}
Hi @coder4nlp! Thanks for your information. May I know how you start running the code, i.e., the command to start the python program? Btw, the learning rate is set to 1e-4 by default.
I followed the learning rate set in the paper:
"We set the batch size to 8 and initial learning rate to 8e-5 using grid search."
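(Editor's note: part of the confusion here may simply be notation. "10e-5" and "1e-4" denote the same number, while the paper's 8e-5 is a genuinely different setting. A quick Python check:)

```python
# "10e-5" means 10 x 10^-5 = 0.0001, which is exactly the same float as 1e-4.
# The paper's 8e-5 (0.00008) is a genuinely different learning rate.
print(10e-5 == 1e-4)  # same value
print(8e-5 == 1e-4)   # different value
```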
I think the parameter setting would not cause such a big difference. I meant the command you type into the command prompt to start the program. For example, we use
python unllama_token_clf.py ontonotesv5 7b
to initiate the training. Just want to check with you that you are also using python unllama_token_clf.py instead of python llama_token_clf.py. Thanks!
Thank you. I used the same command as you.
CUDA_VISIBLE_DEVICES=1 python unllama_token_clf.py ontonotesv5 7b
I see. It's interesting and something we have not encountered. Could you please help us confirm if this situation happens every time? I would appreciate it if you could send us the whole training progress for our troubleshooting. Many thanks.
Another hypothesis is a difference at the hardware level. We have tested on 4090, A100, and A800, and have not seen this situation happen. We would be grateful if you could provide us with your device information.
Hi @csroyli. Sadly, there seems to be no difference.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
handling task ontonotesv5
Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00, 5.71s/it]
Some weights of UnmaskingLlamaForTokenClassification were not initialized from the model checkpoint at test/Llama-2-7b-hf and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
trainable params: 6,447,142 || all params: 6,613,790,758 || trainable%: 0.0974802837873511
Map: 100%|██████████| 59924/59924 [00:04<00:00, 14734.61 examples/s]
Map: 100%|██████████| 8528/8528 [00:00<00:00, 14905.07 examples/s]
Map: 100%|██████████| 8262/8262 [00:00<00:00, 16682.27 examples/s]
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
0%| | 0/74910 [00:00<?, ?it/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
10%|▉ | 7490/74910 [13:15<1:59:13, 9.42it/s]
{'loss': 0.3057, 'learning_rate': 9.933253237217995e-05, 'epoch': 0.07}
{'loss': 0.161, 'learning_rate': 9.86650647443599e-05, 'epoch': 0.13}
{'loss': 0.1376, 'learning_rate': 9.799759711653985e-05, 'epoch': 0.2}
{'loss': 0.116, 'learning_rate': 9.73301294887198e-05, 'epoch': 0.27}
{'loss': 0.1086, 'learning_rate': 9.666266186089975e-05, 'epoch': 0.33}
{'loss': 0.1014, 'learning_rate': 9.59951942330797e-05, 'epoch': 0.4}
{'loss': 0.099, 'learning_rate': 9.532772660525965e-05, 'epoch': 0.47}
{'loss': 0.098, 'learning_rate': 9.46602589774396e-05, 'epoch': 0.53}
{'loss': 0.0926, 'learning_rate': 9.399279134961955e-05, 'epoch': 0.6}
{'loss': 0.0888, 'learning_rate': 9.33253237217995e-05, 'epoch': 0.67}
{'loss': 0.0905, 'learning_rate': 9.265785609397944e-05, 'epoch': 0.73}
{'loss': 0.0929, 'learning_rate': 9.199038846615939e-05, 'epoch': 0.8}
{'loss': 0.0923, 'learning_rate': 9.132292083833934e-05, 'epoch': 0.87}
{'loss': 0.0893, 'learning_rate': 9.065545321051929e-05, 'epoch': 0.93}
20%|█▉ | 14981/74910 [27:18<1:46:06, 9.41it/s]
{'eval_loss': 0.08792682737112045, 'eval_precision': 0.8338434630520333, 'eval_recall': 0.6908168809998189, 'eval_f1': 0.7556215948489351, 'eval_accuracy': 0.9731036647676041, 'eval_runtime': 48.3611, 'eval_samples_per_second': 170.8
Many thanks @coder4nlp. Just a guess: from the training log I see that Flash Attention is used. Would you please make another attempt with Flash Attention disabled? We did not use Flash Attention to accelerate training, and we are not sure how it will affect the results.
GPU: NVIDIA A100-SXM4-80GB
transformers==4.34.1
torch==2.0.1
Hi @csroyli, surprising news!! When I updated transformers to version 4.35, the results became normal!!
20%|██ | 14982/74910 [31:48<1:57:43, 8.48it/s]
{'eval_loss': 0.04656170681118965, 'eval_precision': 0.9074004153294317, 'eval_recall': 0.8705850389422206, 'eval_f1': 0.888611573303753, 'eval_accuracy': 0.986174468852481, 'eval_runtime': 55.676, 'eval_samples_per_second': 148.394, 'eval_steps_per_second': 18.554, 'epoch': 1.0}
{'loss': 0.0416, 'learning_rate': 8.998798558269924e-05, 'epoch': 1.0}
{'loss': 0.0353, 'learning_rate': 8.932051795487919e-05, 'epoch': 1.07}
{'loss': 0.0339, 'learning_rate': 8.865305032705914e-05, 'epoch': 1.13}
{'loss': 0.0316, 'learning_rate': 8.798558269923909e-05, 'epoch': 1.2}
{'loss': 0.0339, 'learning_rate': 8.731811507141903e-05, 'epoch': 1.27}
{'loss': 0.0345, 'learning_rate': 8.665064744359898e-05, 'epoch': 1.33}
Hi @coder4nlp. Great! But I would still suggest running the file in an environment without Flash Attention. Possibly this update overwrites some Flash Attention settings and makes the code work. Flash Attention makes changes at a fairly low level of the computation, which may lead to unexpected behavior (just as the Flash Attention warning says: "You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour").
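(Editor's note: a minimal sketch of how Flash Attention 2 can be toggled when loading the model, assuming the `use_flash_attention_2` flag accepted by `from_pretrained` in transformers 4.34/4.35. The helper function is illustrative and not the repository's actual code; the model path is the one from the log above.)

```python
# Sketch: build from_pretrained kwargs so that Flash Attention 2 is either
# disabled, or enabled together with an explicit half-precision dtype to
# avoid the "without specifying a torch dtype" warning seen in the log.
def build_load_kwargs(enable_flash_attention: bool) -> dict:
    if enable_flash_attention:
        # Flash Attention 2 expects fp16/bf16; passing a dtype silences the warning.
        return {"torch_dtype": "float16", "use_flash_attention_2": True}
    return {"use_flash_attention_2": False}

# Usage sketch (run the model on GPU after loading, as the warning suggests):
# model = UnmaskingLlamaForTokenClassification.from_pretrained(
#     "test/Llama-2-7b-hf", **build_load_kwargs(enable_flash_attention=False)
# ).to("cuda")
```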
@csroyli. Thank you very much! The transformers package I originally used had been modified by others, with some operators added for acceleration. That may have been the cause of the problem. Flash Attention 2 is officially supported in transformers 4.35.
@coder4nlp. Thanks for the updates. Glad to see our code works normally with Flash Attention. Please let us know if you have further comments!