Closed coder4nlp closed 9 months ago
Hi @coder4nlp,
We have re-cloned this repository and re-run the experiment using the default parameters. The results were satisfactory. In particular, the figure below displays the ontonotesv5 result of "unllama" after epoch 1, with eval_f1=90.19. As the number of epochs increases, the results continue to improve. The performance at epoch 1 already significantly surpasses the results you reported.
To reproduce our results, please ensure that you run the correct file for "unllama" and use our default parameters.
Hi @coder4nlp,
We have also tested the code of unllama for ontonotesv5 on another computer. The results also show good performance after one epoch. The F1 can reach 90%. See if you could provide more details about your implementation so that we can identify the cause. Thanks!
Thank you very much for your replies. But my log is different. From your log, it seems that the learning rate is set differently? Is your learning rate 10e-5?
20%|█▉ | 14981/74910 [28:43<1:48:42, 9.19it/s]
{'eval_loss': 0.08873207122087479, 'eval_precision': 0.8363836824696803, 'eval_recall': 0.6870132222423474, 'eval_f1': 0.7543754972155926, 'eval_accuracy': 0.9729901217581801, 'eval_runtime': 48.7048, 'eval_samples_per_second': 169.634, 'eval_steps_per_second': 21.209, 'epoch': 1.0}
{'loss': 0.087, 'learning_rate': 8.998798558269924e-05, 'epoch': 1.0}
{'loss': 0.0763, 'learning_rate': 8.932051795487919e-05, 'epoch': 1.07}
{'loss': 0.0779, 'learning_rate': 8.865305032705914e-05, 'epoch': 1.13}
{'loss': 0.0718, 'learning_rate': 8.798558269923909e-05, 'epoch': 1.2}
{'loss': 0.0791, 'learning_rate': 8.731811507141903e-05, 'epoch': 1.27}
{'loss': 0.0783, 'learning_rate': 8.665064744359898e-05, 'epoch': 1.33}
{'loss': 0.0786, 'learning_rate': 8.598317981577893e-05, 'epoch': 1.4}
{'loss': 0.0751, 'learning_rate': 8.531571218795888e-05, 'epoch': 1.47}
{'loss': 0.0745, 'learning_rate': 8.464824456013883e-05, 'epoch': 1.54}
{'loss': 0.074, 'learning_rate': 8.398077693231878e-05, 'epoch': 1.6}
{'loss': 0.0768, 'learning_rate': 8.331330930449873e-05, 'epoch': 1.67}
{'loss': 0.0764, 'learning_rate': 8.264584167667868e-05, 'epoch': 1.74}
{'loss': 0.0748, 'learning_rate': 8.197837404885863e-05, 'epoch': 1.8}
{'loss': 0.0766, 'learning_rate': 8.131090642103857e-05, 'epoch': 1.87}
{'loss': 0.0755, 'learning_rate': 8.064343879321852e-05, 'epoch': 1.94}
Hi @coder4nlp! Thanks for your information. May I know how you start running the code, i.e., the command to start the python program? Btw, the learning rate is set to 1e-4 by default.
I followed the learning rate set in the paper:
"We set the batch size to 8 and initial learning rate to 8e-5 using grid search."
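(Editor's note: part of the confusion here may simply be notation. "10e-5" and "1e-4" denote the same number, while the paper's 8e-5 is a genuinely different setting. A quick Python check:)

```python
# "10e-5" means 10 x 10^-5 = 0.0001, which is exactly the same float as 1e-4.
# The paper's 8e-5 (0.00008) is a genuinely different learning rate.
print(10e-5 == 1e-4)  # same value
print(8e-5 == 1e-4)   # different value
```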
I think the parameter setting would not cause such a big difference. I meant the command you type into the command prompt to start the program. For example, we use
python unllama_token_clf.py ontonotesv5 7b
to initiate the training. Just want to check with you that you are also using python unllama_token_clf.py instead of python llama_token_clf.py. Thanks!
Thank you. I used the same command as you.
CUDA_VISIBLE_DEVICES=1 python unllama_token_clf.py ontonotesv5 7b
I see. It's interesting and something we have not encountered. Could you please help us confirm if this situation happens every time? I would appreciate it if you could send us the whole training progress for our troubleshooting. Many thanks.
Another hypothesis is a difference at the hardware level. We have tested on 4090, A100, and A800, and have not seen this situation happen. We would be grateful if you could provide us with your device information.
Hi @csroyli. Sadly, there seems to be no difference.
You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour
You are attempting to use Flash Attention 2.0 with a model initialized on CPU. Make sure to move the model to GPU after initializing it on CPU with model.to('cuda').
handling task ontonotesv5
Loading checkpoint shards: 100%|██████████| 2/2 [00:11<00:00, 5.71s/it]
Some weights of UnmaskingLlamaForTokenClassification were not initialized from the model checkpoint at test/Llama-2-7b-hf and are newly initialized: ['classifier.bias', 'classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
trainable params: 6,447,142 || all params: 6,613,790,758 || trainable%: 0.0974802837873511
Map: 100%|██████████| 59924/59924 [00:04<00:00, 14734.61 examples/s]
Map: 100%|██████████| 8528/8528 [00:00<00:00, 14905.07 examples/s]
Map: 100%|██████████| 8262/8262 [00:00<00:00, 16682.27 examples/s]
Detected kernel version 3.10.0, which is below the recommended minimum of 5.5.0; this can cause the process to hang. It is recommended to upgrade the kernel to the minimum version or higher.
0%| | 0/74910 [00:00<?, ?it/s]
You're using a LlamaTokenizerFast tokenizer. Please note that with a fast tokenizer, using the __call__ method is faster than using a method to encode the text followed by a call to the pad method to get a padded encoding.
10%|▉ | 7490/74910 [13:15<1:59:13, 9.42it/s]
{'loss': 0.3057, 'learning_rate': 9.933253237217995e-05, 'epoch': 0.07}
{'loss': 0.161, 'learning_rate': 9.86650647443599e-05, 'epoch': 0.13}
{'loss': 0.1376, 'learning_rate': 9.799759711653985e-05, 'epoch': 0.2}
{'loss': 0.116, 'learning_rate': 9.73301294887198e-05, 'epoch': 0.27}
{'loss': 0.1086, 'learning_rate': 9.666266186089975e-05, 'epoch': 0.33}
{'loss': 0.1014, 'learning_rate': 9.59951942330797e-05, 'epoch': 0.4}
{'loss': 0.099, 'learning_rate': 9.532772660525965e-05, 'epoch': 0.47}
{'loss': 0.098, 'learning_rate': 9.46602589774396e-05, 'epoch': 0.53}
{'loss': 0.0926, 'learning_rate': 9.399279134961955e-05, 'epoch': 0.6}
{'loss': 0.0888, 'learning_rate': 9.33253237217995e-05, 'epoch': 0.67}
{'loss': 0.0905, 'learning_rate': 9.265785609397944e-05, 'epoch': 0.73}
{'loss': 0.0929, 'learning_rate': 9.199038846615939e-05, 'epoch': 0.8}
{'loss': 0.0923, 'learning_rate': 9.132292083833934e-05, 'epoch': 0.87}
{'loss': 0.0893, 'learning_rate': 9.065545321051929e-05, 'epoch': 0.93}
20%|█▉ | 14981/74910 [27:18<1:46:06, 9.41it/s]
{'eval_loss': 0.08792682737112045, 'eval_precision': 0.8338434630520333, 'eval_recall': 0.6908168809998189, 'eval_f1': 0.7556215948489351, 'eval_accuracy': 0.9731036647676041, 'eval_runtime': 48.3611, 'eval_samples_per_second': 170.8
Many thanks @coder4nlp. Just a guess: from the training log I see that Flash Attention is used. Would you please make another attempt with Flash Attention disabled? We did not use Flash Attention to accelerate training, and we are not sure how it will affect the results.
GPU: NVIDIA A100-SXM4-80GB
transformers==4.34.1
torch==2.0.1
Hi @csroyli, surprising news!! When I updated transformers to version 4.35, the results became normal!!
20%|██ | 14982/74910 [31:48<1:57:43, 8.48it/s]
{'eval_loss': 0.04656170681118965, 'eval_precision': 0.9074004153294317, 'eval_recall': 0.8705850389422206, 'eval_f1': 0.888611573303753, 'eval_accuracy': 0.986174468852481, 'eval_runtime': 55.676, 'eval_samples_per_second': 148.394, 'eval_steps_per_second': 18.554, 'epoch': 1.0}
{'loss': 0.0416, 'learning_rate': 8.998798558269924e-05, 'epoch': 1.0}
{'loss': 0.0353, 'learning_rate': 8.932051795487919e-05, 'epoch': 1.07}
{'loss': 0.0339, 'learning_rate': 8.865305032705914e-05, 'epoch': 1.13}
{'loss': 0.0316, 'learning_rate': 8.798558269923909e-05, 'epoch': 1.2}
{'loss': 0.0339, 'learning_rate': 8.731811507141903e-05, 'epoch': 1.27}
{'loss': 0.0345, 'learning_rate': 8.665064744359898e-05, 'epoch': 1.33}
Hi @coder4nlp. Great! But I would still suggest running the file in an environment without Flash Attention. Possibly this update overwrites some Flash Attention settings and makes the code work. Flash Attention makes changes at a fairly low level of the computation, which may lead to unexpected behavior (just as the Flash Attention warning says: "You are attempting to use Flash Attention 2.0 without specifying a torch dtype. This might lead to unexpected behaviour").
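(Editor's note: a minimal sketch of how Flash Attention 2 can be toggled when loading the model, assuming the `use_flash_attention_2` flag accepted by `from_pretrained` in transformers 4.34/4.35. The helper function is illustrative and not the repository's actual code; the model path is the one from the log above.)

```python
# Sketch: build from_pretrained kwargs so that Flash Attention 2 is either
# disabled, or enabled together with an explicit half-precision dtype to
# avoid the "without specifying a torch dtype" warning seen in the log.
def build_load_kwargs(enable_flash_attention: bool) -> dict:
    if enable_flash_attention:
        # Flash Attention 2 expects fp16/bf16; passing a dtype silences the warning.
        return {"torch_dtype": "float16", "use_flash_attention_2": True}
    return {"use_flash_attention_2": False}

# Usage sketch (run the model on GPU after loading, as the warning suggests):
# model = UnmaskingLlamaForTokenClassification.from_pretrained(
#     "test/Llama-2-7b-hf", **build_load_kwargs(enable_flash_attention=False)
# ).to("cuda")
```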
@csroyli. Thank you very much! The transformers package I originally used had been modified by others, with some operators added for acceleration. That may have been the cause of the problem. Flash Attention 2 is officially supported in transformers 4.35.
@coder4nlp. Thanks for the updates. Glad to see our code works normally with Flash Attention. Please let us know if you have further comments!