LIN-SHANG / InstructERC

The official implementation of InstructERC

The performance of the model I reproduced does not meet the standards outlined in the paper. #14

Open stddddd opened 1 month ago

stddddd commented 1 month ago

I ran the Main Result Reproduction for LoRA + InstructERC based on Llama2, and the performance I got does not match the paper. The table below shows the comparison:

| W-F1 | IEMOCAP | MELD | EmoryNLP |
| --- | --- | --- | --- |
| reproduced | 65.47 | 66.96 | 39.16 |
| paper | 71.39 | 69.15 | 41.37 |
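For clarity, W-F1 here is the weighted F1 score over emotion classes. A minimal sketch of how it can be computed with scikit-learn (the labels below are illustrative, not the repo's evaluation code):

```python
from sklearn.metrics import f1_score

# Weighted F1: per-class F1 scores averaged with class support as weights,
# so frequent emotions contribute more to the final number.
y_true = ["happy", "sad", "neutral", "sad"]      # gold labels (example data)
y_pred = ["happy", "neutral", "neutral", "sad"]  # predictions (example data)

print(f"W-F1: {f1_score(y_true, y_pred, average='weighted') * 100:.2f}")
```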

Compared to the original code, I made only the following modifications:

1. data_percent: 1/64 -> 1
2. Set the LLaMA2 MODELPATH to my local model path; the Llama2 version I use is Llama-2-7b-chat-hf.
3. While running the code, I hit an error: `RuntimeError: probability tensor contains either inf, nan or element < 0`. To work around it, I added the following code to the Llama2 model file:

```python
probs = nn.functional.softmax(next_token_scores, dim=-1)

# Workaround: if any row of the sampling distribution contains NaN,
# replace that row with a one-hot distribution on token id 2
# (</s>, the EOS token in the Llama tokenizer) so that
# torch.multinomial does not raise.
nans = torch.isnan(probs)
if nans.any():
    idx = torch.argwhere(torch.sum(nans, 1))  # indices of rows containing NaN
    z = torch.zeros_like(probs[idx][0])
    z[0][2] = 1.0
    probs[idx] = z

next_tokens = torch.multinomial(probs, num_samples=1).squeeze(1)
```
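As a side note, the same guard can be written as a small standalone helper using `torch.nan_to_num` (an alternative sketch, not code from the repo; `eos_id=2` assumes the Llama tokenizer, where token id 2 is `</s>`):

```python
import torch

def sanitize_probs(probs: torch.Tensor, eos_id: int = 2) -> torch.Tensor:
    """Replace invalid sampling distributions with a one-hot on EOS."""
    # Rows whose distribution contains NaN or inf cannot be sampled from.
    bad = torch.isnan(probs).any(dim=-1) | torch.isinf(probs).any(dim=-1)
    probs = torch.nan_to_num(probs, nan=0.0, posinf=0.0, neginf=0.0)
    probs[bad] = 0.0
    probs[bad, eos_id] = 1.0  # force EOS for the broken rows
    return probs
```

That said, NaNs in the next-token distribution usually indicate an upstream numerical problem (fp16 overflow is a common culprit), so either patch masks the symptom rather than fixing the cause.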

What else should I modify to reach the performance mentioned in the paper?

LIN-SHANG commented 1 month ago

The large performance gap is indeed confusing; below are a few things that may help you:

LLaMA version: https://huggingface.co/meta-llama/Llama-2-7b-hf or https://huggingface.co/meta-llama/Llama-2-7b. I haven't tried any version of LLaMA Chat.
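For reference, loading the base (non-chat) checkpoint with `transformers` looks roughly like this (a minimal sketch; the repo's actual loading code may differ):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_PATH = "meta-llama/Llama-2-7b-hf"  # base model, not the -chat variant

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(MODEL_PATH)
```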

Besides, I haven't encountered the RuntimeError you report. For reference, my GPU, Nvidia driver, and CUDA versions are:

GPU: A100, Nvidia Driver: 470, CUDA: 11.7
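If it helps, the CUDA and GPU details can be printed from Python (the driver version comes from `nvidia-smi` on the command line):

```python
import torch

print(torch.__version__)              # PyTorch version
print(torch.version.cuda)             # CUDA version PyTorch was built against
print(torch.cuda.get_device_name(0))  # GPU model, e.g. an A100
```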

stddddd commented 1 month ago

I reproduced again using the environment you mentioned: A100, Nvidia Driver 470, and CUDA 11.7. I also downloaded the Llama-2-7b-hf checkpoint from the link you provided: https://huggingface.co/meta-llama/Llama-2-7b-hf.

However, the performance I got still does not match the paper. The table below shows the comparison:

| W-F1 | IEMOCAP | MELD | EmoryNLP |
| --- | --- | --- | --- |
| reproduced | 67.53 | 67.46 | 39.20 |
| paper | 71.39 | 69.15 | 41.37 |

Do you have any idea about it?


LIN-SHANG commented 1 month ago

It seems the gap has narrowed a bit. You can try adjusting the historical window (from 5 up to 12); this parameter has an impact on the best performance.
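For context, the historical window controls how many preceding utterances are packed into each prompt. Conceptually it is a sliding window over the conversation, something like this sketch (illustrative only; the function name and the repo's actual prompt construction differ):

```python
def build_history(utterances, t, window=12):
    """Return up to `window` utterances preceding turn t as context.

    Illustrative sketch of the historical-window idea, not the
    repo's actual prompt-construction code.
    """
    start = max(0, t - window)
    return utterances[start:t]
```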

stddddd commented 1 month ago

The historical window was already set to 12 in both of my previous reproductions.


stddddd commented 1 month ago

These are my hyper-parameter settings for the reproduction. What should I modify to improve the performance?

| hyper-parameter | IEMOCAP/MELD/EmoryNLP |
| --- | --- |
| GPU | A100 |
| Nvidia Driver | 470 |
| CUDA version | 11.7 |
| llm-model | llama-2-7b-hf |
| experiment setting | lora |
| historical window | 12 |
| accumulations | 8 |
| graphics card | 4 |
| speaker task | None |
| domain base | False |
| emotion prediction | False |
| data percent | 1.0 |
| LR | 2e-4 |
| eval batch size | 8 |
| num train epochs | 6 |
| save steps | 100000 |
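For completeness, the same settings as a config sketch (field names here are illustrative, not the repo's actual argument names):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ReproConfig:
    """Mirrors the hyper-parameter table above; names are hypothetical."""
    llm_model: str = "llama-2-7b-hf"
    experiment_setting: str = "lora"
    historical_window: int = 12
    accumulations: int = 8            # gradient accumulation steps
    graphics_cards: int = 4           # number of GPUs used
    speaker_task: Optional[str] = None
    domain_base: bool = False
    emotion_prediction: bool = False
    data_percent: float = 1.0
    lr: float = 2e-4
    eval_batch_size: int = 8
    num_train_epochs: int = 6
    save_steps: int = 100_000
```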