Hi, sir:
You mentioned “only output text related states” on line 734 of modeling_llama.py, and only the text states are used in the subsequent processing.
On line 733 you did the same thing: only the text states are kept before returning the results.
What is the reason for doing this?
I think the reason is that we do not need to apply a loss to the model's outputs at the input audio token positions. We only care about the text tokens and want to add a cross-entropy loss on top of them.
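To make that concrete, here is a minimal sketch in plain Python, not the actual code from modeling_llama.py. It assumes the audio tokens come first in the sequence and the text tokens follow. The names `text_only_loss` and `num_audio_tokens` are made up for illustration.

```python
import math


def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def text_only_loss(logits_per_position, labels, num_audio_tokens):
    """Average cross-entropy over the text positions only.

    Assumed (hypothetical) layout: the first `num_audio_tokens`
    positions are audio and everything after them is text.
    """
    # Drop the audio positions so they never contribute to the loss.
    text_logits = logits_per_position[num_audio_tokens:]
    text_labels = labels[num_audio_tokens:]
    if not text_logits:
        raise ValueError("no text positions to compute a loss on")
    losses = [cross_entropy(l, t) for l, t in zip(text_logits, text_labels)]
    return sum(losses) / len(losses)


# Toy example: 2 audio positions followed by 2 text positions, vocab size 3.
logits = [
    [5.0, 0.0, 0.0],  # audio position, sliced away
    [0.0, 5.0, 0.0],  # audio position, sliced away
    [0.0, 0.0, 0.0],  # text position
    [0.0, 0.0, 0.0],  # text position
]
labels = [-100, -100, 1, 2]  # audio labels are placeholders, never read

loss = text_only_loss(logits, labels, num_audio_tokens=2)
print(loss)
```

Whatever the model outputs at the audio positions has no effect on the loss here, which is why it is enough to keep only the text states before returning.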