OpenMOSS / AnyGPT

Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"

Speech-to-Speech task prompt #32

Open ehosseiniasl opened 1 month ago

ehosseiniasl commented 1 month ago

https://github.com/OpenMOSS/AnyGPT/blame/6404dbafccc10943be6bf6e24a4b99b3a6545501/anygpt/src/m_utils/prompter.py#L45

Hello, is this line correct? Is it meant for speech-to-speech conversation? If so, shouldn't the correct prompt be:

Speech-Response-Speech': '{speech} Please interpret the user\'s voice commands, provide text responses, and generate corresponding voice replies

JunZhan2000 commented 1 month ago

Hello, some of the prompts in this file were used for debugging. I suggest referring to this line instead: https://github.com/OpenMOSS/AnyGPT/blame/6404dbafccc10943be6bf6e24a4b99b3a6545501/anygpt/src/m_utils/prompter.py#L113

So for voice commands with voice replies, we actually use the 'Speech-Instruction' prompt.

ehosseiniasl commented 1 month ago

Thanks. Did you include direct speech response generation (without text response generation) for the base or chat model? Which speech response tasks are included in instruction tuning?

ehosseiniasl commented 1 month ago

Using Speech-Instruction on the chat model with to_modality=speech, the response is as below. Could you please explain the first line ( :  <-Res-> Gmarin misway"- How beautiful you look today!)? Does the model first generate a text reply and then speech, even if the output modality is speech only?

response:
 :  <-Res-> Gmarin misway"- How beautiful you look today!
  [AnyGPT] "Guhmyayayay!" - How beautiful you look today!  <sosp> <🗣️691> <🗣️691> <🗣️60> <🗣️868> <🗣️868> <🗣️906> <🗣️316> <🗣️1015> <🗣️965> <🗣️512> <🗣️512> <🗣️223> <🗣️223> <🗣️689> <🗣️35> <🗣️35> <🗣️35> <🗣️962> <🗣️57> <🗣️943> <🗣️699> <🗣️1> <🗣️118> <🗣️118> <🗣️118>

ehosseiniasl commented 1 month ago

Does the prompt include the transcription of the user's speech? The sentence after <-Res-> is the transcription of the speech instruction I provided.

JunZhan2000 commented 1 month ago

> does the prompt include user speech transcription? the sentence after <-Res-> is the transcription of speech instruction I provided

Hello, we provide some training data samples and related descriptions. Please refer to https://github.com/OpenMOSS/AnyGPT?tab=readme-ov-file#pretraining-and-sft

JunZhan2000 commented 1 month ago

In voice dialogue mode, the user provides a voice command. The model first recognizes the command as text, then generates a text reply, and finally generates the speech for that reply.
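
For anyone post-processing such output, here is a minimal sketch of separating the text reply from the speech tokens that follow it. It is not part of the AnyGPT codebase: `split_text_and_speech` is a hypothetical helper, and the `<sosp>` marker and `<🗣️N>` token format are assumed only from the response pasted earlier in this thread.

```python
import re
from typing import List, Tuple

# Start-of-speech marker, as it appears in the pasted response (assumption).
SOSP = "<sosp>"
# Speech tokens look like <🗣️691>; the emoji variation selector (U+FE0F) is optional here.
SPEECH_TOKEN = re.compile(r"<\U0001F5E3\uFE0F?(\d+)>")


def split_text_and_speech(raw: str) -> Tuple[str, List[int]]:
    """Return (text before <sosp>, speech token ids after it). Hypothetical helper."""
    text, _, speech = raw.partition(SOSP)
    # If <sosp> is absent, `speech` is empty and no ids are returned.
    ids = [int(n) for n in SPEECH_TOKEN.findall(speech)]
    return text.strip(), ids


# Shortened copy of the response line from this thread.
sample = '[AnyGPT] "Guhmyayayay!" - How beautiful you look today!  <sosp> <🗣️691> <🗣️691> <🗣️60>'
text, ids = split_text_and_speech(sample)
print(text)
print(ids)
```

The ids would still need to go through the speech tokenizer's decoder to produce audio; that step is outside this sketch.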