csuhan / OneLLM

[CVPR 2024] OneLLM: One Framework to Align All Modalities with Language

Vague output for audio #25

Open lixinghe1999 opened 1 month ago

lixinghe1999 commented 1 month ago

I slightly modified the audio eval code to run on my own dataset. However, the outputs are vague even when the audio is speech. They all look like the ones below:

  1. A device is beeping and it gets louder and louder.
  2. A machine is running and making a high pitched sound.
  3. A machine is running and then stops suddenly.

My code is attached below:

def inference_onellm(model, target_dtype, images, modal=['image']):
    if 'imu' in modal:
        inps = ['Describe the motion.'] * len(images)
    if 'audio' in modal:
        inps = ['Provide a one-sentence caption for the provided audio.'] * len(images)
        # inps = ['Provide a one-sentence action description for the provided audio.'] * len(images)
    if 'image' in modal:
        inps = ['Describe the scene.'] * len(images)
    images = images.cuda().to(target_dtype)
    prompts = []
    for inp in inps:
        conv = conv_templates["v1"].copy()        
        conv.append_message(conv.roles[0], inp)
        conv.append_message(conv.roles[1], None)
        prompts.append(conv.get_prompt())

    with torch.cuda.amp.autocast(dtype=target_dtype):
        responses = model.generate(prompts, images, 128, temperature=0.1, top_p=0.75, modal=modal)
        outputs = []
        for response, prompt in zip(responses, prompts):
            response = response[len(prompt):].split('###')[0]
            response = response.strip()
            outputs.append(response)
    return outputs

audio = torch.tensor(make_audio_features('tmp_onellm.wav', mel_bins=128).transpose(0, 1)[None, None])
result_audio = inference_onellm(model, target_dtype, audio, modal=['audio'])

csuhan commented 1 month ago

Hi @lixinghe1999, our model is mainly trained on natural sounds such as bird chirping, dog barking, and trains passing, so it struggles to recognize human speech. Here are two solutions to enhance it:

lixinghe1999 commented 1 month ago

Thank you for your rapid reply. However, it still outputs meaningless results for other sounds, such as musical instruments. Can you give me some hints on how to solve this? I believe retraining should not be necessary.

Could it be related to the audio duration? Since the IMU duration is fixed at 2 seconds, I also fixed the audio duration at 2 seconds.

csuhan commented 1 month ago

It may also be related to the sampling length. We sample 1024 frames in total. https://github.com/csuhan/OneLLM/blob/913638c0d385ff706aaed945ec87ee42bab4debb/data/data_utils.py#L81-L86
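
To make the sampling-length point concrete, here is a rough back-of-the-envelope sketch of how much of a 1024-frame input a 2-second clip would actually fill. Only the 1024-frame figure comes from the reply above. The 16 kHz sample rate, 25 ms window, 10 ms frame shift, and the idea that short clips are padded up to the target length are assumptions for illustration and should be checked against the linked `data_utils.py`.

```python
# Rough estimate of how many feature frames a clip yields and how much of a
# 1024-frame input would be padding. All constants except TARGET_FRAMES are
# assumptions, not values confirmed by the thread.

SAMPLE_RATE = 16_000    # assumed: audio is resampled to 16 kHz
FRAME_LENGTH_MS = 25    # assumed: 25 ms analysis window
FRAME_SHIFT_MS = 10     # assumed: 10 ms hop between frames
TARGET_FRAMES = 1024    # stated in the reply above


def num_frames(duration_s: float) -> int:
    """Frame count for a clip when no partial windows are kept at the edges."""
    num_samples = int(round(duration_s * SAMPLE_RATE))
    window = SAMPLE_RATE * FRAME_LENGTH_MS // 1000
    shift = SAMPLE_RATE * FRAME_SHIFT_MS // 1000
    if num_samples < window:
        return 0
    return 1 + (num_samples - window) // shift


def padding_fraction(duration_s: float) -> float:
    """Share of the target window left empty (assumes short clips are padded)."""
    frames = min(num_frames(duration_s), TARGET_FRAMES)
    return (TARGET_FRAMES - frames) / TARGET_FRAMES


if __name__ == "__main__":
    for seconds in (2.0, 5.0, 10.0):
        frames = num_frames(seconds)
        print(f"{seconds:4.1f} s -> {frames} frames, "
              f"{padding_fraction(seconds):.0%} of the window is padding")
```

Under these assumptions, 1024 frames correspond to roughly 10 seconds of audio, so a 2-second clip would fill only about a fifth of the input. If that holds for the actual feature extraction, evaluating with longer clips (closer to 10 seconds) would be one thing to try before concluding the model cannot describe the sound.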