YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

Which model is 7B (Default) and which is 13B (Beta)? #6

Open yl4579 opened 11 months ago

yl4579 commented 11 months ago

Are the models downloaded by inference.sh 7B (Default) or 13B (Beta)? I found the latter quite error-prone and unstable, which is similar to what I'm observing locally now. Is the model I have 13B (Beta)? If so, how do I get the 7B (Default) model instead?

YuanGongND commented 11 months ago

hi there,

All models are 7B. The errors you see might be due to GPU inconsistency.

GPU Issue for LTU-AS: We find that OpenAI Whisper features differ across GPU generations, which impacts the performance of LTU-AS since it takes Whisper features as input. In the paper, we always use features generated by older GPUs (Titan X). But we also release a checkpoint that uses features generated by newer GPUs (A5000/A6000); please manually switch the checkpoint to match whether you are running on an old or new GPU (by default this code uses the new-GPU checkpoint). A mismatch between training and inference GPUs does not completely break the model, but it does cause a performance drop.
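
As an illustration (not from the repo), one way to check for this feature drift is to compute Whisper encoder features locally and compare them against a reference tensor saved on a known-good machine. A minimal sketch, assuming the openai-whisper package; the model size ("large-v1") and the reference file name are assumptions:

```python
# Sketch: compare Whisper encoder features computed on this GPU against a
# reference tensor saved on another machine. Model size and reference file
# name are assumptions, not taken from the LTU-AS code.
import torch
import whisper  # pip install openai-whisper

model = whisper.load_model("large-v1")
audio = whisper.pad_or_trim(whisper.load_audio("sample.wav"))  # loads as 16 kHz mono
mel = whisper.log_mel_spectrogram(audio).to(model.device)
mel = mel.to(next(model.parameters()).dtype)  # match model precision (fp16 on GPU)

with torch.no_grad():
    feat = model.embed_audio(mel.unsqueeze(0))  # encoder output features

ref = torch.load("reference_feat.pt")  # hypothetical: saved on the other GPU
print("max abs diff:", (feat.float().cpu() - ref.float()).abs().max().item())
```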

A good way to test is to check whether your output is consistent with our online API.

-Yuan

yl4579 commented 11 months ago

The online API is not working right now. If it is different, though: since I'm running inference on an A40, how do I get it working the same way as the API?

YuanGongND commented 11 months ago

You can manually download a model to a local path; we provide 4 checkpoints at https://github.com/YuanGongND/ltu#pretrained-models.

Then change the checkpoint path at https://github.com/YuanGongND/ltu/blob/1963db6943bc409e42287bf5b4e6977982999fe2/src/ltu_as/inference_gradio.py#L52

-Yuan
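
In concrete terms, the switch is a one-line edit in src/ltu_as/inference_gradio.py. A sketch; the variable name matches the user's snippet later in this thread, and the filename below is a placeholder, not a real release name:

```python
# Point the checkpoint path at your local download; the filename here is a
# placeholder for whichever of the released checkpoints you downloaded.
eval_mdl_path = '../../pretrained_mdls/your_downloaded_checkpoint.bin'
```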

YuanGongND commented 11 months ago

There might be some other causes, e.g., the sampling rate needs to be 16 kHz.
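
If the input audio is at a different rate, a quick fix is to resample it first. A minimal sketch, assuming torchaudio (not part of the repo's own pipeline):

```python
# Sketch: resample an input file to the 16 kHz the model expects.
import torchaudio

wav, sr = torchaudio.load("sample.wav")
if sr != 16000:
    wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16000)
torchaudio.save("sample_16k.wav", wav, 16000)
```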

yl4579 commented 11 months ago

I just checked the output, and I'm pretty sure the default model produces output very similar to 13B (Beta) in the Hugging Face space (though it's down now). How do I get the 7B (Default) results?

YuanGongND commented 11 months ago

Please upload a sample WAV file and your question; I will check later.

Our MIT GPUs are currently down; I will check with our IT.

YuanGongND commented 11 months ago

There are three LoRA checkpoints; have you tried them all? https://github.com/YuanGongND/ltu#pretrained-models

Also, I restarted the HF space. Can you check if it is consistent with your local model? I am using the same checkpoint ("Long_sequence_exclude_noqa_new_gpu (Default)") as the default checkpoint online.

yl4579 commented 11 months ago

Now I have confirmed they give similar responses, but the responses differ from those I got a month ago (around early November). Did you change the model for your Hugging Face space?

YuanGongND commented 11 months ago

I do not remember clearly, but we did switch the checkpoint. You can try the "Original in Paper" checkpoint under LTU-AS: https://github.com/YuanGongND/ltu#pretrained-models.

It is an easy switch: just download the checkpoint and change https://github.com/YuanGongND/ltu/blob/1963db6943bc409e42287bf5b4e6977982999fe2/src/ltu_as/inference_gradio.py#L52 to point to the new checkpoint.

yl4579 commented 11 months ago

In your experience, which one is better? I changed to eval_mdl_path = '../../pretrained_mdls/ltu_ori_paper.bin' but got the following error:

```
RuntimeError                              Traceback (most recent call last)
Cell In[3], line 50
     47 temp, top_p, top_k = 0.1, 0.95, 500
     49 state_dict = torch.load(eval_mdl_path, map_location='cpu')
---> 50 miss, unexpect = model.load_state_dict(state_dict, strict=False)
     52 model.is_parallelizable = True
     53 model.model_parallel = True

File ~/.conda/envs/venv_ltu_as/lib/python3.10/site-packages/torch/nn/modules/module.py:1671, in Module.load_state_dict(self, state_dict, strict)
   1666         error_msgs.insert(
   1667             0, 'Missing key(s) in state_dict: {}. '.format(
   1668                 ', '.join('"{}"'.format(k) for k in missing_keys)))
   1670 if len(error_msgs) > 0:
-> 1671     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
   1672                        self.__class__.__name__, "\n\t".join(error_msgs)))
   1673 return _IncompatibleKeys(missing_keys, unexpected_keys)

RuntimeError: Error(s) in loading state_dict for PeftModelForCausalLM:
    size mismatch for base_model.model.model.audio_proj.1.weight: copying a param with shape torch.Size([4096, 768]) from checkpoint, the shape in current model is torch.Size([4096, 1280]).
```
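
The mismatch (a 768-wide checkpoint weight vs. a 1280-wide model weight) suggests a checkpoint from the wrong family is being loaded: LTU projects 768-dim audio features, while LTU-AS projects 1280-dim Whisper features. A minimal sketch for checking a downloaded file before loading it, using the key name taken from the error above:

```python
# Sketch: inspect the audio-projection input width to tell which family a
# checkpoint belongs to. The key name is copied from the error message.
import torch

sd = torch.load('../../pretrained_mdls/ltu_ori_paper.bin', map_location='cpu')
w = sd['base_model.model.model.audio_proj.1.weight']
print(w.shape)  # [4096, 768] -> LTU-style features; [4096, 1280] -> LTU-AS (Whisper)
```
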
YuanGongND commented 11 months ago

Did you download the one under LTU or under LTU-AS?

It is hard to say which is better; it depends on the task.

YuanGongND commented 11 months ago

BTW, you can ask the model multiple questions at one time, but I guess performance will be better if you ask them one by one. You can also tune the prompt for each task; e.g., you can say "give an answer anyway" to force the model to give an answer rather than saying "I don't know".
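
To make that concrete, a minimal sketch; run_inference is a stub standing in for however you invoke the model (e.g., the prediction function wired into inference_gradio.py), not a function from the repo:

```python
# Sketch: ask questions one at a time, appending a nudge so the model
# answers instead of replying "I don't know". `run_inference` is a stub
# standing in for the repo's actual inference entry point.
def run_inference(audio_path: str, prompt: str) -> str:
    raise NotImplementedError("wire this to your LTU/LTU-AS inference call")

questions = [
    "What sound events are present in the clip?",
    "Is there speech, and if so, what is the speaker's emotion?",
]
for q in questions:
    answer = run_inference("sample.wav", q + " Give an answer anyway.")
    print(q, "->", answer)
```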