YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

train_scripts #37

Open yangdongdong2000 opened 3 weeks ago

yangdongdong2000 commented 3 weeks ago

I noticed a lot of shell scripts in the train_scripts directory, and I want to ask what the difference between them is.

yangdongdong2000 commented 3 weeks ago

Besides, I am confused about where the audio tokens encoded by Whisper appear; it seems that there is nothing related to Whisper in finetune.py.

YuanGongND commented 2 weeks ago

Please check the readme.

prep_train downloads the things you need, e.g., model weights.

The two "toy" scripts help you set things up before large-scale training. We highly recommend trying these first, but you can skip them.

Then you should run stage 1 -> 2 -> 3; for stage 3 we recommend running v2.


It is easier to see how it works in the inference code: https://github.com/YuanGongND/ltu/blob/main/src/ltu_as/inference_gradio.py. In training, we extract the Whisper features, save them to disk, and just load the features from disk.
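As a rough illustration of that precompute-then-load idea (this is not the repo's actual preprocessing code; the checkpoint choice, file layout, and helper names below are assumptions), a minimal sketch with the openai-whisper package could look like:

```python
# Sketch only: cache Whisper encoder features on disk once, then let the
# training dataloader read the cached file instead of re-running Whisper.
import numpy as np
import torch
import whisper  # openai-whisper

model = whisper.load_model("large-v1")  # checkpoint choice is an assumption
model.eval()

def extract_and_save(wav_path: str, out_path: str) -> None:
    # Load audio, pad/trim to Whisper's 30 s window, compute the log-Mel spectrogram.
    audio = whisper.load_audio(wav_path)
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)

    # Run only the Whisper encoder to get the audio representation.
    with torch.no_grad():
        feat = model.encoder(mel.unsqueeze(0))  # (1, n_frames, d_model)

    # Save the feature; training later just loads this file.
    np.savez_compressed(out_path, feat=feat.squeeze(0).cpu().numpy())

def load_feature(out_path: str) -> torch.Tensor:
    # What a training-time dataset __getitem__ would do: read the cached feature only.
    return torch.from_numpy(np.load(out_path)["feat"])
```

This is why finetune.py itself contains no Whisper code: by training time the audio has already been turned into cached features.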

-Yuan

yangdongdong2000 commented 2 weeks ago

Thanks a lot!!!