Missing Tokenize Audio Info during Fine-tuning/Training

YuanGongND / ltu

Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".

Missing Tokenize Audio Info during Fine-tuning/Training #13

Open dingdongwang opened 10 months ago

dingdongwang commented 10 months ago

The step that tokenizes the audio (from 'input_ids') seems to be missing from both finetune.py and finetune_low_resource.py in the LTU repo. Where is the code for audio tokenization? I saw the 'load_audio()' function in inference_batch.py.

YuanGongND commented 10 months ago

The audio is loaded in the data collator: https://github.com/YuanGongND/ltu/blob/0fa0923f9c9d04346486a28477ba69b7d957130c/src/ltu/hf-dev/transformers-main/src/transformers/data/data_collator.py#L615-L616 (similar path for LTU-AS).

This cannot be done in finetune.py/finetune_low_resource.py because the audio has to be loaded on the fly; otherwise there will be an OOM (we cannot keep all the audio in memory).
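To illustrate the idea of loading on the fly, here is a minimal sketch of a collator that keeps only file paths in the dataset and loads audio when a batch is built. This is not the actual LTU code: `LazyAudioCollator`, `fake_load_audio`, and the `audio_id` / `audio_input` field names are hypothetical, and the loader returns dummy values instead of reading a real file.

```python
from typing import Callable, Dict, List

# Stand-in for a (frames x bins) audio feature matrix.
Features = List[List[float]]


def fake_load_audio(path: str) -> Features:
    """Hypothetical placeholder loader.

    A real loader would read the file at `path` and compute features;
    here we only return dummy values derived from the path length.
    """
    return [[float(len(path))] * 4 for _ in range(2)]


class LazyAudioCollator:
    """Loads audio per batch, so only file paths stay in memory between batches."""

    def __init__(self, load_fn: Callable[[str], Features]) -> None:
        self.load_fn = load_fn
        self.num_loads = 0  # counts loader calls, for illustration only

    def __call__(self, examples: List[Dict]) -> Dict[str, list]:
        batch: Dict[str, list] = {"input_ids": [], "audio_input": []}
        for example in examples:
            # Text tokens were prepared ahead of time and are cheap to store.
            batch["input_ids"].append(example["input_ids"])
            # Audio is loaded only now, when this batch is assembled.
            batch["audio_input"].append(self.load_fn(example["audio_id"]))
            self.num_loads += 1
        return batch


# The dataset holds paths, not audio.
dataset = [
    {"input_ids": [1, 2, 3], "audio_id": "a.wav"},
    {"input_ids": [4, 5], "audio_id": "bb.wav"},
]
collator = LazyAudioCollator(fake_load_audio)
batch = collator(dataset[:2])
```

With this layout the training script only ever sees tokenized text plus a path per example, which is why no audio loading step appears there.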

-Yuan