Open BhashaBluff opened 10 months ago

Hi, thanks for open-sourcing this amazing work. Is there any parameter to parallelize the model so it can run on smaller GPUs? I was not able to find one in the config. As suggested in the readme, "we should turn on model parallel to train on smaller GPUs". Is there a config parameter for it? I'm not able to find one.
hi there,
thanks for the question.
1/ What's your GPU setting? For model parallelization, you would need multiple GPUs in a single node.
2/ Were you able to run inference? If so, does the result look good? Inference requires fewer computational resources, and it basically already implements model parallelism.
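For reference, a minimal sketch (not the LTU code itself, and the checkpoint name is just a placeholder) of how Hugging Face-style model parallelism typically shards a causal LM across the GPUs in one node for inference:

```python
# Minimal sketch, not the actual LTU loader: shard a causal LM across all
# visible GPUs in one node using device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = "decapoda-research/llama-7b-hf"  # placeholder checkpoint, not the LTU weights
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",  # splits layers across the available GPUs automatically
)
print(model.hf_device_map)  # shows which device each block was placed on
```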
I will add the model parallel training script soon.
-Yuan
Hey, thanks for the prompt response. 1.) I was finetuning the model on a V100 station with 4 GPUs of 32 GB each. I was trying to finetune with device_map = "auto" on line 126 of finetune.py in ltu_as, but it gave a "NotImplementedError: Cannot copy out of meta tensor; no data!" I then commented out the device_map line and started fine-tuning, but it gave an OOM error.
Done, please see LTU and LTU-AS. Your resources should be enough to train the model; remember to tune the micro_batch_size to the max number that your GPUs can run.
Regarding the performance: for LTU, it should be exactly the same as what we described in the paper; for LTU-AS, there might be a mismatch between the training and inference GPUs. Also, the model only takes input at a 16 kHz sampling rate and a 10-second audio length. You can check whether your local inference result is similar to our online demo.
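A minimal preprocessing sketch that enforces the 16 kHz / 10-second input format (the helper name and the use of torchaudio are assumptions, not the repo's own loader):

```python
# Sketch only, not the repo's loader: resample to 16 kHz and pad/trim to 10 s.
import torch
import torchaudio

def load_10s_16khz(path):
    wav, sr = torchaudio.load(path)         # wav: (channels, samples)
    wav = wav.mean(dim=0, keepdim=True)     # downmix to mono
    if sr != 16000:
        wav = torchaudio.functional.resample(wav, sr, 16000)
    target = 16000 * 10                     # 10 seconds at 16 kHz
    if wav.shape[1] < target:
        wav = torch.nn.functional.pad(wav, (0, target - wav.shape[1]))
    return wav[:, :target]
```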
-Yuan
Hi, thanks a lot.
You are welcome, please let me know if there's any issue. Remember to set micro_batch_size larger; it can be something like 16/32 or even larger.
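To illustrate why a larger micro_batch_size helps, here is a small sketch of the usual relationship in alpaca-lora-style trainers (the variable names and the global batch size of 128 are assumptions, not values from the LTU configs): the micro batch is what must fit on each GPU, and gradient accumulation keeps the effective batch size fixed while micro_batch_size grows.

```python
# Sketch only; names and the global batch size are assumed, not from the LTU configs.
batch_size = 128            # effective (global) batch size the run should see
micro_batch_size = 32       # raise this to the largest value that avoids OOM
gradient_accumulation_steps = batch_size // micro_batch_size
print(gradient_accumulation_steps)  # 4: larger micro batches mean fewer accumulation steps
```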