atosystem / SpeechCLIP

SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model, Accepted to IEEE SLT 2022
https://atosystem.github.io/blogs/speechclip
BSD 3-Clause "New" or "Revised" License

derive embeddings: cascaded models #8

Closed lokesh12345678910 closed 5 months ago

lokesh12345678910 commented 5 months ago

I'm able to use example.py for inference with the base parallel Flickr model, but I get the following error when I use the cascaded models instead, i.e. any of:

model_fp = "slt_ckpts/SpeechCLIP/base/flickr/cascaded/epoch_58-step_6902-val_recall_mean_1_7.7700.ckpt"
model_fp = "slt_ckpts/SpeechCLIP/large/flickr/cascaded/epoch_187-step_21995-val_recall_mean_10_62.7700.ckpt"
model_fp = "slt_ckpts/SpeechCLIP/large/coco/cascaded/epoch_12-step_28794-val_recall_mean_10_36.1455.ckpt"

Traceback (most recent call last):
  File "/work/07469/lpugalen/ls6/SpeechCLIP/example.py", line 61, in <module>
    speechFeatVector_baseFlickrCascasdedModel = baseFlickrCascasdedModel.encode_speech(wav=wav_data)  # ["cascaded_audio_feat"]
  File "/work/07469/lpugalen/ls6/SpeechCLIP/avssl/model/kwClip.py", line 1340, in encode_speech
    cascaded_audio_feat, vq_results, keywords = self.cascaded_branch(
  File "/work/07469/lpugalen/ls6/SpeechCLIP/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/avssl/model/kwClip.py", line 914, in forward
    audio_feat = self.clip.encode_keywords(keywords, self.keyword_num)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/avssl/module/clip_official.py", line 249, in encode_keywords
    x = self.model.token_embedding(text)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
  File "/work/07469/lpugalen/ls6/SpeechCLIP/torch/nn/functional.py", line 2237, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument index in method wrapper_CUDA__index_select)
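
For reference, roughly the code that produces this (a sketch adapted from example.py; the class name, wav loading, and resampling are from my own setup and may differ from yours):

import torch
import torchaudio
from avssl.model import KWClip_GeneralTransformer  # class name as used in example.py, to my understanding

model_fp = "slt_ckpts/SpeechCLIP/base/flickr/cascaded/epoch_58-step_6902-val_recall_mean_1_7.7700.ckpt"

# Load the cascaded checkpoint and move it to the GPU
baseFlickrCascasdedModel = KWClip_GeneralTransformer.load_from_checkpoint(model_fp)
baseFlickrCascasdedModel = baseFlickrCascasdedModel.to("cuda")
baseFlickrCascasdedModel.eval()

# Load a waveform and resample to 16 kHz (the HuBERT speech encoder expects 16 kHz input)
wav, sr = torchaudio.load("sample.wav")
wav = torchaudio.functional.resample(wav, sr, 16_000).squeeze(0).to("cuda")
wav_data = [wav]

with torch.no_grad():
    # This is the call that raises the RuntimeError above for the cascaded checkpoints
    speechFeatVector_baseFlickrCascasdedModel = baseFlickrCascasdedModel.encode_speech(wav=wav_data)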

lokesh12345678910 commented 5 months ago

Closing, as this code worked when both wav_data and the model were on the CPU.

lokesh12345678910 commented 5 months ago

For some reason, this code doesn't work when both the model and the audio file are loaded onto CUDA. Using the CPU is proving time-consuming, so I'd prefer to use the GPU if possible.
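
To be concrete, the two placements I tried (illustrative; baseFlickrCascasdedModel and wav_data as in the sketch above):

# CPU placement: works, but slow
model_cpu = baseFlickrCascasdedModel.to("cpu")
out = model_cpu.encode_speech(wav=[w.to("cpu") for w in wav_data])

# GPU placement: raises the device-mismatch RuntimeError above
model_gpu = baseFlickrCascasdedModel.to("cuda")
out = model_gpu.encode_speech(wav=[w.to("cuda") for w in wav_data])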

atosystem commented 5 months ago

@lokesh12345678910 Try adding the following lines at the beginning of the encode_speech function:

# update device information to clip model
self.clip.update_device(self.device)
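
So the top of encode_speech in avssl/model/kwClip.py would look roughly like this (a sketch; only the two added lines matter, the rest of the body stays unchanged):

def encode_speech(self, wav):
    # update device information to clip model:
    # the wrapped CLIP builds keyword token tensors internally (see
    # encode_keywords in clip_official.py), so it needs to know which
    # device the LightningModule currently lives on.
    self.clip.update_device(self.device)

    # ... original body continues unchanged (cascaded_branch, etc.) ...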
lokesh12345678910 commented 5 months ago

It works on the GPU now, thanks!