ShampooWang / SpeechCLIP_plus

SpeechCLIP+: Self-supervised multi-task representation learning for speech via CLIP and speech-image data. Accepted to ICASSP 2024, Self-supervision in Audio, Speech, and Beyond (SASB) workshop.

Ordered namespace error #2

Closed lokesh12345678910 closed 7 months ago

lokesh12345678910 commented 7 months ago

(speechCLIP) v330-010.ls6(1009)$ python largePlusImageAudioSim.py ../ART_PPA_WAB_CatRescue/CatRescuePackage.png Cat ../ART_PPA_WAB_CatRescue/SE_PreTx_WAB_CatRescue_WAV/ SE_PreTx_CatRescue
Using cache found in /home1/07469/lpugalen/.cache/torch/hub/s3prl_cache/4a54d64fa42b41e39db994c958d8107d5785a100f38c6eba680b6a3cc79babb3 for https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt
2024-04-08 11:12:02 | INFO | fairseq.tasks.hubert_pretraining | current directory is /work/07469/lpugalen/ls6/SpeechCLIP
2024-04-08 11:12:02 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': '/checkpoint/wnhsu/data/librivox', 'fine_tuning': False, 'labels': ['lyr9.km500'], 'label_dir': '/checkpoint/wnhsu/experiments/hubert/kmeans_20210121/km_dataset_librivox.model_iter_2.all', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2024-04-08 11:12:02 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': layer_norm, 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'final_dim': 768, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 1.0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': True, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
/work/07469/lpugalen/ls6/SpeechCLIP/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
2024-04-08 11:12:08 | INFO | avssl.module.speech_encoder_plus | Normalize waveform = (True)
[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.
2024-04-08 11:12:08 | INFO | avssl.module.speech_encoder_plus | Loaded s3prl speech encoder (hubert_large_ll60k): out_dim = 1024 layer_drop = 0.0
2024-04-08 11:12:08 | INFO | avssl.module.speech_encoder_plus | Using weighted sum for all hiddenstates(25)
2024-04-08 11:12:14 | WARNING | avssl.module.clip_official | Reduce text embedding to size of 8112
Traceback (most recent call last):
  File "/work/07469/lpugalen/ls6/SpeechCLIP/largePlusImageAudioSim.py", line 39, in <module>
    largePlusFlickrCascasdedModel = avssl.model.KWClip_GeneralTransformer.load_from_checkpoint(largePlusFlickrCascadedModelPath).to(device)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/pytorch_lightning/core/saving.py", line 156, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/pytorch_lightning/core/saving.py", line 198, in _load_model_state
    model = cls(**_cls_kwargs)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/avssl/model/kwClip.py", line 1122, in __init__
    super().__init__(config)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/avssl/model/kwClip.py", line 89, in __init__
    self.keyword_num = self.config.model_settings.cascaded_branch.keyword.number
  File "/work/07469/lpugalen/ls6/SpeechCLIP/avssl/base/ordered_namespace.py", line 68, in __getattr__
    return super(OrderedNamespace, self).__getattribute__(key)
AttributeError: 'OrderedNamespace' object has no attribute 'number'

(speechCLIP) v330-010.ls6(1010)$ python largePlusImageAudioSim.py ../ART_PPA_WAB_CatRescue/CatRescuePackage.png Cat ../ART_PPA_WAB_CatRescue/SE_PreTx_WAB_CatRescue_WAV/ SE_PreTx_CatRescue
2024-04-08 11:14:32 | INFO | avssl.module.speech_encoder_plus | Normalize hidden states (s3prl)
Using cache found in /home1/07469/lpugalen/.cache/torch/hub/s3prl_cache/4a54d64fa42b41e39db994c958d8107d5785a100f38c6eba680b6a3cc79babb3 for https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt
2024-04-08 11:14:33 | INFO | fairseq.tasks.hubert_pretraining | current directory is /work/07469/lpugalen/ls6/SpeechCLIP
2024-04-08 11:14:33 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': '/checkpoint/wnhsu/data/librivox', 'fine_tuning': False, 'labels': ['lyr9.km500'], 'label_dir': '/checkpoint/wnhsu/experiments/hubert/kmeans_20210121/km_dataset_librivox.model_iter_2.all', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2024-04-08 11:14:33 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': layer_norm, 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'final_dim': 768, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 1.0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': True, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
/work/07469/lpugalen/ls6/SpeechCLIP/torch/nn/utils/weight_norm.py:28: UserWarning: torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.
  warnings.warn("torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.")
2024-04-08 11:14:38 | INFO | avssl.module.speech_encoder_plus | Normalize waveform = (True)
[W NNPACK.cpp:64] Could not initialize NNPACK! Reason: Unsupported hardware.
2024-04-08 11:14:39 | INFO | avssl.module.speech_encoder_plus | Loaded s3prl speech encoder (hubert_large_ll60k): out_dim = 1024 layer_drop = 0.0
2024-04-08 11:14:39 | INFO | avssl.module.speech_encoder_plus | Using weighted sum for all hiddenstates(25)
2024-04-08 11:14:39 | INFO | avssl.module.weighted_sum | Normalize feature beforeweighted sum
2024-04-08 11:14:45 | WARNING | avssl.module.clip_official | Reduce text embedding to size of 8112
Traceback (most recent call last):
  File "/work/07469/lpugalen/ls6/SpeechCLIP/largePlusImageAudioSim.py", line 42, in <module>
    largePlusFlickrHybridModel = avssl.model.KWClip_GeneralTransformer.load_from_checkpoint(largePlusFlickrHybridModelPath).to(device)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/pytorch_lightning/core/saving.py", line 156, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/pytorch_lightning/core/saving.py", line 198, in _load_model_state
    model = cls(**_cls_kwargs)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/avssl/model/kwClip.py", line 1122, in __init__
    super().__init__(config)
  File "/work/07469/lpugalen/ls6/SpeechCLIP/avssl/model/kwClip.py", line 89, in __init__
    self.keyword_num = self.config.model_settings.cascaded_branch.keyword.number
  File "/work/07469/lpugalen/ls6/SpeechCLIP/avssl/base/ordered_namespace.py", line 68, in __getattr__
    return super(OrderedNamespace, self).__getattribute__(key)
AttributeError: 'OrderedNamespace' object has no attribute 'number'

lokesh12345678910 commented 7 months ago

I was running this from my SpeechCLIP folder; let me try my SpeechCLIP+ folder.
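
The AttributeError above is what nested attribute-style config access produces when the code expects a key that the loaded config does not define, which would fit loading a SpeechCLIP+ checkpoint from the original SpeechCLIP checkout. The following is only a minimal sketch of that failure mode: `types.SimpleNamespace` stands in for `OrderedNamespace` (whose implementation is not shown in this thread), the config contents are made up, and `get_nested` is a hypothetical helper. Only the attribute path is taken from the traceback.

```python
from types import SimpleNamespace


def get_nested(config, dotted_path, default=None):
    """Walk a dotted attribute path; return `default` if any part is missing."""
    node = config
    for name in dotted_path.split("."):
        if not hasattr(node, name):
            return default
        node = getattr(node, name)
    return node


# Hypothetical config: the `keyword` section exists but defines no `number`,
# which matches the shape of the failure reported in the traceback.
config = SimpleNamespace(
    model_settings=SimpleNamespace(
        cascaded_branch=SimpleNamespace(keyword=SimpleNamespace())
    )
)

path = "model_settings.cascaded_branch.keyword.number"
try:
    keyword_num = config.model_settings.cascaded_branch.keyword.number
except AttributeError as err:
    # Same kind of error as in the traceback, raised by the stand-in type.
    print(f"direct access failed: {err}")

# Probing the path first shows which key is absent without raising.
print(f"{path} -> {get_nested(config, path)}")
```

Probing the config this way before constructing the model can show quickly whether the checkpoint and the code base agree on the config layout.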

ShampooWang commented 7 months ago

Hi,

Thanks for pointing out these issues. I will try to handle them tomorrow.

ShampooWang commented 7 months ago

OK, if there are still issues, I will try to fix them. Thanks!

lokesh12345678910 commented 7 months ago

(SpeechCLIP+) login2.ls6(1047)$ python largePlusImageAudioSim.py ../ART_PPA_WAB_CatRescue/CatRescuePackage.png Cat ../ART_PPA_WAB_CatRescue/SE_PreTx_WAB_CatRescue_WAV/ SE_PreTx_CatRescue
2024-04-08 11:53:37 | INFO | numexpr.utils | Note: detected 256 virtual cores but NumExpr set to maximum of 64, check "NUMEXPR_MAX_THREADS" environment variable.
2024-04-08 11:53:37 | INFO | numexpr.utils | Note: NumExpr detected 256 cores but "NUMEXPR_MAX_THREADS" not set, so enforcing safe limit of 8.
Using cache found in /home1/07469/lpugalen/.cache/torch/hub/s3prl_cache/4a54d64fa42b41e39db994c958d8107d5785a100f38c6eba680b6a3cc79babb3 for https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt
Traceback (most recent call last):
  File "/work/07469/lpugalen/ls6/SpeechCLIP_plus/largePlusImageAudioSim.py", line 39, in <module>
    largePlusFlickrCascasdedModel = avssl.model.KWClip_GeneralTransformer.load_from_checkpoint(largePlusFlickrCascadedModelPath).to(device)
  File "/home1/07469/lpugalen/.local/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 156, in load_from_checkpoint
    model = cls._load_model_state(checkpoint, strict=strict, **kwargs)
  File "/home1/07469/lpugalen/.local/lib/python3.9/site-packages/pytorch_lightning/core/saving.py", line 198, in _load_model_state
    model = cls(**_cls_kwargs)
  File "/work/07469/lpugalen/ls6/SpeechCLIP_plus/avssl/model/kwClip.py", line 679, in __init__
    super().__init__(config)
  File "/work/07469/lpugalen/ls6/SpeechCLIP_plus/avssl/model/kwClip.py", line 66, in __init__
    self.audio_encoder = FairseqSpeechEncoder_Hubert(**config.audio_encoder)
  File "/work/07469/lpugalen/ls6/SpeechCLIP_plus/avssl/module/speech_encoder_plus.py", line 387, in __init__
    model, _, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([ckpt])
  File "/home1/07469/lpugalen/.local/lib/python3.9/site-packages/fairseq/checkpoint_utils.py", line 421, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/home1/07469/lpugalen/.local/lib/python3.9/site-packages/fairseq/checkpoint_utils.py", line 315, in load_checkpoint_to_cpu
    state = torch.load(f, map_location=torch.device("cpu"))
  File "/home1/07469/lpugalen/.local/lib/python3.9/site-packages/torch/serialization.py", line 1005, in load
    with _open_zipfile_reader(opened_file) as opened_zipfile:
  File "/home1/07469/lpugalen/.local/lib/python3.9/site-packages/torch/serialization.py", line 457, in __init__
    super().__init__(torch._C.PyTorchFileReader(name_or_buffer))
RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

This was in my SpeechCLIP+ folder. Are the ckpt files valid? I'm pretty sure I downloaded them completely with the bash script.
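
The `failed finding central directory` message comes from PyTorch's zip reader, and a truncated or partially written file is one common cause. The traceback fails inside fairseq's loader, so the file in question may be the cached HuBERT checkpoint and not the SpeechCLIP+ `model.ckpt`. Below is a rough standard-library check, assuming the checkpoint was saved in PyTorch's zip-based format (a legacy non-zip checkpoint would report False even when intact). `looks_like_complete_zip` is a hypothetical helper name, and the demo files are throwaway stand-ins, not real checkpoints.

```python
import tempfile
import zipfile
from pathlib import Path


def looks_like_complete_zip(path):
    """Return True if `path` is a file with a readable zip end-of-central-directory record."""
    p = Path(path)
    return p.is_file() and zipfile.is_zipfile(p)


# Demo on throwaway files: a complete archive and a truncated copy of it.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "good.ckpt"
    with zipfile.ZipFile(good, "w") as zf:
        zf.writestr("data.pkl", b"x" * 1000)

    # Cutting the file in half drops the central directory at the end,
    # which is roughly what an interrupted download looks like.
    truncated = Path(tmp) / "truncated.ckpt"
    data = good.read_bytes()
    truncated.write_bytes(data[: len(data) // 2])

    good_ok = looks_like_complete_zip(good)
    truncated_ok = looks_like_complete_zip(truncated)

print(f"complete archive: {good_ok}, truncated archive: {truncated_ok}")
```

Running the helper on both the downloaded `.ckpt` files and the cached `.pt` file would narrow down which file, if any, is incomplete.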

ShampooWang commented 7 months ago

I just ran download_ckpts.sh to download the checkpoints and used the code below to load them.

from avssl.model import KWClip_GeneralTransformer

largePlusFlickrCascadedModelPath = "/mnt/md1/user_jeffwang/SpeechCLIP-plus/icassp_sasb_ckpts/SpeechCLIP+/large/flickr/cascaded/model.ckpt"
largePlusFlickrCascasdedModel = KWClip_GeneralTransformer.load_from_checkpoint(largePlusFlickrCascadedModelPath).cuda()

Everything looks fine to me. Here is the output:

2024-04-09 05:01:20 | WARNING | s3prl.upstream.espnet_hubert.expert | ESPnet is not installed, cannot use espnet_hubert upstream
2024-04-09 05:01:23 | INFO | torch.distributed.nn.jit.instantiator | Created a temporary directory at /tmp/tmpmzder8k9
2024-04-09 05:01:23 | INFO | torch.distributed.nn.jit.instantiator | Writing /tmp/tmpmzder8k9/_remote_module_non_sriptable.py
Using cache found in /home/jeffwang/.cache/torch/hub/s3prl_cache/4a54d64fa42b41e39db994c958d8107d5785a100f38c6eba680b6a3cc79babb3 for https://dl.fbaipublicfiles.com/hubert/hubert_large_ll60k.pt
2024-04-09 05:01:25 | INFO | fairseq.tasks.hubert_pretraining | current directory is /mnt/md1/user_jeffwang/SpeechCLIP-plus
2024-04-09 05:01:25 | INFO | fairseq.tasks.hubert_pretraining | HubertPretrainingTask Config {'_name': 'hubert_pretraining', 'data': '/checkpoint/wnhsu/data/librivox', 'fine_tuning': False, 'labels': ['lyr9.km500'], 'label_dir': '/checkpoint/wnhsu/experiments/hubert/kmeans_20210121/km_dataset_librivox.model_iter_2.all', 'label_rate': 50.0, 'sample_rate': 16000, 'normalize': True, 'enable_padding': False, 'max_keep_size': None, 'max_sample_size': 250000, 'min_sample_size': 32000, 'single_target': False, 'random_crop': True, 'pad_audio': False}
2024-04-09 05:01:25 | INFO | fairseq.models.hubert.hubert | HubertModel Config: {'_name': 'hubert', 'label_rate': 50.0, 'extractor_mode': layer_norm, 'encoder_layers': 24, 'encoder_embed_dim': 1024, 'encoder_ffn_embed_dim': 4096, 'encoder_attention_heads': 16, 'activation_fn': gelu, 'layer_type': transformer, 'dropout': 0.0, 'attention_dropout': 0.0, 'activation_dropout': 0.0, 'encoder_layerdrop': 0.0, 'dropout_input': 0.0, 'dropout_features': 0.0, 'final_dim': 768, 'untie_final_proj': True, 'layer_norm_first': True, 'conv_feature_layers': '[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2', 'conv_bias': False, 'logit_temp': 0.1, 'target_glu': False, 'feature_grad_mult': 1.0, 'mask_length': 10, 'mask_prob': 0.8, 'mask_selection': static, 'mask_other': 0.0, 'no_mask_overlap': False, 'mask_min_space': 1, 'mask_channel_length': 10, 'mask_channel_prob': 0.0, 'mask_channel_selection': static, 'mask_channel_other': 0.0, 'no_mask_channel_overlap': False, 'mask_channel_min_space': 1, 'conv_pos': 128, 'conv_pos_groups': 16, 'latent_temp': [2.0, 0.5, 0.999995], 'skip_masked': False, 'skip_nomask': True, 'checkpoint_activations': False, 'required_seq_len_multiple': 2, 'depthwise_conv_kernel_size': 31, 'attn_type': '', 'pos_enc_type': 'abs', 'fp16': False}
2024-04-09 05:01:31 | INFO | avssl.module.speech_encoder_plus | Normalize waveform = (True)
2024-04-09 05:01:31 | INFO | avssl.module.speech_encoder_plus | Loaded s3prl speech encoder (hubert_large_ll60k): out_dim = 1024 layer_drop = 0.0
2024-04-09 05:01:31 | INFO | avssl.module.speech_encoder_plus | Using weighted sum for all hiddenstates(25)
2024-04-09 05:01:41 | WARNING | avssl.module.clip_official | Reduce text embedding to size of 8112
2024-04-09 05:01:42 | INFO | avssl.model.kw_branches | Create Cascaded Branch Plus
2024-04-09 05:01:42 | INFO | avssl.model.kw_branches | Using KW_CascadedBranchPlus
2024-04-09 05:01:42 | INFO | avssl.model.kw_branches | Using self-attention before downsampling
2024-04-09 05:01:42 | INFO | avssl.model.kw_branches | Using MultiheadAttentionAndNorm as KW_CascadedBranchPlus
2024-04-09 05:01:42 | INFO | avssl.model.kw_branches | kw_projection dims:[1024, 1024, 768] droupout:0.1
2024-04-09 05:01:42 | INFO | avssl.module.speechclip_c_modules.my_vector_quantizer | Setting vq temp fixed=0.1
2024-04-09 05:01:42 | INFO | avssl.module.speechclip_c_modules.kw_bn | Initialize BatchNorm weight and bias learnable=(True) with token embeddings w/ scale=1.0
2024-04-09 05:01:42 | INFO | avssl.module.cif | Apply scaling strategy step: 5000
2024-04-09 05:01:42 | INFO | avssl.model.kw_branches | Using cif downsampling method

Maybe something is wrong with your environment? For reference, I am using torch 1.11.0+cu113.

lokesh12345678910 commented 7 months ago

Yes, this was an installation error on my end. It worked once I set up the environment again from scratch.
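
Since the fix turned out to be a clean environment, one quick way to compare two setups is to print the installed version of each package and the location Python would import it from. The sketch below uses only the standard library. The package list is an assumption based on the tracebacks in this thread, and `describe_package` is a hypothetical helper.

```python
import importlib.util
from importlib import metadata


def describe_package(module_name, dist_name=None):
    """Return (version, origin) for a package; either value is None if not found."""
    try:
        version = metadata.version(dist_name or module_name)
    except metadata.PackageNotFoundError:
        version = None
    try:
        # For a top-level name this locates the module without importing it.
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        spec = None
    origin = spec.origin if spec is not None else None
    return version, origin


# Module name -> distribution name (they differ for pytorch_lightning).
packages = {
    "torch": "torch",
    "pytorch_lightning": "pytorch-lightning",
    "fairseq": "fairseq",
}
for module_name, dist_name in packages.items():
    version, origin = describe_package(module_name, dist_name)
    print(f"{module_name}: version={version} origin={origin}")
```

In the first traceback, `torch` and `pytorch_lightning` resolve to paths inside the `SpeechCLIP` checkout, so the origin is worth comparing against where site-packages is expected to be.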