LAION-AI / CLAP

Contrastive Language-Audio Pretraining
https://arxiv.org/abs/2211.06687
Creative Commons Zero v1.0 Universal

Token level text embeddings #139

Closed gkv91 closed 7 months ago

gkv91 commented 10 months ago

Hi,

Thanks for sharing the work.

Currently, `model.get_text_embedding` returns one embedding (512-D) per sentence, as in the sketch below. How can I extract token-level embeddings (i.e., n_tokens x 512-D)?

Thanks, Goutham
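
For reference, a minimal sketch of the current behavior (this assumes the default pretrained checkpoint downloaded by `load_ckpt()`; shapes may differ for other checkpoints):

```python
import laion_clap

model = laion_clap.CLAP_Module(enable_fusion=False)
model.load_ckpt()  # downloads the default pretrained checkpoint

texts = ["a dog barking in the rain", "piano music"]
emb = model.get_text_embedding(texts, use_tensor=True)

# One pooled embedding per sentence, not per token:
print(emb.shape)  # torch.Size([2, 512])
```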

lukewys commented 8 months ago

Hi, it is possible! You will need to do a bit of hacking on your own. Specifically, you need to remove the `pooler_output` here and change it to a key whose return shape is B x T x D: https://github.com/LAION-AI/CLAP/blob/main/src/laion_clap/clap_module/model.py#L627
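
For concreteness, a sketch of the two output keys, demonstrated directly on the Hugging Face RoBERTa encoder that the text branch wraps (this illustrates the change at `model.py#L627`; it is not a patch to the repo itself, and using `roberta-base` standalone here is an assumption for the demo):

```python
import torch
from transformers import RobertaModel, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
encoder = RobertaModel.from_pretrained("roberta-base")

batch = tokenizer(["a dog barking in the rain"], return_tensors="pt")
with torch.no_grad():
    out = encoder(**batch)

# What model.py#L627 currently selects: one pooled vector per sentence (B x D).
print(out["pooler_output"].shape)      # torch.Size([1, 768])

# The key with shape B x T x D, i.e., one vector per token:
print(out["last_hidden_state"].shape)  # torch.Size([1, T, 768])
```

One caveat on the design: if you then apply CLAP's text projection head to each token vector, you get the requested n_tokens x 512 shape, but that head was trained only on the pooled sentence embedding, so per-token projections are not contrastively aligned with audio in the same way and should be treated with care.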