SeanLee97 / AnglE

Train and Infer Powerful Sentence Embeddings with AnglE | 🔥 SOTA on STS and MTEB Leaderboard
https://arxiv.org/abs/2309.12871
MIT License

Which feature to use? #19

Closed yuanze1024 closed 9 months ago

yuanze1024 commented 9 months ago

Thank you for your work. I'm new to NLP, and I want to know which feature to use to cluster similar sentences.

After running UAE (non-retrieval), I get an (n, 1024) feature. Should I use the first token's ([CLS]) feature, the same as with E5?

And BTW, I found that with E5, "A red teddy bear wearing a blue shirt" is very similar to "A blue teddy bear wearing a red shirt". Similarly, "A man riding a horse" ends up close to "A horse riding a man". Is that a problem for all embedding models?

SeanLee97 commented 9 months ago

hi @yuanze1024, thanks for following our work.

1) Right. If you get an (n, 1024) feature, you should take the first token's vector as the sentence embedding ([CLS] pooling). Alternatively, you can use our library angle_emb to extract sentence embeddings, as illustrated in the UAE (non-retrieval) example (see the first sketch after this list).

2) I think so, because such hard cases are rare in existing training datasets. If you want to improve performance on these hard cases, you should collect more hard examples and fine-tune the model on them (see the second sketch below).
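For concreteness, here is a minimal sketch of point 1 using plain transformers with the WhereIsAI/UAE-Large-V1 checkpoint (the checkpoint choice here is an assumption, not a prescription from this thread); it also reproduces the cosine-similarity check behind the hard-case observation in the question:

```python
import torch
from transformers import AutoModel, AutoTokenizer

# UAE (non-retrieval) checkpoint; assumed here for illustration.
tokenizer = AutoTokenizer.from_pretrained('WhereIsAI/UAE-Large-V1')
model = AutoModel.from_pretrained('WhereIsAI/UAE-Large-V1')
model.eval()

sentences = [
    'A red teddy bear wearing a blue shirt',
    'A blue teddy bear wearing a red shirt',
]
inputs = tokenizer(sentences, padding=True, return_tensors='pt')

with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (n, seq_len, 1024)

# First-token ([CLS]) pooling: one 1024-dim vector per sentence.
embeddings = hidden[:, 0]  # (n, 1024)

# Cosine similarity between the two "hard case" sentences.
sim = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
print(f'cosine similarity: {sim.item():.3f}')
```

Once you have the (n, 1024) matrix, clustering similar sentences is standard, e.g. k-means or agglomerative clustering over cosine distances.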
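And a rough sketch of point 2's fine-tuning suggestion with angle_emb. This follows the README's fine-tuning flow, but the exact keyword arguments may differ across library versions, and the two hard-case pairs are purely hypothetical placeholder data:

```python
from datasets import Dataset
from angle_emb import AnglE, AngleDataTokenizer

# Hypothetical hard-case pairs: label near 1 = similar, near 0 = dissimilar.
pairs = [
    {'text1': 'A man riding a horse', 'text2': 'A horse riding a man', 'label': 0.1},
    {'text1': 'A man riding a horse', 'text2': 'A person on horseback', 'label': 0.9},
]
ds = Dataset.from_list(pairs)

angle = AnglE.from_pretrained('WhereIsAI/UAE-Large-V1',
                              max_length=128, pooling_strategy='cls')
train_ds = ds.shuffle().map(AngleDataTokenizer(angle.tokenizer, angle.max_length))

angle.fit(
    train_ds=train_ds,
    output_dir='ckpts/hard-cases',  # hypothetical output path
    batch_size=2,
    epochs=1,
    learning_rate=2e-5,
)
```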

yuanze1024 commented 9 months ago

OK, I see. Thank you for your really quick response.