VarIr / scikit-hubness

A Python package for hubness analysis and high-dimensional data mining
BSD 3-Clause "New" or "Revised" License

NGT performance #29

Open · VarIr opened this issue 4 years ago

VarIr commented 4 years ago

Approx. nearest-neighbor search with ngtpy can be accelerated:

  • [x] Enable AVX on macOS (temporarily disabled due to an upstream bug in NGT; it is already enabled on Linux.)
  • [x] Use NGT's optimization step (until then, the method is actually (P)ANNG, not ONNG, I assume). Currently, this seems to be possible only via the command-line tools, not via the Python API.
  • [ ] Set good default parameters for ONNG
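For reference, a minimal sketch (not the scikit-hubness wrapper itself) of how an (A)NNG index is typically built and queried through ngtpy; the index path and all parameter values (edge sizes, epsilon) are illustrative assumptions, not tuned defaults.

```python
# Minimal ngtpy sketch: build an (A)NNG index and run an approximate search.
# Index path and parameter values are illustrative, not tuned defaults.
import numpy as np
import ngtpy

X = np.random.rand(1000, 128).astype(np.float32)   # toy data

index_path = "anng_index"                           # hypothetical path
ngtpy.create(
    index_path,
    X.shape[1],
    distance_type="L2",             # "Cosine" is also available
    edge_size_for_creation=10,      # out-degree of the graph at build time
    edge_size_for_search=40,        # candidate edges explored per query
)
index = ngtpy.Index(index_path)
index.batch_insert(X)
index.save()

# Larger epsilon explores more of the graph: higher recall, slower queries.
for object_id, distance in index.search(X[0], size=10, epsilon=0.1):
    print(object_id, distance)
```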

VarIr commented 4 years ago

It seems ONNG can be enabled in ngtpy, but it is currently not documented. However, there is an example here: https://github.com/yahoojapan/NGT/issues/30

VarIr commented 4 years ago

New NGT release 1.7.10 should fix this: https://github.com/yahoojapan/NGT/releases/tag/v1.7.10

VarIr commented 4 years ago

NGT 1.8.0 brought docs for ONNG. It is already activated here, but index building is extremely slow due to difficult parameterization. Need to check.
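For context, ONNG is obtained in two stages: build an ANNG first, then run NGT's graph optimization over it. Below is a hedged sketch of that second stage, assuming the ngtpy.Optimizer interface documented since NGT 1.8.0 (no-argument constructor, set(), execute(in_path, out_path)); the degree values and keyword names are assumptions taken from upstream examples, which is exactly the parameterization problem mentioned above.

```python
# Hedged sketch: turn an existing ANNG index into an ONNG index.
# Assumes the ngtpy.Optimizer API documented since NGT 1.8.0; the degree
# values and keyword names below are assumptions, not recommended settings.
import ngtpy

anng_path = "anng_index"   # existing ANNG index (see the sketch above)
onng_path = "onng_index"   # hypothetical output path

optimizer = ngtpy.Optimizer()
optimizer.set(num_of_outgoings=10, num_of_incomings=120)  # assumed kwarg names
optimizer.execute(anng_path, onng_path)

onng = ngtpy.Index(onng_path)  # search the optimized index as usual
```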

jaytimbadia commented 3 years ago

Hi, this seems like really good work.

I am using BERT to find semantic similarity with cosine distance, but this may run into high-dimensionality problems. Can I use hubness reduction here, i.e. will it make the BERT embeddings any better?

Thank you!

VarIr commented 3 years ago

Thanks for your interest. That's something I've been thinking about, but never found time to actually check.

BERT embeddings are typically high-dimensional, so hubness might play a role. You could first estimate the intrinsic dimension of these embeddings (because this, rather than the embedding dimension, is what actually drives hubness), e.g. with this method. If it is much lower than the embedding dimension, it's unlikely that hubness reduction leads to improvements. Alternatively, you could directly compare performance in your tasks with and without hubness reduction. If there's a performance improvement, I'd be curious to know.
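A sketch of both checks under stated assumptions: the two-NN maximum-likelihood estimator below is just one of several intrinsic-dimension estimators (not necessarily the method linked above), and the k and metric choices for the hubness measurement are illustrative.

```python
# Sketch: estimate intrinsic dimension, then measure hubness of embeddings.
# The two-NN MLE is one of several ID estimators; k and metric are
# illustrative choices. Assumes no duplicate points (first NN distance > 0).
import numpy as np
from sklearn.neighbors import NearestNeighbors
from skhubness import Hubness


def twonn_id(X: np.ndarray) -> float:
    """Two-NN maximum-likelihood estimate of intrinsic dimension."""
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dist[:, 2] / dist[:, 1]          # ratio of 2nd to 1st NN distance
    return len(mu) / np.sum(np.log(mu))   # MLE under the Pareto model of mu


X = np.random.rand(1000, 768)             # stand-in for BERT embeddings

print("intrinsic dimension ~", twonn_id(X))

# Hubness (skewness of the k-occurrence distribution) in cosine space.
hub = Hubness(k=10, metric="cosine")
hub.fit(X)
print("k-skewness:", hub.score())
```

For the second comparison, the drop-in neighbors classes in skhubness.neighbors can be swapped into an existing scikit-learn pipeline with and without a hubness reduction method, to see whether downstream scores change.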

jaytimbadia commented 3 years ago

Thank you so much for the reply. I calculated the intrinsic dimension for BERT and it comes out to be 18, much lower than I expected. Anyway, one question: can we use intrinsic dimensionality to check the quality of the embeddings we generate? For example, BERT embeddings of shape (100, 768) have a pretty low intrinsic dimension, while a random matrix of the same shape (100, 768) came out around 155. Does that mean BERT is quite well trained?
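To make that comparison concrete, a small self-contained sketch (repeating the hypothetical twonn_id helper from the sketch above so it runs on its own) that estimates the intrinsic dimension of an unstructured random matrix of the same shape; real embeddings would be substituted for the random data.

```python
# Sketch: intrinsic dimension of a random (100, 768) matrix as a baseline.
# twonn_id repeats the hypothetical helper from the earlier sketch so this
# snippet is standalone; real embeddings would replace the random data.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def twonn_id(X: np.ndarray) -> float:
    dist, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dist[:, 2] / dist[:, 1]
    return len(mu) / np.sum(np.log(mu))


rng = np.random.default_rng(0)
random_matrix = rng.standard_normal((100, 768))   # unstructured baseline
print("random (100, 768):", twonn_id(random_matrix))

# bert_embeddings = ...                            # shape (100, 768)
# print("BERT  (100, 768):", twonn_id(bert_embeddings))
```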

If yes, we could use this: whenever we generate embeddings, we check their intrinsic dimension; the lower it is, the fewer constraints the embeddings carry and the easier they should be to fine-tune further, right?

I would love to know your thoughts!!

VarIr commented 3 years ago

18 isn't particularly high, but we've seen datasets where this came with high hubness (see e.g. pp. 2885–2886 of this previous paper). I am not aware of research directly linking intrinsic dimension to the quality (however that would be defined, anyway) of embeddings. Interesting research questions you pose there :)