Open lukavdplas opened 1 year ago
Given that the current approach is inefficient, but not too slow at the moment, let's keep this around as an idea. It would be quite a time investment, both for development and for all the indexing operations that would need to be triggered.
Agreed, the current loading times are fine. This may be worth it with larger models or new features.
Looking back at this issue & the Elasticsearch blog, I realize that this approach may be aimed at storing vectors on a per-document rather than per-word basis. So it could be used for #958 or #584, but not to show how the meaning of single words changed over time.
I think you're probably thinking of this elasticsearch blog post about text similarity search? That description is indeed more useful for the issues you linked. My proposal here described a rather different setup, I think.
Note that I suggested this:
Basically, you could make a parliament-uk-models index to accompany parliament-uk
My idea was that you create a separate index just for word models, where each document represents a single vector, plus the term and time interval it represents. Within this index, you can make queries for "documents" (but each document in this index represents a term) by making a knn query with a query vector.
So parliament-uk-models would just function as optimised storage for the word models that we know to be trained on parliament-uk, just like we now have separate storage as a directory of word2vec files.
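To make the knn query idea a bit more concrete, here is a rough sketch of what the search body could look like. It assumes the top-level `knn` search option from Elasticsearch 8.x and the field names from the proposal (`vector`, `term`, `date_start`, `date_end`); the function name `build_knn_query` and the default values for `k` and `num_candidates` are made up for illustration.

```python
from typing import Any, Dict, List


def build_knn_query(
    query_vector: List[float],
    date: str,
    k: int = 10,
    num_candidates: int = 100,
) -> Dict[str, Any]:
    """Build a search body asking for the k terms nearest to query_vector,
    restricted to the time bin that contains `date` (e.g. "1950-01-01")."""
    if k > num_candidates:
        raise ValueError("num_candidates must be at least k")
    return {
        "knn": {
            "field": "vector",
            "query_vector": query_vector,
            "k": k,
            "num_candidates": num_candidates,
            # Only consider vectors whose interval covers the requested date.
            "filter": {
                "bool": {
                    "filter": [
                        {"range": {"date_start": {"lte": date}}},
                        {"range": {"date_end": {"gte": date}}},
                    ]
                }
            },
        },
        # The hits only need to carry the term, not the vector itself.
        "_source": ["term"],
    }
```

The resulting dict would be sent to the search endpoint of the models index; the query vector itself would be the stored vector of the term we want neighbours for.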
The use case described by the elasticsearch blog is that we would store a vector for each speech in the parliament-uk index, based on word embeddings, transform a query string into a vector, and then use KNN to find texts with high similarity to the query. That would indeed be a completely different functionality from how we use word embeddings now.
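For contrast, a per-document vector could be derived from word embeddings roughly as below. This is only a sketch: averaging the word vectors is one common choice and an assumption here (the blog post may use a different encoder), and `word_vectors` is a plain dict standing in for a trained model. The same hypothetical `embed_text` function would be applied to both the speeches and the query string.

```python
from typing import Dict, List, Optional


def embed_text(
    text: str, word_vectors: Dict[str, List[float]]
) -> Optional[List[float]]:
    """Average the vectors of all known words in the text.

    Returns None when no word in the text has a vector."""
    tokens = text.lower().split()
    known = [word_vectors[token] for token in tokens if token in word_vectors]
    if not known:
        return None
    dims = len(known[0])
    # Component-wise mean over the vectors of the known words.
    return [sum(vector[i] for vector in known) / len(known) for i in range(dims)]
```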
All word models logic currently happens entirely in python, with vector-related logic handled by gensim.
We might consider storing this data in elasticsearch instead. Basically, you could make a parliament-uk-models index to accompany parliament-uk, with the following fields and field types:
- date_start: date
- date_end: date
- term: keyword
- vector: dense vector

You can use the vector field to request, say, the N nearest neighbours based on cosine similarity.
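As a rough sketch, the index definition and a single document could look something like this. It assumes the `dense_vector` field type as available in Elasticsearch 8.x, with `index` and `similarity` set so that the field supports nearest-neighbour search on cosine similarity. The function names and the example values are hypothetical; `dims` has to match the dimensionality of the trained models.

```python
from typing import Any, Dict, List


def build_models_mapping(dims: int) -> Dict[str, Any]:
    """Index definition for a word models index with one vector per document."""
    if dims < 1:
        raise ValueError("dims must be a positive integer")
    return {
        "mappings": {
            "properties": {
                "date_start": {"type": "date"},
                "date_end": {"type": "date"},
                "term": {"type": "keyword"},
                "vector": {
                    "type": "dense_vector",
                    "dims": dims,
                    # Index the vectors so they can be searched with kNN.
                    "index": True,
                    "similarity": "cosine",
                },
            }
        }
    }


def build_term_document(
    term: str, vector: List[float], date_start: str, date_end: str
) -> Dict[str, Any]:
    """One document: the vector for a single term in a single time interval."""
    return {
        "term": term,
        "vector": vector,
        "date_start": date_start,
        "date_end": date_end,
    }
```

Each term would then get one document per time interval, so a model with several time bins results in several documents for the same term.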
This could work a lot faster than our current approach. Elasticsearch allows you to use an HNSW algorithm which takes some time to index, but saves time during search.
@BeritJanssen, what do you think?