UUDigitalHumanitieslab / I-analyzer

The great textmining tool that obviates all others
https://ianalyzer.hum.uu.nl
MIT License

Store word models in elasticsearch? #1161

Open lukavdplas opened 1 year ago

lukavdplas commented 1 year ago

All word model logic currently happens entirely in Python, with vector-related operations handled by gensim.

We might consider storing this data in Elasticsearch instead. Basically, you could make a parliament-uk-models index to accompany parliament-uk, with the following fields + field types

You can use the vector field to request, say, the N nearest neighbours based on cosine similarity.

This could be a lot faster than our current approach. Elasticsearch can index vectors with an HNSW algorithm, which takes some time at indexing, but saves time during search.
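A sketch of what such a mapping might look like, written as a plain dict in the request-body format of the Python Elasticsearch client. The field names (term, time_interval, vector), the dimensionality, and the HNSW parameters are all illustrative assumptions, not existing I-analyzer code; the dense_vector type with cosine similarity and HNSW index options is Elasticsearch 8.x syntax.

```python
# Hypothetical mapping for a parliament-uk-models index.
# Field names and dims are assumptions for illustration only.
models_mapping = {
    "properties": {
        "term": {"type": "keyword"},           # the word this vector represents
        "time_interval": {"type": "keyword"},  # e.g. "1950-1960"
        "vector": {
            "type": "dense_vector",
            "dims": 300,                 # must match the word2vec dimensionality
            "index": True,               # build an HNSW graph at index time
            "similarity": "cosine",
            "index_options": {
                "type": "hnsw",
                "m": 16,                 # graph connectivity; memory/recall trade-off
                "ef_construction": 100,  # build-time accuracy/speed trade-off
            },
        },
    }
}

# With a running cluster, the index would be created along these lines:
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.indices.create(index="parliament-uk-models", mappings=models_mapping)
```

The `index: True` plus `index_options` part is where the HNSW trade-off mentioned above lives: the graph is built once at indexing time, so searches do not have to scan every vector.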

@BeritJanssen , what do you think?

BeritJanssen commented 10 months ago

Given that the current approach is inefficient, but not too slow at the moment, let's keep this around as an idea. It would be quite a time investment, both the development work and all the indexing operations that would need to be triggered.

lukavdplas commented 10 months ago

Agreed, the current loading times are fine. This may be worth it with larger models or new features.

BeritJanssen commented 7 months ago

Looking back at this issue & the Elasticsearch blog, I realize that this approach may be aimed at storing vectors on a per-document rather than per-word basis. So it could be used for #958 or #584 , but not to show how the meaning of single words changed over time.

lukavdplas commented 7 months ago

I think you're probably thinking of this Elasticsearch blog post about text similarity search? The approach it describes is indeed more useful for the issues you linked. My proposal here described a rather different setup, I think.

Note that I suggested this:

Basically, you could make a parliament-uk-models index to accompany parliament-uk

My idea was to create a separate index just for word models, where each document holds a single vector plus the term and time interval it represents. Within this index, you can query for "documents" (each of which represents a term) by making a kNN query with a query vector.
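Under that setup, a nearest-neighbour lookup for a term would be a single kNN search against the models index. A minimal sketch, assuming the index and field names from above (none of this is existing I-analyzer code); the request body uses the top-level knn option from Elasticsearch 8.x:

```python
def nearest_neighbour_query(query_vector, n=10, time_interval=None):
    """Build a request body returning the n terms closest to
    query_vector by cosine similarity, via the HNSW index."""
    knn = {
        "field": "vector",
        "query_vector": query_vector,
        "k": n,
        # candidates examined per shard; higher = better recall, slower search
        "num_candidates": max(10 * n, 100),
    }
    if time_interval is not None:
        # restrict neighbours to vectors trained on one time bin
        knn["filter"] = {"term": {"time_interval": time_interval}}
    return {"knn": knn, "_source": ["term", "time_interval"]}

# With a client, roughly:
# es.search(index="parliament-uk-models", **nearest_neighbour_query(vec, n=10))
```

The optional filter is what would make diachronic queries possible: the same term can appear once per time interval, and a filtered kNN search returns neighbours within a single interval.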

So parliament-uk-models would just function as optimised storage for the word models that we know to be trained on parliament-uk, just like we now have separate storage in the form of a directory of word2vec files.

The use case described by the Elasticsearch blog is that you would store a vector for each speech in the parliament-uk index, based on word embeddings, transform a query string into a vector, and then use kNN to find texts with high similarity to the query. That would indeed be completely different functionality from how we use word embeddings now.
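For contrast, the blog-post setup could be sketched like this: one vector per speech stored alongside the text in the main corpus index, with the query embedded at search time. Everything here is illustrative; embed() is a stand-in for whatever model turns text into a vector.

```python
# Per-document vectors in the main corpus index (the blog-post setup),
# as opposed to per-term vectors in a separate models index.

def speech_document(speech_text, embed):
    """An indexable document: the speech text plus one vector for it."""
    return {
        "content": speech_text,
        "content_vector": embed(speech_text),  # one vector per speech
    }

def similar_speeches_query(query_string, embed, k=10):
    """Embed the query string and find the k most similar speeches."""
    return {
        "knn": {
            "field": "content_vector",
            "query_vector": embed(query_string),  # query embedded at search time
            "k": k,
            "num_candidates": 100,
        }
    }
```

The structural difference is where the vectors live and what a hit means: here a hit is a speech retrieved by semantic similarity, whereas in the proposal above a hit is a term from a trained word model.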