CentreForDigitalHumanities / I-analyzer

https://ianalyzer.hum.uu.nl
MIT License

Store word models in elasticsearch? #1161

Open lukavdplas opened 1 year ago

lukavdplas commented 1 year ago

All word model logic currently happens entirely in Python, with vector-related operations handled by gensim.

We might consider storing this data in Elasticsearch instead. Basically, you could make a parliament-uk-models index to accompany parliament-uk, with the following fields and field types:
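The field list itself isn't shown above, but a minimal sketch of such a mapping could look like the following. The field names (`term`, `time_interval`, `vector`) and the dimensionality are assumptions for illustration, not taken from the issue:

```python
# Hypothetical mapping for a parliament-uk-models index.
# Field names and dims are illustrative assumptions.
mapping = {
    "mappings": {
        "properties": {
            "term": {"type": "keyword"},           # the word this vector represents
            "time_interval": {"type": "keyword"},  # e.g. "1980-1990"
            "vector": {
                "type": "dense_vector",
                "dims": 100,               # must match the trained model's dimensionality
                "index": True,             # build an HNSW graph for fast kNN search
                "similarity": "cosine",
            },
        }
    }
}
```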

You can use the vector field to request, say, the N nearest neighbours based on cosine similarity.

This could work a lot faster than our current approach. Elasticsearch lets you use an HNSW algorithm, which costs some extra time during indexing but saves time during search.
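A nearest-neighbour lookup against such an index could then be a kNN search request along these lines. This is a sketch under the same assumed field names; the query vector would come from the trained word model:

```python
# Sketch of an Elasticsearch 8.x kNN search body (field name "vector" is
# an assumption); the query_vector here is a placeholder embedding.
knn_query = {
    "knn": {
        "field": "vector",
        "query_vector": [0.1] * 100,  # placeholder: a real word vector from the model
        "k": 10,                      # the N nearest neighbours to return
        "num_candidates": 100,        # HNSW candidates considered per shard
    }
}
```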

@BeritJanssen , what do you think?

BeritJanssen commented 1 year ago

Given that the current approach is inefficient but not too slow at the moment, let's keep this around as an idea. It would be quite a time investment, both in development and in all the indexing operations that would need to be triggered.

lukavdplas commented 1 year ago

Agreed, the current loading times are fine. This may be worth it with larger models or new features.

BeritJanssen commented 9 months ago

Looking back at this issue & the Elasticsearch blog, I realize that this approach may be aimed at storing vectors on a per-document, rather than per-word basis. So it could be used for #958 or #584 , but not to show how meaning of single words changed over time.

lukavdplas commented 9 months ago

I think you're probably thinking of this elasticsearch blog post about text similarity search? The approach it describes is indeed more useful for the issues you linked. My proposal here described a rather different setup, I think.

Note that I suggested this:

Basically, you could make a parliament-uk-models index to accompany parliament-uk

My idea was that you create a separate index just for word models, where each document represents a single vector, plus the term and time interval it represents. Within this index, you can query for "documents" (each of which represents a term) by making a kNN query with a query vector.
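Concretely, a per-term "document" and a query restricted to one time interval could look like this. All field names and values here are hypothetical, following the same assumed mapping as above:

```python
# One document per term/interval pair (names and values are illustrative).
doc = {
    "term": "parliament",
    "time_interval": "1990-2000",
    "vector": [0.0] * 100,  # embedding from the model trained on that interval
}

# Nearest-neighbour query for terms, restricted to a single time interval,
# so that neighbours come from the same model.
query = {
    "knn": {
        "field": "vector",
        "query_vector": doc["vector"],
        "k": 5,
        "num_candidates": 50,
        "filter": {"term": {"time_interval": "1990-2000"}},
    }
}
```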

So parliament-uk-models would just function as optimised storage for the word models that we know to be trained on parliament-uk, just as we now have separate storage in a directory of word2vec files.

The use case described by the elasticsearch blog is that we would store a vector for each speech in the parliament-uk index, based on word embeddings, transform a query string into a vector, and then use KNN to find texts with high similarity to the query. That would indeed be a completely different functionality from how we use word embeddings now.
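For contrast, the blog's per-document setup would look roughly like this sketch, where `embed()` stands in for whatever step turns a text into a vector (e.g. averaging word embeddings or a sentence encoder); everything here is a hypothetical illustration, not existing I-analyzer code:

```python
# Sketch of the per-document similarity search from the Elasticsearch blog.
# embed() is a placeholder, not a real embedding model.
def embed(text: str) -> list[float]:
    # stand-in: a real implementation would average word vectors
    # or use a sentence encoder
    return [float(len(text) % 7)] * 100

# Each speech in parliament-uk would carry its own vector:
speech_doc = {
    "speech": "My honourable friend raised the question of taxation.",
    "speech_vector": embed("My honourable friend raised the question of taxation."),
}

# A query string is embedded the same way, then used for kNN over speeches:
search_body = {
    "knn": {
        "field": "speech_vector",
        "query_vector": embed("debates about taxation"),
        "k": 10,
        "num_candidates": 100,
    }
}
```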