enjalot / latent-scope

A scientific instrument for investigating latent spaces
MIT License
571 stars 19 forks

Un-embed clusters as an alternative to summarizing #22

Open enjalot opened 8 months ago

enjalot commented 8 months ago

With vec2text we should be able to get a reasonably useful sentence out of the average embedding of a cluster. This could serve as the cluster label, or perhaps as guidance for summarizing the label.
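To make "average embedding of a cluster" concrete, here is a minimal, dependency-free sketch. The function name `cluster_centroids` and its arguments are hypothetical, not part of latent-scope. Re-normalizing the mean to unit length is an assumption on my part: averaging unit-length vectors shrinks the norm, and an inversion model trained on unit-length embeddings might behave differently on shorter ones.

```python
from collections import defaultdict
from math import sqrt


def cluster_centroids(embeddings, labels, normalize=True):
    """Return {cluster_label: mean embedding} for parallel lists of vectors and labels.

    Hypothetical helper for illustration; not an existing latent-scope function.
    """
    if len(embeddings) != len(labels):
        raise ValueError("embeddings and labels must have the same length")

    # Group the vectors by their cluster label.
    groups = defaultdict(list)
    for vector, label in zip(embeddings, labels):
        groups[label].append(vector)

    centroids = {}
    for label, vectors in groups.items():
        dim = len(vectors[0])
        # Component-wise mean of all vectors in this cluster.
        mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
        if normalize:
            # Assumption: rescale to unit length, since the mean of unit vectors is shorter.
            norm = sqrt(sum(x * x for x in mean))
            if norm > 0:
                mean = [x / norm for x in mean]
        centroids[label] = mean
    return centroids
```

Each centroid could then be handed to an embedding-to-text model to get a candidate label.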

https://github.com/jxmorris12/vec2text/

There are pre-trained models, for example for OpenAI's text-embedding-ada-002, and perhaps for others. Part of this issue might be helping to train vec2text models for the other supported embedding models in our list.

One could imagine a new API endpoint that takes in an embedding vector and outputs a sentence. We could also have an alternative summarize script that uses this instead of (or in conjunction with) summarizing. We currently have a description field per cluster which is not really being used. It could be populated with this, or we could add another field.
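A rough sketch of what the endpoint logic and the description-filling step might look like, kept framework-agnostic. Everything here is hypothetical: `handle_unembed`, `fill_descriptions`, the `{"embedding": [...]}` request shape, and the `cluster` key are made up for illustration, and `invert` is a placeholder for whatever function ends up wrapping the embedding-to-text model.

```python
import json


def handle_unembed(body, invert):
    """Turn a JSON request body {"embedding": [...]} into a (status, JSON response) pair.

    `invert` is a placeholder callable: list of floats -> sentence.
    """
    try:
        payload = json.loads(body)
        raw = payload["embedding"]
        if not isinstance(raw, list) or not raw:
            raise ValueError("embedding must be a non-empty list")
        embedding = [float(x) for x in raw]
    except (ValueError, KeyError, TypeError):
        error = {"error": "expected JSON with an 'embedding' list of numbers"}
        return 400, json.dumps(error)
    return 200, json.dumps({"text": invert(embedding)})


def fill_descriptions(clusters, centroids, invert):
    """Return copies of the cluster records with 'description' set from each centroid.

    Assumes each record has a hypothetical 'cluster' key that indexes `centroids`.
    """
    return [
        {**cluster, "description": invert(centroids[cluster["cluster"]])}
        for cluster in clusters
    ]
```

Passing `invert` in as an argument keeps the model swappable, so the same code path could be tried with different inversion models, or with a stub while testing.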

dhruv-anand-aintech commented 8 months ago

Could BERTopic also be a viable alternative to using an LLM for summarization/topic name generation?