castorini / ura-projects


Visualize NFCorpus embeddings #13

Open lintool opened 9 months ago

lintool commented 9 months ago

Building on this: https://github.com/castorini/pyserini/blob/master/docs/experiments-nfcorpus.md

Maybe we can write up a guide on how to visualize the embeddings?

E.g., https://github.com/openai/openai-cookbook/blob/main/examples/Visualizing_embeddings_in_2D.ipynb and https://docs.cohere.com/docs/semantic-search

Maybe build a Colab notebook that demonstrates this?

MojTabaa4 commented 9 months ago

Hi @lintool, I've just completed the IR onboarding path and submitted this PR for it. Can I start this project as the next step?

lintool commented 9 months ago

Hi @MojTabaa4 - sure, if you're interested, work on this task!

lintool commented 9 months ago

Additional helpful links: https://github.com/cohere-ai/notebooks#6-visualizing-text-embeddings https://github.com/cohere-ai/notebooks#7-clustering-hacker-news-posts https://github.com/cohere-ai/notebooks#9-topic-modeling-of-ai-papers-in-2022

MojTabaa4 commented 9 months ago

Hi @lintool, I've created this notebook, which visualizes the embeddings with both Matplotlib and Altair, using two dimensionality reduction methods: t-SNE and UMAP. For UMAP, after visualizing the embeddings, I clustered the data with the K-Means algorithm and used TF-IDF to extract meaningful keywords (the 10 with the highest TF-IDF scores) from the documents within each cluster.

The extracted keywords are then used for visualization. Each data point in the chart represents a medical document and is color-coded by the keywords associated with its cluster.
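For anyone following along, the per-cluster keyword step described above could look roughly like the sketch below. It is not taken from the notebook: the function name `top_keywords_per_cluster`, the simple regex tokenizer, the smoothed IDF formula, and the toy documents are all assumptions made for illustration, and it uses only the standard library so it runs without the embedding or clustering dependencies. It assumes cluster labels (e.g., from K-Means) have already been computed.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Hypothetical tokenizer: lowercase alphabetic tokens only.
    return re.findall(r"[a-z]+", text.lower())


def top_keywords_per_cluster(docs, labels, top_k=10):
    """Return {cluster_label: [keywords]} ranked by summed TF-IDF.

    `labels[i]` is the cluster id already assigned to `docs[i]`.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)

    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))

    # Smoothed IDF (an assumption): ln((1 + n) / (1 + df)) + 1.
    idf = {t: math.log((1 + n_docs) / (1 + c)) + 1 for t, c in df.items()}

    # Sum each term's TF-IDF over all documents in the same cluster.
    scores = defaultdict(Counter)
    for toks, label in zip(tokenized, labels):
        if not toks:
            continue
        for term, count in Counter(toks).items():
            scores[label][term] += (count / len(toks)) * idf[term]

    # Sort by score (descending), breaking ties alphabetically.
    return {
        label: [t for t, _ in sorted(c.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]]
        for label, c in scores.items()
    }


if __name__ == "__main__":
    # Toy documents and labels, made up for illustration.
    toy_docs = [
        "vitamin d and bone health",
        "vitamin d deficiency bone",
        "cancer risk and diet",
        "diet cancer prevention",
    ]
    toy_labels = [0, 0, 1, 1]
    for cluster, words in top_keywords_per_cluster(toy_docs, toy_labels, top_k=3).items():
        print(cluster, words)
```

The resulting keyword list for each cluster can then be joined into a string and used as the color or legend field in the chart.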

lintool commented 9 months ago

Hey @MojTabaa4, take a look at this: https://github.com/castorini/ura-projects/issues/2#issuecomment-1732585634

Try visualizing the ACM Fellow citations?