HazyResearch / bootleg

Self-Supervision for Named Entity Disambiguation at the Tail
http://hazyresearch.stanford.edu/bootleg
Apache License 2.0

Static embeddings are similar #63

Closed dukesun99 closed 3 years ago

dukesun99 commented 3 years ago

When I extract the static embeddings using the code in entity_embedding_tutorial.ipynb, I get mostly the same embedding for all entities (cell 14, in embedding_as_tensor). By "mostly the same" I mean the cosine similarity between all pairs of entity embeddings is higher than 0.99. I suspect I might have some misunderstanding.

If I want to use entity embeddings from Bootleg, is this the correct way to extract them?

Any help is appreciated. Thank you.
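For reference, the similarity check looks roughly like this (a minimal sketch; a random matrix stands in for the notebook's `embedding_as_tensor`):

```python
import numpy as np

# Stand-in for the entity embedding matrix from the notebook
# (the real one is num_entities x hidden_dim).
embs = np.random.randn(100, 64).astype(np.float32)

# Normalize rows; pairwise cosine similarity is then a matrix product.
norms = np.linalg.norm(embs, axis=1, keepdims=True)
unit = embs / norms
cos_sim = unit @ unit.T

# Fraction of off-diagonal pairs with similarity above 0.99.
off_diag = cos_sim[~np.eye(len(cos_sim), dtype=bool)]
frac_high = float((off_diag > 0.99).mean())
print(frac_high)
```

With the trained static embeddings, nearly all off-diagonal pairs exceed 0.99, which is what prompted this issue.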

lorr1 commented 3 years ago

Hello,

The static embeddings are trained to be contextualized by the attention layers, so the values being similar is an artifact of how they were trained in the model. By themselves, the static entity embeddings are less useful because they are meant to be contextualized.

Since static entity embeddings are still useful, I have a new tutorial notebook here that shows you how to generate a better static entity embedding. The ideal approach is to contextualize it with the sentence, but if you just need the entity embeddings without text, that notebook should help. The notebook generates a sentence representing a single entity and feeds it through Bootleg. You will need to be on version 1.0.4 for it to work.

If you need help running it in batches (which can be faster depending on how many entities you need to generate), let me know. I can modify the tutorial to show that.
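The batching itself is just chunking the entity list and feeding each chunk through the model in one forward pass; the chunking part can be sketched generically (the QIDs here are placeholders, and the model call is omitted):

```python
def chunks(items, batch_size):
    """Yield successive fixed-size batches from a list."""
    for i in range(0, len(items), batch_size):
        yield items[i:i + batch_size]

# Placeholder QIDs; in practice this is the list of entities to embed.
qids = [f"Q{i}" for i in range(10)]
batches = list(chunks(qids, 4))
print(batches)
```

Each batch would then be turned into sentences and run through Bootleg together instead of one entity at a time.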

dukesun99 commented 3 years ago

Hi, thank you for the update. I tested the notebook, and it seems the memory requirement is larger than 40 GB, so I need to find a larger server to run the script. Can you confirm whether the notebook runs with 50 GB of free memory?

lorr1 commented 3 years ago

Hello,

Running the notebook on my setup needs around 53 GB of memory, so I'd go with 60 GB if possible. Do you need embeddings for all Wikipedia entities? I can also just generate a dump for you and upload it if that would be useful. If you are adding custom entities, then you'll need to run the notebook yourself.

Let me know.

dukesun99 commented 3 years ago

@lorr1 I would be grateful if you could generate a complete dump for me. I found a larger server that can run the notebook, but it is still very slow to generate a complete dump. I'd appreciate your help! Thanks.

lorr1 commented 3 years ago

Just a quick update. I am still parsing these and should have them in 2 days. There was a slight issue with around 2 entities I had to debug.

lorr1 commented 3 years ago

Hey,

So I have some embeddings ready for you to test out here. I will push new versions of the extractor soon to update how I did it.

Here is a snippet to get the embeddings (imports added; the `EntityProfile` import path is from bootleg 1.0.x):

import numpy as np
from bootleg.symbols.entity_profile import EntityProfile

# Load the dumped embedding matrix (one row per entity).
embs = np.load("ent_embeddings_bootleg_uncased.npy")

# Map a Wikidata QID to its row id via the entity profile.
ep = EntityProfile.load_from_cache("tutorial_data/data/entity_db")
qid = "Q52"
row_id = ep.get_eid(qid)
emb = embs[row_id]
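Once the dump is loaded, comparing two entities reduces to a row lookup plus cosine similarity. A sketch, with a toy matrix and a plain dict standing in for the `.npy` dump and `EntityProfile.get_eid`:

```python
import numpy as np

# Toy stand-ins: a small embedding matrix and a QID -> row-id map.
# In practice, embs comes from np.load on the dump and eid from EntityProfile.
embs = np.random.randn(5, 8).astype(np.float32)
eid = {"Q52": 0, "Q30": 1}

def entity_sim(qid_a, qid_b):
    """Cosine similarity between two entities' static embeddings."""
    a, b = embs[eid[qid_a]], embs[eid[qid_b]]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sim = entity_sim("Q52", "Q30")
```

The same lookup pattern works for nearest-neighbor search over the full matrix.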