Closed dukesun99 closed 3 years ago
Hello,
So the static embeddings are trained to be contextualized by the attention layers, which means the values being similar across entities is an artifact of how they were trained in the model. By themselves, the static entity embeddings are less useful, since they are meant to be refined by context.
Since generating static entity embeddings is still useful, I have a new tutorial notebook here that shows how to generate a better static entity embedding. The ideal approach is to contextualize the entity with its sentence, but if you just need entity embeddings without text, that notebook should help. The notebook essentially generates a sentence representing a single entity and feeds it through Bootleg. You will need to be on 1.0.4 for it to work.
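As a rough illustration of the idea, here is a minimal sketch of building a probe sentence for a single entity. The template and the helper name `build_probe_sentence` are my own illustrative assumptions, not the notebook's actual code:

```python
def build_probe_sentence(title: str) -> str:
    # Hypothetical template: a short sentence mentioning only this entity,
    # so the contextualized output reflects the entity alone. The real
    # notebook may use a different sentence template.
    return f"{title} is an entity."

# Example: build the probe sentence for an entity's title, then feed it
# through Bootleg and take that mention's output embedding.
sentence = build_probe_sentence("Wikipedia")
print(sentence)  # Wikipedia is an entity.
```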
If you need help running it in batches (which can be faster depending on the entities you need to generate), let me know. I can modify the tutorial to show that.
Hi, thank you for the update. I tested the notebook, and it seems the memory requirement is larger than 40 GB, so I need to find a larger server to run the script. Can you confirm whether the notebook is runnable with 50 GB of free memory?
Hello,
So running the notebook on my setup requires around 53 GB of memory, so I'd go with 60 GB if possible. Do you need embeddings for all Wikipedia entities? I can also generate a dump for you and upload it if that would be useful. If you are adding custom entities, then you'll need to run it yourself.
Let me know.
@lorr1 I would be grateful if you could generate a complete dump for me. I found a larger server that can run the notebook, but it is still very slow to generate a complete dump. I'd appreciate it if you could help! Thanks.
Just a quick update: I am still parsing these and should have them in 2 days. There was a slight issue with a couple of entities that I had to debug.
Hey,
So I have some embeddings ready for you to test out here. I will push new versions of the extractor soon to update how I did it.
Here is example code to get the embeddings (this assumes Bootleg 1.0.4 is installed and the tutorial entity DB is at tutorial_data/data/entity_db):

import numpy as np
from bootleg.symbols.entity_profile import EntityProfile

# Load the dumped static embedding matrix (one row per entity).
embs = np.load("ent_embeddings_bootleg_uncased.npy")
# The entity profile maps a QID to its row index (eid) in the matrix.
ep = EntityProfile.load_from_cache("tutorial_data/data/entity_db")
qid = "Q52"
row_id = ep.get_eid(qid)
emb = embs[row_id]
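If you need embeddings for many QIDs at once, NumPy fancy indexing on the same matrix avoids a Python loop over rows. A runnable sketch with a stand-in matrix and a stand-in QID-to-eid mapping (in practice the mapping comes from `EntityProfile.get_eid`):

```python
import numpy as np

# Stand-ins so the sketch runs without the real dump or entity DB.
embs = np.arange(12, dtype=np.float32).reshape(4, 3)  # 4 entities, dim 3
qid_to_eid = {"Q52": 0, "Q30": 2}                     # illustrative eids

qids = ["Q52", "Q30"]
row_ids = np.array([qid_to_eid[q] for q in qids])
batch = embs[row_ids]   # shape (2, 3): one embedding row per QID
print(batch.shape)      # (2, 3)
```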
When I extract the static embeddings using the code in entity_embedding_tutorial.ipynb, I get mostly the same embedding for all entities (cell 14, in embedding_as_tensor). By "mostly the same" I mean the cosine similarity between all entity embeddings is higher than 0.99. I suspect I might have some misunderstanding. If I want to use entity embeddings from Bootleg, is this the correct way to extract them?
Any help is appreciated. Thank you.
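For reference, the pairwise cosine-similarity check described above can be written as follows. The random matrix is a stand-in for the extracted embeddings so the snippet runs on its own; with the real static embeddings the off-diagonal values reportedly all exceed 0.99:

```python
import numpy as np

def pairwise_cosine(embs: np.ndarray) -> np.ndarray:
    """All-pairs cosine similarity between the rows of embs."""
    unit = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    return unit @ unit.T

rng = np.random.default_rng(0)
embs = rng.normal(size=(5, 8))   # stand-in for real entity embeddings
sims = pairwise_cosine(embs)
print(sims.shape)                        # (5, 5)
print(np.allclose(np.diag(sims), 1.0))  # True: self-similarity is 1
```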