AnacletoLAB / grape

🍇 GRAPE is a Rust/Python Graph Representation Learning library for Predictions and Evaluations

Methods for generating node embeddings from word embeddings #8

Open · caufieldjh opened 2 years ago

caufieldjh commented 2 years ago

While updating NEAT to use the most recent grape release, @justaddcoffee, @hrshdhgd, and I took a look at what we're using to generate node embeddings based on pretrained word embeddings such as BERT: https://github.com/Knowledge-Graph-Hub/NEAT/blob/main/neat/graph_embedding/graph_embedding.py

We know we can run something like get_okapi_tfidf_weighted_textual_embedding() on a graph, but is there a more "on demand" way to run this in grape now for an arbitrary graph?

justaddcoffee commented 2 years ago

Thanks @caufieldjh - specifically what we are looking for @LucaCappelletti94 @zommiommy is something like this:

g = Ensmallen.from_csv(**my_graph_params)
my_embeddings = get_okapi_tfidf_weighted_textual_embedding(g)

If I understand correctly (which I might not), the only way to do this now is:

get_okapi_tfidf_weighted_textual_embedding("KGCOVID19") # <- goes to KG-Hub and downloads graph files, gets text from nodes file, and gets embeddings from name and description columns

LucaCappelletti94 commented 2 years ago

Hello @justaddcoffee and @caufieldjh, while there are methods already parametrized for the various repositories, the one you have reported here is the most generic one and does not work on graphs, but on generic CSVs. It requires the path of the CSV to parse: you can see its documentation either by using Python's built-in help function or by using the SHIFT+TAB shortcut in a Jupyter notebook.
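
For example, a minimal sketch of a direct call on a local file, assuming only that the first argument is the path of the CSV to parse, as described above, and that the function is importable from ensmallen.datasets; check help(get_okapi_tfidf_weighted_textual_embedding) for the full signature:

from ensmallen.datasets import get_okapi_tfidf_weighted_textual_embedding

# "my_graph_nodes.csv" is a hypothetical nodes file with textual columns
# such as name and description.
embedding = get_okapi_tfidf_weighted_textual_embedding("my_graph_nodes.csv")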

justaddcoffee commented 2 years ago

Okay, great - thanks @LucaCappelletti94

@caufieldjh can you have a look and see if this provides what we need in NEAT to switch to Grape for text embeddings? I think it should

caufieldjh commented 2 years ago

It looks like it should work, though there is some kind of name collision between Embiggen's transformers submodule and the Hugging Face transformers package that provides the tokenizer:

>>> get_okapi_tfidf_weighted_textual_embedding(path)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/harry/neat-env/lib/python3.8/site-packages/cache_decorator/cache.py", line 613, in wrapped
    result = function(*args, **kwargs)
  File "/home/harry/neat-env/lib/python3.8/site-packages/ensmallen/datasets/get_okapi_tfidf_weighted_textual_embedding.py", line 88, in get_okapi_tfidf_weighted_textual_embedding
    from transformers import AutoTokenizer
ImportError: cannot import name 'AutoTokenizer' from 'transformers' (/home/harry/neat-env/lib/python3.8/site-packages/embiggen/transformers/__init__.py)
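
A generic way to confirm which package is shadowing the import (not specific to grape) is to check where Python resolves the name transformers:

import transformers

# With the collision above in effect, this prints a path ending in
# embiggen/transformers/__init__.py instead of the Hugging Face package.
print(transformers.__file__)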

LucaCappelletti94 commented 2 years ago

That's extremely odd, I'll look into it.

LucaCappelletti94 commented 2 years ago

OK, so I have managed to reproduce it and tried to resolve this collision for a while. It has turned out to be quite cursed, so I will fall back to the "I'm just going to rename that" option.

I'm thinking about what name could fit it better. It's the submodule that, given a node embedding and a graph, gets you the edge embedding or the like. A name like graph_processing seems too vague. Do you have any proposals?
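
As a generic illustration of what the submodule does (plain numpy, not grape's actual API), an edge embedding can be built by combining the endpoint node vectors, e.g. by concatenation:

import numpy as np

# Five nodes, each with a 16-dimensional embedding (random stand-in data).
node_embedding = np.random.rand(5, 16)
# Edge list as (source, destination) node ID pairs.
edges = np.array([[0, 1], [2, 3]])
# Concatenate the source and destination vectors to get one vector per edge.
edge_embedding = np.concatenate(
    [node_embedding[edges[:, 0]], node_embedding[edges[:, 1]]],
    axis=1,
)  # shape: (2, 32)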

LucaCappelletti94 commented 2 years ago

Maybe embedding_transformers?

LucaCappelletti94 commented 2 years ago

I have renamed it for now from transformers to embedding_transformers. If we can find a better name, I'm absolutely up for it. At least for now there won't be a collision.
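
For downstream code, the change amounts to updating the import path along these lines (EdgeTransformer is only an illustrative, assumed class name from that submodule):

# Before the rename this collided with the Hugging Face package:
# from embiggen.transformers import EdgeTransformer
from embiggen.embedding_transformers import EdgeTransformer  # hypothetical class name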

caufieldjh commented 2 years ago

I think that should work fine; at least, I can't see a package on PyPI with that name, so it shouldn't create the same kind of collision.

LucaCappelletti94 commented 2 years ago

This issue should now be resolved; @caufieldjh, could you confirm?