caufieldjh opened this issue 2 years ago
Thanks @caufieldjh - specifically what we are looking for @LucaCappelletti94 @zommiommy is something like this:
g = Ensmallen.from_csv(**my_graph_params)
my_embeddings = get_okapi_tfidf_weighted_textual_embedding(g)
If I understand correctly (which I might not), the only way to do this now is:
get_okapi_tfidf_weighted_textual_embedding("KGCOVID19") # <- goes to KG-Hub and downloads graph files, gets text from nodes file, and gets embeddings from name and description columns
Hello @justaddcoffee and @caufieldjh, while there are methods already parametrized for the various repositories, the one you have reported here is the most generic one and does not work on graphs, but on generic CSVs. It requires the path of the CSV to parse: you can see its documentation either by using the Python `help` function or by using the SHIFT+TAB shortcut in a Jupyter notebook.
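For reference, both inspection routes look like this in practice. The function defined below is a toy stand-in, since the real `get_okapi_tfidf_weighted_textual_embedding` lives in `ensmallen.datasets`:

```python
# Two ways to read a function's documentation without leaving Python.
# This is a toy stand-in for the real
# ensmallen.datasets.get_okapi_tfidf_weighted_textual_embedding.
import inspect

def get_okapi_tfidf_weighted_textual_embedding(path, **kwargs):
    """Toy stand-in: the real function parses the CSV at `path`."""
    return path

# Shows the parameter list:
print(inspect.signature(get_okapi_tfidf_weighted_textual_embedding))
# Prints the full docstring (what SHIFT+TAB shows in Jupyter):
help(get_okapi_tfidf_weighted_textual_embedding)
```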
Okay, great - thanks @LucaCappelletti94
@caufieldjh can you have a look and see if this provides what we need in NEAT to switch to Grape for text embeddings? I think it should
It looks like it should work, though there is some kind of name collision between Embiggen's `transformers` submodule and the `transformers` library providing the tokenizer:
>>> get_okapi_tfidf_weighted_textual_embedding(path)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/harry/neat-env/lib/python3.8/site-packages/cache_decorator/cache.py", line 613, in wrapped
result = function(*args, **kwargs)
File "/home/harry/neat-env/lib/python3.8/site-packages/ensmallen/datasets/get_okapi_tfidf_weighted_textual_embedding.py", line 88, in get_okapi_tfidf_weighted_textual_embedding
from transformers import AutoTokenizer
ImportError: cannot import name 'AutoTokenizer' from 'transformers' (/home/harry/neat-env/lib/python3.8/site-packages/embiggen/transformers/__init__.py)
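The traceback above is the classic shadowing failure: two packages claim the module name `transformers`, and `from transformers import AutoTokenizer` resolves to whichever one Python finds first. A minimal, self-contained reproduction (the package contents here are toy stand-ins, not the real libraries):

```python
# Reproduce the shadowing behind the traceback: a package that lacks
# AutoTokenizer wins the name "transformers" on sys.path, so the import
# fails even if the real library is installed elsewhere.
import os
import sys
import tempfile

root = tempfile.mkdtemp()

# Stand-in for a module that reuses the name but has no AutoTokenizer.
shadow = os.path.join(root, "transformers")
os.makedirs(shadow)
with open(os.path.join(shadow, "__init__.py"), "w") as f:
    f.write("GraphTransformer = object\n")  # no AutoTokenizer here

sys.path.insert(0, root)               # this copy now wins the lookup
sys.modules.pop("transformers", None)  # drop any cached import

err = None
try:
    from transformers import AutoTokenizer  # resolves to the shadow package
except ImportError as e:
    err = e
    print("ImportError:", e)
```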
That's extremely odd, I'll look into it.
Ok so, I have managed to reproduce it and tried to resolve this collision for a while. This has turned out to be quite cursed, so I will fall back to the "I'm just going to rename that" option.
I'm thinking about what name could fit better. It's the submodule that, given a node embedding and a graph, gets you the edge embedding or the like. A name like `graph_processing` seems too vague. Do you have any proposals?
Maybe `embedding_transformers`?
I have renamed it for now from `transformers` to `embedding_transformers`. If we can find a better name, I'm absolutely up for it. At least for now there won't be a collision.
I think that should work fine - at least I can't see a package on PyPI with that name, so it shouldn't create the same kind of collision.
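One quick way to sanity-check a candidate name before committing to it is to ask the import machinery whether the name already resolves in the current environment. This only catches packages installed locally, not everything on PyPI, but it flags the kind of collision seen above:

```python
# Check whether a top-level module name is already importable.
import importlib.util

def module_name_is_taken(name: str) -> bool:
    """Return True if `name` already resolves to an importable module."""
    return importlib.util.find_spec(name) is not None

# "os" ships with Python, so it is always taken; a fresh name should be free.
print(module_name_is_taken("os"))
print(module_name_is_taken("embedding_transformers"))
```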
This issue should be now resolved, @caufieldjh could you confirm?
While updating NEAT to use the most recent grape release, @justaddcoffee, @hrshdhgd, and I took a look at what we're using to generate node embeddings based on pretrained word embeddings like BERT etc.: https://github.com/Knowledge-Graph-Hub/NEAT/blob/main/neat/graph_embedding/graph_embedding.py
We know we can run something like `get_okapi_tfidf_weighted_textual_embedding()` on a graph, but is there a more "on demand" way to run this in grape now for an arbitrary graph?
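For readers who want the "on demand" behaviour before an official API lands, the gist of an Okapi-BM25-weighted textual embedding can be sketched from scratch: each node's vector is the BM25-weighted average of the embeddings of the tokens in its name/description. Everything below — the toy documents, the 2-d token vectors, the helper names — is illustrative, not grape's actual implementation:

```python
# Sketch of a BM25 (Okapi) weighted textual embedding over toy node texts.
import math
from collections import Counter

K1, B = 1.5, 0.75  # standard Okapi BM25 hyperparameters

# Toy node texts standing in for a nodes file's name/description columns.
docs = {
    "node1": "acute respiratory syndrome",
    "node2": "respiratory infection",
}
tokenized = {n: t.split() for n, t in docs.items()}
avgdl = sum(len(t) for t in tokenized.values()) / len(tokenized)
df = Counter(tok for toks in tokenized.values() for tok in set(toks))
N = len(tokenized)

# Toy 2-d token embeddings; a real pipeline would use BERT or similar.
token_vecs = {
    "acute": [1.0, 0.0], "respiratory": [0.0, 1.0],
    "syndrome": [0.5, 0.5], "infection": [0.2, 0.8],
}

def bm25_weight(tok, toks):
    """Okapi BM25 weight of one token within one node's text."""
    tf = toks.count(tok)
    idf = math.log(1 + (N - df[tok] + 0.5) / (df[tok] + 0.5))
    return idf * tf * (K1 + 1) / (tf + K1 * (1 - B + B * len(toks) / avgdl))

def node_embedding(name):
    """BM25-weighted average of the node's token vectors."""
    toks = tokenized[name]
    weights = [bm25_weight(t, toks) for t in toks]
    total = sum(weights)
    dim = len(next(iter(token_vecs.values())))
    out = [0.0] * dim
    for w, t in zip(weights, toks):
        for i in range(dim):
            out[i] += w * token_vecs[t][i] / total
    return out

print({n: node_embedding(n) for n in docs})
```

Since the weights are normalized to sum to one, each node vector is a convex combination of its token vectors, so it stays in the same range as the token embeddings.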