Open acxcv opened 1 year ago
I forgot to mention that the code executes flawlessly with the original entities from countries.py, regardless of the above changes to rdf2vec.py.
However, in my example with custom entities, if I use a subset of the custom entities list, entities[:22], a different error occurs:
File "/rdf2vec/r2venv/lib/python3.9/site-packages/pyrdf2vec/embedders/word2vec.py", line 73, in transform
    raise ValueError(
ValueError: The entities must have been provided to fit() first before they can be transformed into a numerical vector.
To recap:
- countries.py works in either case.
- countries.py code with a custom entities list (len 31) causes the above IndexErrors. The exact error depends on whether attr[pos] = ... has been modified or not.
- countries.py code from above with a shorter entities list (len 22), like in countries.py, causes either
  - IndexError, depending on whether the line from Part 1 has been modified or not, OR
  - ValueError, with the changes from Part 1 and 2.
Does anybody know what's going on here?
I don't have much bandwidth to look at this atm. But does padding the list with some dummy entities fix the issue?
Do your entities occur in the KG? Shouldn't it be https instead of http, for instance? Maybe test if you can extract walks for a single entity?
Hi Gilles,
Thanks for your reply.
The problem was simply that there were duplicates in the entities list.
OK, thanks for the update! I will re-open the issue, however, as that is something we could detect for users and raise a clearer error for!
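Until such a check exists in the library, duplicates can be caught before the list is passed to fit_transform. The following is only a minimal sketch in plain Python: the helper names find_duplicates and deduplicate are hypothetical (they are not part of pyrdf2vec), and the URIs are made-up examples.

```python
from collections import Counter


def find_duplicates(entities):
    """Return the entities that occur more than once, in first-seen order."""
    counts = Counter(entities)
    return [entity for entity, n in counts.items() if n > 1]


def deduplicate(entities):
    """Drop repeated entities while keeping the original order."""
    return list(dict.fromkeys(entities))


# Made-up example URIs, for illustration only.
entities = [
    "http://dbpedia.org/resource/Belgium",
    "http://dbpedia.org/resource/France",
    "http://dbpedia.org/resource/Belgium",
]

duplicates = find_duplicates(entities)
if duplicates:
    print(f"Duplicate entities found: {duplicates}")

# Pass this list on instead of the original one.
unique_entities = deduplicate(entities)
```

Using dict.fromkeys keeps the first occurrence of each entity, so the order of the resulting embeddings still matches the order of the de-duplicated list.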
Hi!
Any updates on this subject? I am running into similar issues. The relevant portion of my code is as follows:
55 def fit_embedding(transformer, knowledge_graph, nodes, epochs_list, rep, sub_dir):
56     """
57
58     :param transformer: The RDF2VecTransformer used for making the embeddings
59     :param knowledge_graph: Instance of RDF2Vec.Graph that is to be embedded
60     :param nodes: Instances from which an embedding should be made. Should be a list of strings.
61     :param epochs_list: List of epochs at which the embedding should be saved
62     :param rep: The current repetition of the embedding. Sometimes multiple embeddings of the same graph are made, and
63         this is needed for saving the embedding to the right directory
64     :param sub_dir: subdirectory to which the embedding should be saved.
65     :return:
66     """
67     # loss_df = pd.DataFrame(columns=['epoch', 'loss'])
68     print('Starting fitting of word2vec embedding:')
69
70     bar = progressbar.ProgressBar(maxval=max(epochs_list), widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
71     bar.start()
72     walks = transformer.get_walks(knowledge_graph, nodes)
73     for e in range(max(epochs_list)):
74         transformer.embedder.fit(walks, False)
75         if (e+1) in epochs_list:
76             embeddings, literals = transformer.transform(knowledge_graph, nodes)
77             save_embeddings(embeddings, literals, e+1, rep, sub_dir)
78     bar.finish()
79     return
Note that the nodes object is exactly the same for the transformer.get_walks() and both transformer.transform() calls.
This piece of code produces the following error:
File "/create_embedding.py", line 76, in fit_embedding
    embeddings, literals = transformer.transform(knowledge_graph, nodes)
File "/.pyenv/versions/KRW_project-3.10.4/lib/python3.10/site-packages/pyrdf2vec/rdf2vec.py", line 214, in transform
    embeddings = self.embedder.transform(entities)
File "/.pyenv/versions/KRW_project-3.10.4/lib/python3.10/site-packages/pyrdf2vec/embedders/word2vec.py", line 73, in transform
    raise ValueError(
ValueError: The entities must have been provided to fit() first before they can be transformed into a numerical vector.
The initialization of the RDF2Vec transformer is done using:
from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.walkers import RandomWalker

def init_transformer(seed):
    # Create our transformer, setting the embedding & walking strategy.
    transformer = RDF2VecTransformer(
        Word2Vec(epochs=1, workers=10),
        walkers=[RandomWalker(4, 10, with_reverse=True, n_jobs=10, random_state=seed)],
        verbose=2
    )
    return transformer
At this point, I'm not sure where to look for a potential cause for this error. Note that when the RandomWalker is initialized with with_reverse=False, the script runs without throwing any errors (although I have yet to confirm that it produces meaningful embeddings).
Any suggestions are welcome!
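One way to narrow this down: the error message says that some entity was not provided to fit(), so it may help to check whether every node actually shows up as a token in the extracted walks. This is an assumption about the cause, not something confirmed in this thread. The sketch below is plain Python; the helper names are hypothetical, and it assumes the walks are nested lists or tuples of string tokens (the toy data stands in for the output of transformer.get_walks()).

```python
def iter_tokens(walks):
    """Yield every string token from arbitrarily nested lists/tuples of walks."""
    for item in walks:
        if isinstance(item, str):
            yield item
        else:
            yield from iter_tokens(item)


def entities_missing_from_walks(walks, entities):
    """Return the entities that never occur as a token in any walk."""
    seen = set(iter_tokens(walks))
    return [entity for entity in entities if entity not in seen]


# Toy data standing in for real walks and nodes (made up for illustration).
toy_walks = [
    [("ex:a", "ex:p", "ex:b"), ("ex:a", "ex:q", "ex:c")],
    [("ex:b", "ex:p", "ex:c")],
]
toy_nodes = ["ex:a", "ex:b", "ex:d"]

missing = entities_missing_from_walks(toy_walks, toy_nodes)
print(missing)  # ['ex:d']
```

If the real call returns a non-empty list, those nodes are the first candidates to inspect, for example by comparing their representation in the walks with the strings in nodes.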
🐛 Bug
When trying to create embeddings for a custom list of DBPedia entities using RDF2VecTransformer.fit_transform, I'm encountering the following bug in RDF2VecTransformer._update:
Part 1:
File "/r2venv/lib/python3.9/site-packages/pyrdf2vec/rdf2vec.py", line 271, in _update
    attr[pos] = tmp.pop(self._pos_walks[i])
IndexError: list assignment index out of range
Because attr[pos] = tmp.pop(self._pos_walks[i]) tries to assign a value to an empty list, attr, at index pos, I tried changing it to attr.insert(pos, tmp.pop(self._pos_walks[i])). This populates the attr list, but then I run into another error:
Part 2:
File "/r2venv/lib/python3.9/site-packages/pyrdf2vec/rdf2vec.py", line 271, in _update
    tmp.pop(self._pos_walks[i])
IndexError: pop index out of range
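Both IndexErrors can be reproduced with plain Python lists, independent of pyrdf2vec. The snippet below is only an illustration of the list semantics involved; the variable names mirror those in the tracebacks, but the values are made up.

```python
attr = []              # an empty list, like attr in Part 1
tmp = list(range(24))  # a list of length 24, like tmp in Part 2

# Part 1: item assignment never grows a list, so any index fails on an empty one.
try:
    attr[0] = tmp[0]
except IndexError as err:
    part1_error = str(err)  # "list assignment index out of range"

# insert() accepts any position; a position past the end simply appends.
attr.insert(5, "value")     # attr is now ["value"]

# Part 2: pop() requires an existing index, and 25 is past the end of tmp.
try:
    tmp.pop(25)
except IndexError as err:
    part2_error = str(err)  # "pop index out of range"
```

Switching from assignment to insert() therefore hides the first error without addressing why the positions and the list lengths disagree.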
The Part 2 error happens because tmp is a list of length 24, and self._pos_walks[i] is 25. The for loop in line 271 iterates through the first elements of self._pos_walks (6 in my case, all with values lower than 24) and populates attr, but fails to continue because it reaches the nonexistent pop index self._pos_walks[i] = 25.
Steps to Reproduce
Replace entities in fit_transform(kg, entities) in pyrdf2vec/examples/countries.py with a custom list. I used a list of 31 entities as a test case.
Environment
Thanks for looking into it!