IBCNServices / pyRDF2Vec

🐍 Python Implementation and Extension of RDF2Vec
https://pyrdf2vec.readthedocs.io/en/latest/
MIT License

Cryptical error msg for duplicates in entities #135

Open acxcv opened 1 year ago

acxcv commented 1 year ago

🐛 Bug

When trying to create embeddings for a custom list of DBPedia entities using RDF2VecTransformer.fit_transform, I'm encountering the following bug in RDF2VecTransformer._update:

Part 1:

File "/r2venv/lib/python3.9/site-packages/pyrdf2vec/rdf2vec.py", line 271, in _update
    attr[pos] = tmp.pop(self._pos_walks[i])
IndexError: list assignment index out of range

Because attr[pos] = tmp.pop(self._pos_walks[i]) tries to assign a value at index pos of an empty list, attr, I tried changing it to attr.insert(pos, tmp.pop(self._pos_walks[i])). This populates the attr list, but then I run into another error:

Part 2:

File "/r2venv/lib/python3.9/site-packages/pyrdf2vec/rdf2vec.py", line 271, in _update
    tmp.pop(self._pos_walks[i])
IndexError: pop index out of range

This happens because tmp is a list of length 24, and self._pos_walks[i] is 25. The for loop in line 271 iterates through the first elements of self._pos_walks (6 in my case, all with values lower than 24) and populates attr, but fails to continue because it reaches the nonexistent pop index self._pos_walks[i] = 25.
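The two failures can be reproduced in isolation with plain Python lists. This is only a minimal sketch, not pyRDF2Vec's actual _update code: the helper try_call and the concrete values are illustrative assumptions chosen to mirror the lengths reported above.

```python
def try_call(fn):
    """Run fn and return the IndexError message, or None if it succeeds."""
    try:
        fn()
    except IndexError as exc:
        return str(exc)
    return None


# Part 1: item assignment cannot grow a list, so writing to index 0
# of an empty list raises IndexError.
attr = []


def assign_to_empty():
    attr[0] = "walk"


part1 = try_call(assign_to_empty)

# list.insert, by contrast, grows the list, which is why the workaround
# gets past the first error.
attr.insert(0, "walk")

# Part 2: popping index 25 from a list of length 24 fails, because the
# valid indices only run from 0 to 23.
tmp = list(range(24))
part2 = try_call(lambda: tmp.pop(25))

print(part1)  # list assignment index out of range
print(part2)  # pop index out of range
```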

Steps to Reproduce

  1. Modify entities in fit_transform(kg, entities) in pyrdf2vec/examples/countries.py. I used a list of 31 entities as a test case.
    entities = [
        'http://dbpedia.org/resource/U2',
        'http://dbpedia.org/resource/Rock_music',
        'http://dbpedia.org/resource/Poems_by_Edgar_Allan_Poe',
        'http://dbpedia.org/resource/Post-punk',
        'http://dbpedia.org/resource/U2',
        'http://dbpedia.org/resource/Japanese_yen',
        'http://dbpedia.org/resource/Rock_music',
        'http://dbpedia.org/resource/Bono',
        'http://dbpedia.org/resource/Revolutionary',
        'http://dbpedia.org/resource/Rock_music',
        'http://dbpedia.org/resource/U2',
        'http://dbpedia.org/resource/Acoustic_guitar',
        'http://dbpedia.org/resource/The_Edge',
        'http://dbpedia.org/resource/Rhythm_and_blues',
        'http://dbpedia.org/resource/Larry_Mullen_Jr.',
        'http://dbpedia.org/resource/Punk_rock',
        'http://dbpedia.org/resource/U2',
        'http://dbpedia.org/resource/Live_Aid',
        'http://dbpedia.org/resource/The_Joshua_Tree',
        'http://dbpedia.org/resource/Music_of_Ireland',
        'http://dbpedia.org/resource/Billboard_200',
        'http://dbpedia.org/resource/Without_You_(Badfinger_song)',
        'http://dbpedia.org/resource/The_Joshua_Tree',
        'http://dbpedia.org/resource/U2',
        'http://dbpedia.org/resource/Achtung_Baby',
        'http://dbpedia.org/resource/U2',
        "http://dbpedia.org/resource/All_That_You_Can't_Leave_Behind",
        'http://dbpedia.org/resource/Post-punk',
        'http://dbpedia.org/resource/U2',
        'http://dbpedia.org/resource/Punk_rock',
        'http://dbpedia.org/resource/Dublin',
    ]
  2. Change line 271 in rdf2vec.py._update as described in part 1
  3. Run your modified version of pyrdf2vec/examples/countries.py

Environment

Thanks for looking into it!

acxcv commented 1 year ago

I forgot to mention that the code executes flawlessly with the original entities from countries.py, regardless of the above changes to rdf2vec.py.

However, in my example with custom entities, if I use a subset of the custom entities list, entities[:22], a different error occurs:

File "/rdf2vec/r2venv/lib/python3.9/site-packages/pyrdf2vec/embedders/word2vec.py", line 73, in transform raise ValueError( ValueError: The entities must have been provided to fit() first before they can be transformed into a numerical vector.

To recap:

Does anybody know what's going on here?

GillesVandewiele commented 1 year ago

I don't have much bandwidth to look at this atm. But does padding the list with some dummy entities fix the issue?

Do your entities occur in the KG? Shouldn't it be https instead of http for instance? Maybe test if you can extract walks for a single entity?

acxcv commented 1 year ago

Hi Gilles,

Thanks for your reply.

The problem was simply that there were duplicates in the entities list.
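For anyone hitting the same error, removing the duplicates before calling fit_transform is a short workaround. The sketch below uses a shortened, illustrative entity list rather than the full one from the report, and relies only on the fact that dict keys are unique and keep their insertion order.

```python
# Shortened example list with repeated IRIs (illustrative only).
entities = [
    "http://dbpedia.org/resource/U2",
    "http://dbpedia.org/resource/Rock_music",
    "http://dbpedia.org/resource/U2",
    "http://dbpedia.org/resource/Dublin",
    "http://dbpedia.org/resource/Rock_music",
]

# dict.fromkeys keeps the first occurrence of each entity and preserves
# the original order, unlike set(), which does not guarantee any order.
unique_entities = list(dict.fromkeys(entities))

print(unique_entities)
```

The deduplicated list can then be passed on in place of the original one. Keeping the order matters if the resulting embeddings are later matched back to entities by position.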

GillesVandewiele commented 1 year ago

OK, thanks for the update! I will re-open the issue, however, since this is something we could detect for users and raise a clearer error for.
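Such a check could look roughly like the following. This is a hypothetical sketch, not code from pyRDF2Vec: the function names find_duplicates and check_unique_entities and the wording of the error message are assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List


def find_duplicates(entities: Iterable[str]) -> List[str]:
    """Return each entity that occurs more than once, in first-seen order."""
    counts = Counter(entities)
    return [entity for entity, count in counts.items() if count > 1]


def check_unique_entities(entities: Iterable[str]) -> None:
    """Raise a descriptive ValueError if the entities contain duplicates."""
    duplicates = find_duplicates(entities)
    if duplicates:
        raise ValueError(
            "The entities list contains duplicates: " + ", ".join(duplicates)
        )


# Usage: validate the input before any walks are extracted.
try:
    check_unique_entities(["a", "b", "a"])
except ValueError as exc:
    print(exc)
```

Failing early with the offending entities named would point users at the cause directly, instead of an IndexError raised deep inside the update step.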

Ritten11 commented 1 year ago

Hi!

Any updates on this subject? I am running into similar issues. The relevant portion of my code is as follows:

55 def fit_embedding(transformer, knowledge_graph, nodes, epochs_list, rep, sub_dir):
56    """
57
58    :param transformer: The RDF2VecTransformer used for making the embeddings
59    :param knowledge_graph: Instance of RDF2Vec.Graph that is to be embedded
60    :param nodes: Instances from which an embedding should be made. Should be a list of strings.
61    :param epochs_list: List of epochs at which the embedding should be saved
62    :param rep: The current repetition of the embedding. Sometimes multiple embeddings of the same graph are made, and
63    this is needed for saving the embedding to the right directory
64    :param sub_dir: subdirectory to which the embedding should be saved.
65    :return:
66    """
67    # loss_df = pd.DataFrame(columns=['epoch', 'loss'])
68    print('Starting fitting of word2vec embedding:')
69
70    bar = progressbar.ProgressBar(maxval=max(epochs_list), widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
71    bar.start()
72    walks = transformer.get_walks(knowledge_graph, nodes)
73    for e in range(max(epochs_list)):
74        transformer.embedder.fit(walks, False)
75        if (e+1) in epochs_list:
76            embeddings, literals = transformer.transform(knowledge_graph, nodes)
77            save_embeddings(embeddings, literals, e+1, rep, sub_dir)
78    bar.finish()
79    return 

Note that the nodes object is exactly the same for the transformer.get_walks() call and the transformer.transform() calls.

This piece of code produces the following error:

File "/create_embedding.py", line 76, in fit_embedding 
embeddings, literals = transformer.transform(knowledge_graph, nodes) 
File "/.pyenv/versions/KRW_project-3.10.4/lib/python3.10/site-packages/pyrdf2vec/rdf2vec.py", line 214, in transform 
embeddings = self.embedder.transform(entities)    
File "/.pyenv/versions/KRW_project-3.10.4/lib/python3.10/site-packages/pyrdf2vec/embedders/word2vec.py", line 73, in transform
raise ValueError(
ValueError: The entities must have been provided to fit() first before they can be transformed into a numerical vector. 

The initialization of the RDF2Vec transformer is done using:

def init_transformer(seed):
    # Create our transformer, setting the embedding & walking strategy.
    transformer = RDF2VecTransformer(
        Word2Vec(epochs=1, workers=10),
        walkers=[RandomWalker(4, 10, with_reverse=True, n_jobs=10, random_state=seed)],
        verbose=2
    )
    return transformer

At this point, I'm not sure where to look for a potential cause for this error. Note that when the RandomWalker is initialized with with_reverse=False, the script runs without throwing any errors (although I have yet to confirm that it produces meaningful embeddings).

Any suggestions are welcome!
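One way to narrow this down is to check whether every requested node actually appears as a token in the extracted walks, since the error message says the embedder did not see some entity during fit(). The sketch below is a diagnostic under assumptions, not part of pyRDF2Vec: it assumes the object returned by transformer.get_walks() is a nested list/tuple structure whose leaves are strings, and the helper names collect_tokens and entities_missing_from_walks are hypothetical.

```python
from typing import Iterable, List, Set


def collect_tokens(walks) -> Set[str]:
    """Collect every string found anywhere in a nested list/tuple of walks."""
    tokens: Set[str] = set()
    stack = [walks]
    while stack:
        item = stack.pop()
        if isinstance(item, str):
            tokens.add(item)
        else:
            # Assumed to be a list or tuple of further walks or tokens.
            stack.extend(item)
    return tokens


def entities_missing_from_walks(walks, entities: Iterable[str]) -> List[str]:
    """Return the entities that never occur as a token in any walk."""
    tokens = collect_tokens(walks)
    return [entity for entity in entities if entity not in tokens]


# Toy data standing in for real walks: "c" never shows up in any walk.
toy_walks = [[("a", "p", "b")], [("b", "q", "a")]]
print(entities_missing_from_walks(toy_walks, ["a", "b", "c"]))  # ['c']
```

In the script above, this would be called as entities_missing_from_walks(walks, nodes) right after get_walks(). A non-empty result would identify which nodes the embedder has never seen.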