IBCNServices / pyRDF2Vec

🐍 Python Implementation and Extension of RDF2Vec
https://pyrdf2vec.readthedocs.io/en/latest/
MIT License

Exception in CommunityWalker #121

Open HeikoPaulheim opened 2 years ago

HeikoPaulheim commented 2 years ago

🐛 Bug

CommunityWalker fails with an exception

Current Behavior

This is the message I get:

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyrdf2vec\walkers\walker.py", line 221, in _proc
    return self._extract(kg, Vertex(entity))  # type: ignore
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyrdf2vec\walkers\community.py", line 343, in _extract
    for walk in self.extract_walks(kg, entity):
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyrdf2vec\walkers\community.py", line 328, in extract_walks
    return [walk for walk in fct_search(kg, entity)]
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyrdf2vec\walkers\community.py", line 234, in _dfs
    pred_obj = self.sampler.sample_hop(
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyrdf2vec\samplers\sampler.py", line 161, in sample_hop
    for pred_obj in kg.get_hops(subj, is_reverse)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pyrdf2vec\graphs\kg.py", line 258, in get_hops
    return self._get_hops(vertex, is_reverse)
  File "C:\ProgramData\Anaconda3\lib\site-packages\cachetools\decorators.py", line 70, in wrapper
    return c[k]
  File "C:\ProgramData\Anaconda3\lib\site-packages\cachetools\ttl.py", line 75, in __getitem__
    link = self.__getlink(key)
  File "C:\ProgramData\Anaconda3\lib\site-packages\cachetools\ttl.py", line 205, in __getlink
    value = self.__links[key]
  File "C:\ProgramData\Anaconda3\lib\site-packages\cachetools\keys.py", line 19, in __hash__
    self.__hashvalue = hashvalue = hash(self)
TypeError: unhashable type: 'list'
"""

The above exception was the direct cause of the following exception:

TypeError                                 Traceback (most recent call last)
<ipython-input-4-e6ebb66639bc> in <module>
      1 transformer = RDF2VecTransformer(walkers=walkers, embedder=Word2Vec(sg=1, vector_size=50, hs=1, window=5, min_count=0))
      2 #transformer = RDF2VecTransformer(walkers=walkers, embedder=Word2Vec())
----> 3 embeddings,_ = transformer.fit_transform(kg, entities)

C:\ProgramData\Anaconda3\lib\site-packages\pyrdf2vec\rdf2vec.py in fit_transform(self, kg, entities, is_update)
    141         """
    142         self._is_extract_walks_literals = True
--> 143         self.fit(self.get_walks(kg, entities), is_update)
    144         return self.transform(kg, entities)
    145 

C:\ProgramData\Anaconda3\lib\site-packages\pyrdf2vec\rdf2vec.py in get_walks(self, kg, entities)
    176         tic = time.perf_counter()
    177         for walker in self.walkers:
--> 178             walks += walker.extract(kg, entities, self.verbose)
    179         toc = time.perf_counter()
    180 

C:\ProgramData\Anaconda3\lib\site-packages\pyrdf2vec\walkers\community.py in extract(self, kg, entities, verbose)
    303         """
    304         self._community_detection(kg)
--> 305         return super().extract(kg, entities, verbose)
    306 
    307     def extract_walks(self, kg: KG, entity: Vertex) -> List[Walk]:

C:\ProgramData\Anaconda3\lib\site-packages\pyrdf2vec\walkers\walker.py in extract(self, kg, entities, verbose)
    153 
    154         with multiprocessing.Pool(process, self._init_worker, [kg]) as pool:
--> 155             res = list(
    156                 tqdm(
    157                     pool.imap(self._proc, entities),

C:\ProgramData\Anaconda3\lib\site-packages\tqdm\std.py in __iter__(self)
   1171         # (note: keep this check outside the loop for performance)
   1172         if self.disable:
-> 1173             for obj in iterable:
   1174                 yield obj
   1175             return

C:\ProgramData\Anaconda3\lib\multiprocessing\pool.py in next(self, timeout)
    866         if success:
    867             return value
--> 868         raise value
    869 
    870     __next__ = next                    # XXX

TypeError: unhashable type: 'list'

Steps to Reproduce

Minimal code snippet:

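from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.walkers import CommunityWalker

# kg and entities are built earlier in the script (not shown in this report).
walker = CommunityWalker(4, 500)  # depth 4, at most 500 sampled walks
walkers = []
for i in range(1):
    walkers.append(walker)

transformer = RDF2VecTransformer(walkers=walkers, embedder=Word2Vec(sg=1, vector_size=50, hs=1, window=5, min_count=0))
embeddings, _ = transformer.fit_transform(kg, entities)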

This example works with all other walk types (HALK, NGram, etc.), but not community walks.

Environment

GillesVandewiele commented 2 years ago

Hi Heiko 👋

Are you on the most recent version of pyRDF2Vec? I seem to be unable to reproduce the error. However, I see that you only extract 500 (randomly sampled) walks, so it might be that this bug only occurs very sporadically. Could you perhaps add some np.random.seed() on top of your script to make it fully reproducible?
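A minimal sketch of what such seeding could look like. The helper names seed_everything and sample_walk_ids are hypothetical and only stand in for the sampling that happens during walk extraction; they are not part of pyRDF2Vec. Note also that walks are extracted in worker processes (see the traceback above), and freshly spawned workers do not necessarily inherit the parent's random state, so seeding at the top of the script may not be enough on its own.

```python
import random


def seed_everything(seed):
    """Seed the random number generators a script like this is likely to touch."""
    random.seed(seed)
    try:
        import numpy as np  # optional: only seeded when NumPy is installed
        np.random.seed(seed)
    except ImportError:
        pass


def sample_walk_ids(n_candidates, max_walks):
    # Toy stand-in for "randomly sample max_walks walks out of n_candidates".
    return random.sample(range(n_candidates), max_walks)


seed_everything(42)
first = sample_walk_ids(10_000, 500)

seed_everything(42)
second = sample_walk_ids(10_000, 500)

print(first == second)  # True: the same seed gives the same sample
```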

It seems that at some point, a list (instead of a Vertex object) is appended to a walk. The walk thus probably looks something like ([obj], v1, v2, v3, v3, ...), and hashing it is what crashes (the double-edged sword of Python...). I cannot immediately find where this happens, however.
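To illustrate the failure mode: the traceback ends inside cachetools, which hashes the arguments of a cached call to build its lookup key. The sketch below reproduces the same error message with the standard library's functools.lru_cache as a stand-in for cachetools; the get_hops function here is a toy, not the pyRDF2Vec method of the same name.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def get_hops(vertex, is_reverse=False):
    # Toy cached lookup: the cache has to hash (vertex, is_reverse) to find the entry.
    return (vertex, is_reverse)


def try_lookup(vertex):
    try:
        get_hops(vertex)
        return "ok"
    except TypeError as exc:
        return str(exc)


print(try_lookup("v1"))    # ok
print(try_lookup(["v1"]))  # unhashable type: 'list'
```

A single list slipping in where a hashable vertex is expected is enough to trigger the error, which would fit a bug that only shows up for some sampled walks.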

A fix for this, which would also avoid these kinds of problems in the future, would be to create a dedicated Walk class instead of working with the tuples that we extend. This Walk class could then have an add_hop method or something similar in which we can do input checking.

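class Walk:
    ...
    def add_hop(self, vertex):
        # Reject anything that is not a Vertex (e.g. a stray list) right away.
        if not isinstance(vertex, Vertex):
            raise TypeError(f"expected a Vertex, got {type(vertex).__name__}")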
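A slightly fuller, self-contained sketch of the Walk idea above, under stated assumptions: the Vertex dataclass is a stand-in so the example runs without pyRDF2Vec, and the _hops attribute and as_tuple method are hypothetical names, not existing API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    """Stand-in for pyRDF2Vec's Vertex, only here to keep the sketch runnable."""
    name: str


class Walk:
    def __init__(self, root):
        self._hops = []
        self.add_hop(root)

    def add_hop(self, vertex):
        # Fail at the point where the bad value is added, not later when it is hashed.
        if not isinstance(vertex, Vertex):
            raise TypeError(f"expected a Vertex, got {type(vertex).__name__}")
        self._hops.append(vertex)

    def as_tuple(self):
        # Hashable view of the walk, matching the tuples used so far.
        return tuple(self._hops)


walk = Walk(Vertex("v1"))
walk.add_hop(Vertex("v2"))

try:
    walk.add_hop([Vertex("v3")])  # the kind of value suspected in this bug
    error = None
except TypeError as exc:
    error = str(exc)

print(error)  # expected a Vertex, got list
```

With a check like this, the traceback would point at the line that appends the list rather than at the cache lookup several calls later.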

Btw, for smaller datasets, setting the number of walks to None will exhaustively extract all walks, which tends to get better results. (i.e. CommunityWalker(4, None))

HeikoPaulheim commented 2 years ago

Hi Gilles,

I'm using the latest version that I get through pip install (0.2.3), but I haven't rebuilt a more recent version on my own.

Actually, it seems to work with the None option (and yes, it's a fairly small graph), but still, it's rather strange behavior.

Thanks for your help! 😃

GillesVandewiele commented 2 years ago

It's definitely strange behaviour and it is a bug (so leaving this issue open until fixed). It unfortunately seems to only happen very sporadically and only in the DFS (not the BFS), so it will be a very fun bug to solve ;)

Thanks for reporting this btw!