IBCNServices / pyRDF2Vec

🐍 Python Implementation and Extension of RDF2Vec
https://pyrdf2vec.readthedocs.io/en/latest/
MIT License

SPARQLConnector error when building locally #76

Closed · wouterbeek closed this issue 2 years ago

wouterbeek commented 2 years ago

🐛 Bug

When I build this project locally (with pip install .), I get the following error message:

TypeError: unhashable type: 'SPARQLConnector'

I do not get this error message when I install this project remotely (with pip install pyRDF2Vec).

Expected Behavior

No error, regardless of whether I build locally or install from a remote release.

Current Behavior

The error message shared above.

Steps to Reproduce

  1. Clone this repository.
  2. Enter the directory of the resulting clone and run pip install .
  3. Run python main.py with the following content in file main.py:
import pandas as pd

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

# Load the sample entities and point the KG at DBpedia's SPARQL endpoint.
data = pd.read_csv(
    "https://raw.githubusercontent.com/IBCNServices/pyRDF2Vec/master/samples/countries-cities/entities.tsv",
    sep="\t",
)
entities = [entity for entity in data["location"]]
knowledge_graph = KG("https://dbpedia.org/sparql")

# Extract random walks and train Word2Vec embeddings for the entities.
transformer = RDF2VecTransformer(
    Word2Vec(epochs=10),
    walkers=[RandomWalker(4, 10, with_reverse=False, n_jobs=2)],
)
embeddings, literals = transformer.fit_transform(knowledge_graph, entities)
  4. Observe the following traceback:
Traceback (most recent call last):
  File "dbpedia.py", line 12, in <module>
    embeddings, literals = transformer.fit_transform(knowledge_graph, entities)
  File "/home/wouter/.local/lib/python3.8/site-packages/pyrdf2vec/rdf2vec.py", line 146, in fit_transform
    self.fit(kg, entities, is_update)
  File "/home/wouter/.local/lib/python3.8/site-packages/pyrdf2vec/rdf2vec.py", line 107, in fit
    walks = self.get_walks(kg, entities)
  File "/home/wouter/.local/lib/python3.8/site-packages/pyrdf2vec/rdf2vec.py", line 166, in get_walks
    if kg.skip_verify is False and not kg.is_exist(entities):
  File "/home/wouter/.local/lib/python3.8/site-packages/pyrdf2vec/graphs/kg.py", line 374, in is_exist
    responses = [self.connector.fetch(query) for query in queries]
  File "/home/wouter/.local/lib/python3.8/site-packages/pyrdf2vec/graphs/kg.py", line 374, in <listcomp>
    responses = [self.connector.fetch(query) for query in queries]
  File "/home/wouter/.local/lib/python3.8/site-packages/cachetools/__init__.py", line 686, in wrapper
    return c[k]
  File "/home/wouter/.local/lib/python3.8/site-packages/cachetools/__init__.py", line 414, in __getitem__
    link = self.__getlink(key)
  File "/home/wouter/.local/lib/python3.8/site-packages/cachetools/__init__.py", line 501, in __getlink
    value = self.__links[key]
  File "/home/wouter/.local/lib/python3.8/site-packages/cachetools/keys.py", line 19, in __hash__
    self.__hashvalue = hashvalue = hash(self)
TypeError: unhashable type: 'SPARQLConnector'

Environment

Additional remarks

I see the same error message in #64, but the problem description there is very different (use of a specific IDE IIUC).

rememberYou commented 2 years ago

This must be due to the use of attr with cachetools. Removing attr.s from the SPARQLConnector and Connector classes should fix the issue.
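
For context, here is a minimal sketch of why attr.s can make a class unhashable (using a simplified stand-in, not the real connector): attr.s generates __eq__ by default, and, per Python's usual rule, a class that gains __eq__ without an explicit hash has its __hash__ set to None.

import attr

@attr.s
class SPARQLConnector:  # simplified stand-in for the real pyRDF2Vec class
    endpoint = attr.ib(default="https://dbpedia.org/sparql")

# The generated __eq__ sets __hash__ to None, so instances cannot be
# used as (part of) a cache key:
hash(SPARQLConnector())  # TypeError: unhashable type: 'SPARQLConnector'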

GillesVandewiele commented 2 years ago

This does indeed seem to be related to caching (the lowest error on the stack trace comes from site-packages/cachetools/keys.py).

This error is typically thrown in Python when trying to use an unhashable object as a key of a dictionary.

>>> x = [5,3,2]
>>> d = {}
>>> d[x] = 5
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'

However, SPARQLConnector should never be a key in a cache AFAIK? This is something we should perhaps further investigate...

EDIT: I will try to reproduce here locally later, but the fact that the PyPI install works but git clone doesn't probably means we introduced a bug in one of the latest pushes.

bsteenwi commented 2 years ago

cachedmethod is probably defined on a method with a self parameter, so self (here the SPARQLConnector) will end up in the cache key as well.

That throws an error, because cachetools doesn't know how to hash it.

GillesVandewiele commented 2 years ago

Ok indeed: https://github.com/IBCNServices/pyRDF2Vec/blob/master/pyrdf2vec/connectors.py#L123

But this isn't newly introduced? How come this suddenly bugs out? Is this due to the upgrade from cachetools 4.2.2 to 5.0.0 (#74)? A solution would be to remove cachetools and just keep our own cache (a simple dictionary should suffice).

Screenshot from 2022-04-30 08-53-47
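
To make that suspicion concrete, here is a minimal sketch (with a hypothetical Connector class mirroring the pattern in connectors.py, not the real code) of how the 4.2.2 → 5.0.0 upgrade breaks custom key functions: under cachetools 5.x, @cachedmethod passes self to the key function as its first argument, so a key built with partial(hashkey, ...) ends up trying to hash the unhashable connector instance.

import operator
from functools import partial

from cachetools import LRUCache, cachedmethod
from cachetools.keys import hashkey

class Connector:  # hypothetical stand-in for SPARQLConnector
    def __init__(self):
        self.cache = LRUCache(maxsize=128)

    def __eq__(self, other):   # mimics attr.s: defining __eq__ sets
        return NotImplemented  # __hash__ to None, so instances are unhashable

    # Same pattern as connectors.py: a custom key function. Under
    # cachetools 5.x the wrapper computes key(self, query), so hashkey
    # receives the unhashable instance as part of the cache key.
    @cachedmethod(operator.attrgetter("cache"), key=partial(hashkey, "fetch"))
    def fetch(self, query):
        return query  # stand-in for the real SPARQL request

Connector().fetch("SELECT 1")  # TypeError: unhashable type: 'Connector'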

bsteenwi commented 2 years ago

The original idea was to let the user define the cache policy...

Maybe simply defining how to hash the connector object could also resolve this issue.
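
Something along those lines, for instance (a sketch of that idea under the attrs hypothesis above, not the committed fix): attrs only drops __hash__ because eq=True by default, so explicitly requesting a generated hash, or opting out of the generated __eq__, keeps the connector hashable.

import attr

@attr.s(hash=True)  # or @attr.s(eq=False); both leave the class hashable
class SPARQLConnector:  # simplified stand-in again
    endpoint = attr.ib(default="https://dbpedia.org/sparql")

hash(SPARQLConnector())  # works, so cachetools can build its key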

GillesVandewiele commented 2 years ago

Then we will need to fix the breaking change that is mentioned in #74 (see my screenshot). All unit tests are actually failing with errors of this type as well, so it's definitely because of this.

https://github.com/IBCNServices/pyRDF2Vec/runs/6229466856?check_suite_focus=true

rememberYou commented 2 years ago

The issue is also present with the KG class as we cache the hops with the _get_hops() function.

GillesVandewiele commented 2 years ago

I guess an easy fix for now is to revert the upgrade and pin cachetools back to version 4.2.2.

GillesVandewiele commented 2 years ago

I did a reset and a push -f to roll our repo back to the latest working commit. All commits after that point were removed, however, so best to store a copy locally if you want to salvage anything from them.

rememberYou commented 2 years ago

I tried making the fetch and _get_hops functions global (to avoid the object hash conflict) while still using cached, but the error persists. As soon as I delete the cache, everything works again.

GillesVandewiele commented 2 years ago

Perhaps try the fix suggested by cachetools themselves?

The default key function has been changed to ignore its first argument, so this should only affect applications using custom key functions with the @cachedmethod decorator.

fetch and _get_hops both use a custom key function; we should pass None or something as the first argument.
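
Concretely, "ignore its first argument" would look something like this (a hypothetical adaptation of our custom key, not a committed change):

from cachetools.keys import hashkey

# Possible replacement for key=partial(hashkey, "fetch"):
def fetch_key(_self, *args, **kwargs):
    # cachetools 5.x now passes the instance as the first argument;
    # drop it and key only on the hashable query arguments.
    return hashkey("fetch", *args, **kwargs)

# @cachedmethod(operator.attrgetter("cache"), key=fetch_key)
# def fetch(self, query): ...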

rememberYou commented 2 years ago

I've already tried that, but the error is always the same. I will look into fixing this as soon as possible and also update the attr syntax, which has changed since then and is now easier to use.

rememberYou commented 2 years ago

Last comment for this issue: cachetools 5.0.0 works if we simply remove the key, replacing:

@cachedmethod(operator.attrgetter("cache"), key=partial(hashkey, "fetch"))
def foo(self, bar):
    pass

with:

@cachedmethod(operator.attrgetter("cache"))
def foo(self, bar):
    pass

I'll double check that there are no more problems with cachetools and then push this fix.
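
For the record, dropping the custom key works because the default key for @cachedmethod in cachetools 5.x (cachetools.keys.methodkey) already ignores self; it is essentially:

from cachetools.keys import hashkey

def methodkey(self, *args, **kwargs):
    # The instance is deliberately ignored; only the remaining
    # (hashable) arguments make up the cache key.
    return hashkey(*args, **kwargs)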

GillesVandewiele commented 2 years ago

@rememberYou any update on this fix, perhaps? (No urgency though, so take your time!)

rememberYou commented 2 years ago

The commit d039a702c9b072983b5f60ba7892b8d7a2cf1993 on the develop branch has fixed the issue with cachetools. If you clone master it should work too, since you did a git reset back to an older version of cachetools.

@wouterbeek Does it work on your side?

Using the develop branch, I'm able to extract the walks from the TriplyDB endpoint without issues. I haven't tested with the master branch.

wouterbeek commented 2 years ago

@rememberYou Running pip install . on the last commit in master indeed works for me. Thanks for fixing this!