IBCNServices / pyRDF2Vec

šŸ Python Implementation and Extension of RDF2Vec
https://pyrdf2vec.readthedocs.io/en/latest/
MIT License
244 stars, 50 forks

Walk option "with_reverse" not honored in remote KG settings #106

Open rgrenz opened 2 years ago

rgrenz commented 2 years ago

šŸ› Bug

Hi, thank you for making this library available to everyone! It is of great use to my university research project. I believe I have spotted a bug concerning the with_reverse walk option:

Expected Behavior

When generating walks using the RandomWalker with the with_reverse=True flag, the returned walks should contain zero or more predecessor triples, followed by the vertex of interest, followed by zero or more successor triples. In particular, it should be possible to read the returned walks from left to right as a valid traversal of the directed graph. This behavior should not depend on the source of the KG.

Current Behavior

When using a local KG, the returned walks are well formed and meet the requirements above. If the KG instead uses a remote SPARQL source, the resulting walks are no longer legal traversals of the graph. Instead, the generated walks consist of a mirrored successor part, followed by the vertex of interest, followed by another successor part (in the correct order).
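To make the difference between the expected and the observed walks concrete, here is a minimal sketch of a check that a walk reads left to right as a valid traversal. The helper is_valid_traversal and the ex: identifiers are illustrative assumptions and not part of pyRDF2Vec; the toy triples only mimic the shape of the problem.

```python
from typing import Sequence, Set, Tuple

Triple = Tuple[str, str, str]


def is_valid_traversal(walk: Sequence[str], triples: Set[Triple]) -> bool:
    """Check that a walk (vertex, predicate, vertex, ...) follows edge directions.

    Hypothetical helper for illustration; not part of pyRDF2Vec.
    """
    # A well-formed walk alternates vertices and predicates, so its length is odd.
    if len(walk) % 2 == 0:
        return False
    # Every consecutive (subject, predicate, object) window must exist in the graph.
    return all(
        (walk[i], walk[i + 1], walk[i + 2]) in triples
        for i in range(0, len(walk) - 2, 2)
    )


# Toy graph (made-up identifiers): one predecessor and two successors of ex:Root.
TOY_TRIPLES = {
    ("ex:A", "ex:p", "ex:Root"),
    ("ex:Root", "ex:q", "ex:B"),
    ("ex:Root", "ex:r", "ex:C"),
}

# Expected shape: predecessor triple, vertex of interest, successor triple.
expected_walk = ("ex:A", "ex:p", "ex:Root", "ex:q", "ex:B")
# Observed shape with a remote KG: a mirrored successor in front of the root.
mirrored_walk = ("ex:C", "ex:r", "ex:Root", "ex:q", "ex:B")

print(is_valid_traversal(expected_walk, TOY_TRIPLES))
print(is_valid_traversal(mirrored_walk, TOY_TRIPLES))
```

The mirrored walk fails the check because ("ex:C", "ex:r", "ex:Root") is not a triple in the graph; only its inverse is.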

Steps to Reproduce

from pyrdf2vec import RDF2VecTransformer
from pyrdf2vec.embedders import Word2Vec
from pyrdf2vec.graphs import KG
from pyrdf2vec.walkers import RandomWalker

dbpedia = KG("https://dbpedia.org/sparql")

transformer = RDF2VecTransformer(
    Word2Vec(sg=0, vector_size=10),
    walkers=[RandomWalker(max_walks=1, max_depth=1, with_reverse=True, md5_bytes=None)],
    verbose=1
)

transformer.get_walks(dbpedia, ["http://dbpedia.org/resource/The_Matrix"])

"""
 e.g. [[('http://dbpedia.org/resource/The_Wachowskis',
   'http://dbpedia.org/property/writer',
   'http://dbpedia.org/resource/The_Matrix',
   'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
   'http://dbpedia.org/class/yago/Wikicat1990sScienceFictionFilms')]]

Notice that the first triple does not exist on DBpedia, only its inverse does.
"""

Environment

Possible Solution

The fetch_hops() function linked below should support the with_reverse option, as its local counterpart _get_hops() does. However, this probably also requires changes to the querying and caching code. https://github.com/IBCNServices/pyRDF2Vec/blob/fb7da659f67b6486a403a46bc2d3c589b802304c/pyrdf2vec/graphs/kg.py#L241-L256

rgrenz commented 2 years ago

Just noticed that this may already be covered by your TODO note in #67.

GillesVandewiele commented 2 years ago

Thank you for reporting this, @rgrenz. You are correct that the behaviour of with_reverse seems faulty and should be fixed (it is on our roadmap). Unfortunately, our bandwidth is rather limited, so this might take some time. Feel free to open a PR if you fix it locally. You are spot on that fetch_hops needs to be extended to include reverse walking logic, which should use a different SPARQL query (with the object rather than the subject filled in). I think it can be fixed by extending only get_query (https://github.com/IBCNServices/pyRDF2Vec/blob/main/pyrdf2vec/connectors.py#L136) and the suggested fetch_hops (the latter should do nothing more than pass with_reverse on to get_query).
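As a rough sketch of the idea (object rather than subject filled in), the two query shapes could look like the following. The function build_hop_query and its reverse parameter are hypothetical and only illustrate the approach; they are not the library's actual get_query signature, and the caching code would still need to distinguish the two directions.

```python
def build_hop_query(entity: str, reverse: bool = False) -> str:
    """Build a SPARQL query for the one-hop neighbourhood of an entity.

    Hypothetical helper for illustration; not pyRDF2Vec's get_query.
    """
    if reverse:
        # Predecessors: the entity is bound as the object of the triple.
        return f"SELECT ?s ?p WHERE {{ ?s ?p <{entity}> . }}"
    # Successors: the entity is bound as the subject of the triple.
    return f"SELECT ?p ?o WHERE {{ <{entity}> ?p ?o . }}"


entity = "http://dbpedia.org/resource/The_Matrix"
print(build_hop_query(entity))
print(build_hop_query(entity, reverse=True))
```

With a query like the second one, each result row (?s, ?p) corresponds to an existing triple (?s, ?p, entity), so placing it before the vertex of interest keeps the walk a valid left-to-right traversal.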