epoz / shmarql

SPARQL endpoint explorer
The Unlicense

Define whitelist and blacklist for RDF2VEC path walks #15

Open ch-sander opened 2 weeks ago

ch-sander commented 2 weeks ago

A whitelist and a blacklist would be a way to define the semantic core properties (and classes?) that the embeddings are based on. This should give better results in semantic discovery.

ch-sander commented 2 weeks ago

Inspired by https://github.com/hassanhajj910/cidoc2vec/tree/master

ch-sander commented 2 weeks ago

How about defining the config as part of a shmarql ontology:

@prefix shmarql: <http://shmarql.com/> .
@prefix grace: <http://graceful17.org/ontology/> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .

# The property for rdf2vec implementation in shmarql
shmarql:vec a owl:ObjectProperty.

# a blacklist of nodes and edges to be ignored (if no whitelist)
shmarql:blacklist [
    a owl:ObjectProperty;
    owl:oneOf (grace:cool_id grace:uuid5)
].

# a whitelist of nodes and edges to be included (blacklist being ignored)
shmarql:whitelist [
    a owl:ObjectProperty;
    owl:oneOf (grace:event grace:entitlement)
].

This is not yet a sound example, I guess. Maybe it's better to define the three as owl:ObjectProperty and apply them to a concrete shmarql instance as the subject of the config, instead of specifying this in the TBox.

Maybe the two lists could even include/exclude entire namespaces (e.g. PROV or OWL for exclusion)?
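
Namespace-level entries could probably be handled by prefix matching on the IRI string. A minimal sketch, assuming terms can be treated as plain IRI strings; `in_namespaces` and `EXCLUDED_NAMESPACES` are hypothetical names, not part of shmarql:

```python
# Hypothetical helper: treat each list entry as a namespace prefix.
PROV = "http://www.w3.org/ns/prov#"
OWL = "http://www.w3.org/2002/07/owl#"

EXCLUDED_NAMESPACES = (PROV, OWL)


def in_namespaces(iri, namespaces):
    """Return True if the IRI string starts with any of the namespace prefixes."""
    # str.startswith accepts a tuple of prefixes; an empty tuple never matches.
    return str(iri).startswith(tuple(namespaces))


print(in_namespaces("http://www.w3.org/ns/prov#wasDerivedFrom", EXCLUDED_NAMESPACES))
print(in_namespaces("http://graceful17.org/ontology/event", EXCLUDED_NAMESPACES))
```

The first call matches the PROV prefix, the second matches neither, so the same check could back both an exclusion and an inclusion list.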

For the implementation maybe just change https://github.com/epoz/shmarql/blob/8811fddb8f7f4b104e504437564616cd2f290776/src/app/rdf2vec.py#L55-L58

to something like

    only_subjects = set()
    for s, p, o, _ in triple_func(None, None, None):
        terms = (s, p, o)
        if whitelist:
            # A non-empty whitelist takes precedence; the blacklist is ignored.
            keep = all(t in whitelist for t in terms)
        else:
            # Otherwise drop triples touching a blacklisted term (keeps all if the list is empty).
            keep = all(t not in (blacklist or ()) for t in terms)
        if keep:
            as_ints.append((nodemap[s], edgemap[p], nodemap[o]))
            only_subjects.add(nodemap[s])
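
The intended precedence (whitelist first, blacklist only if there is no whitelist, everything kept if neither is set) can be checked on toy data. This is only a sketch: `keep_triple` is a hypothetical helper, and the compact strings stand in for whatever term representation `triple_func` actually yields:

```python
def keep_triple(s, p, o, whitelist=None, blacklist=None):
    """Decide whether a triple should take part in the walks (hypothetical helper)."""
    terms = (s, p, o)
    if whitelist:
        # A non-empty whitelist wins; the blacklist is ignored.
        return all(t in whitelist for t in terms)
    if blacklist:
        # No whitelist: drop any triple that touches a blacklisted term.
        return all(t not in blacklist for t in terms)
    # No lists configured: keep everything.
    return True


# Toy triples; the identifiers are made up for illustration.
triples = [
    ("ex:a", "grace:event", "ex:b"),
    ("ex:a", "grace:uuid5", "ex:c"),
]
blacklist = {"grace:uuid5"}
kept = [t for t in triples if keep_triple(*t, blacklist=blacklist)]
print(kept)
```

With only the blacklist set, the `grace:uuid5` triple is dropped and the other one survives. Note that requiring subject, predicate and object to all be whitelisted means a whitelist holding only properties would reject every triple, so the whitelist check may need to be limited to `p`.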

And have a function read in the two lists, in the spirit of this:

from rdflib import Graph, URIRef

def load_lists_from_rdf(rdf_file_path, whitelist_predicate, blacklist_predicate):
    g = Graph()
    g.parse(rdf_file_path, format="turtle")

    whitelist = set()
    blacklist = set()

    # Might need adjustment to the actual definition of the config.ttl
    WHITELIST_PRED = URIRef(whitelist_predicate)
    BLACKLIST_PRED = URIRef(blacklist_predicate)

    for _, _, o in g.triples((None, WHITELIST_PRED, None)):
        whitelist.add(str(o))
    for _, _, o in g.triples((None, BLACKLIST_PRED, None)):
        blacklist.add(str(o))

    return whitelist, blacklist