Objective:

A large graph has already been generated, so its distribution information is known (bsbm/model/distrib/*.csv).

When evaluating source selection, we can dynamically decrease the number of sources.

The challenge is to extract a subset of sources, while keeping the same distributional characteristics.

Approach 1: Using histograms

Query 3:

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?product ?label
WHERE {
    ?product rdfs:label ?label .
    # const ?ProductType
    ?product a ?ProductType .

    # const ?ProductFeature1
    ?product bsbm:productFeature ?ProductFeature1 .
    ?product bsbm:productPropertyNumeric1 ?p1 .
    # const ?x $ ?p1 
    FILTER ( ?p1 > ?x ) 
    ?product bsbm:productPropertyNumeric3 ?p3 .
    # const ?y $ ?p3
    FILTER (?p3 < ?y )
    OPTIONAL { 
        # const ?ProductFeature2
        ?product bsbm:productFeature ?ProductFeature2 .
        ?product rdfs:label ?testVar 
    }
    FILTER (!bound(?testVar)) 
}
ORDER BY ?label
LIMIT 10

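The following query gives the distribution information for Query 3: for each (ProductType, ProductFeature1) pair, the number of shops (named graphs) that contain it and the list of those shops.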

PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT 
    ?ProductType 
    ?ProductFeature1
    (COUNT(DISTINCT ?shop) as ?nbShop)
    (GROUP_CONCAT(DISTINCT ?shop; SEPARATOR="|") as ?candidateShop)
WHERE {
    GRAPH ?shop {
        ?product a ?ProductType .
        ?product bsbm:productFeature ?ProductFeature1 .
    }
}
GROUP BY ?ProductType ?ProductFeature1

For each CSV, the (ProductType, ProductFeature) pairs can be binned into frequency groups according to the number of shops that contain them:

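import pandas as pd

n_levels = 3  # 0 = rare, 1 = common, 2 = frequent

# One row per (ProductType, ProductFeature1) pair, as returned by the query above
df = pd.read_csv("q03.csv")
# Turn the "|"-separated shop list into a Python list
df["candidateShop"] = df["candidateShop"].str.split("|")
# Equal-width bins over nbShop; the label is the frequency level
df["category"] = pd.cut(df["nbShop"], bins=n_levels, labels=range(n_levels))

# Number of pairs in each frequency level
print(df["category"].value_counts())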

Output:

Category    Number of (ProductType, ProductFeature) pairs
2             4094
1              99
0              7
Name: category, dtype: int64
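
To make the binning concrete: with an integer `bins`, `pd.cut` splits the range between the smallest and largest `nbShop` into that many equal-width intervals, so the levels reflect value ranges rather than equally sized groups. A minimal sketch on made-up values (the numbers below are illustrative and not taken from q03.csv):

```python
import pandas as pd

# Toy nbShop values (hypothetical, not taken from q03.csv)
nb_shop = pd.Series([1, 2, 3, 50, 98, 99, 100])

# bins=3 splits [min, max] into three equal-width intervals
intervals = pd.cut(nb_shop, bins=3)
levels = pd.cut(nb_shop, bins=3, labels=[0, 1, 2])

# Show each value next to its interval and its frequency level
print(pd.DataFrame({"nbShop": nb_shop, "interval": intervals, "level": levels}))
```

If equally populated levels were preferred, `pd.qcut` would be the quantile-based alternative.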

Extracting 50% of the current graph then amounts to grouping the rows by category and sampling each group with a fraction of 0.5.
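
A minimal sketch of this step, assuming pandas 1.1 or later (for `groupby(...).sample`) and a toy table standing in for the binned q03.csv. The data values are made up, and mapping the sampled rows back to sources by taking the union of their `candidateShop` lists is an assumption rather than something the notes above specify:

```python
import pandas as pd

# Hypothetical stand-in for the binned q03.csv table (values are made up)
nb_shop = [1, 2, 3, 4, 50, 60, 99, 100]
df = pd.DataFrame({
    "ProductType": [f"type{i}" for i in range(len(nb_shop))],
    "nbShop": nb_shop,
    "candidateShop": [[f"shop{j}" for j in range(n)] for n in nb_shop],
})
df["category"] = pd.cut(df["nbShop"], bins=3, labels=[0, 1, 2])

# Stratified sampling: keep 50% of the rows within each frequency level
sampled = df.groupby("category", observed=True).sample(frac=0.5, random_state=0)

# Compare the share of each level before and after sampling
before = df["category"].value_counts(normalize=True).sort_index()
after = sampled["category"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"before": before, "after": after}))

# Assumption: the retained sources are the shops listed by the sampled rows
kept_shops = set().union(*sampled["candidateShop"])
print(len(kept_shops), "shops kept")
```

Sampling within each level keeps the level proportions stable, which is the distributional property this approach targets. It does not by itself guarantee that the number of retained shops is halved, since one shop can appear in many rows.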

GDD-Nantes / FedShop

Downscale the big graph to obtain smaller subgraphs of same distribution #2
