Closed by mhoangvslev 1 year ago
Objective:
Given that a big graph has already been generated, its distribution information is known (bsbm/model/distrib/*.csv).
When evaluating source selection, we can dynamically decrease the number of sources.
The challenge is to extract a subset of sources, while keeping the same distributional characteristics.
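One way to make "same distributional characteristics" concrete is to compare the share of items per group before and after subsetting. The sketch below is only an illustration: the helper names `category_shares` and `max_share_drift`, the toy labels, and the choice of maximum absolute share difference as the measure are all assumptions, not something defined in this issue.

```python
from collections import Counter


def category_shares(labels):
    """Return the share of each label in a non-empty list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


def max_share_drift(full, subset):
    """Largest absolute difference in label share between two label lists."""
    full_shares = category_shares(full)
    subset_shares = category_shares(subset)
    labels = set(full_shares) | set(subset_shares)
    return max(
        abs(full_shares.get(label, 0.0) - subset_shares.get(label, 0.0))
        for label in labels
    )


# Hypothetical group labels for the full graph and for a candidate subset.
full = [0] * 2 + [1] * 4 + [2] * 14
subset = [0] * 1 + [1] * 2 + [2] * 7

print(max_share_drift(full, subset))
```

A drift close to zero suggests the subset keeps the same proportions; a subset that drops a whole group would show a larger value.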
Approach 1: Using histograms
Query 3:

    PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT ?product ?label
    WHERE {
        ?product rdfs:label ?label .
        # const ?ProductType
        ?product a ?ProductType .
        # const ?ProductFeature1
        ?product bsbm:productFeature ?ProductFeature1 .
        ?product bsbm:productPropertyNumeric1 ?p1 .
        # const ?x $ ?p1
        FILTER ( ?p1 > ?x )
        ?product bsbm:productPropertyNumeric3 ?p3 .
        # const ?y $ ?p3
        FILTER (?p3 < ?y )
        OPTIONAL {
            # const ?ProductFeature2
            ?product bsbm:productFeature ?ProductFeature2 .
            ?product rdfs:label ?testVar
        }
        FILTER (!bound(?testVar))
    }
    ORDER BY ?label
    LIMIT 10
The following query gives the distribution information:

    PREFIX bsbm: <http://www4.wiwiss.fu-berlin.de/bizer/bsbm/v01/vocabulary/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

    SELECT ?ProductType ?ProductFeature1
           (COUNT(DISTINCT ?shop) as ?nbShop)
           (GROUP_CONCAT(DISTINCT ?shop, "|") as ?candidateShop)
    WHERE {
        GRAPH ?shop {
            ?product a ?ProductType .
            ?product bsbm:productFeature ?ProductFeature1 .
        }
    }
    GROUP BY ?ProductType ?ProductFeature1
Using each CSV, the (ProductType, ProductFeature) groups can be binned by how many shops contain them:

    import numpy as np
    import seaborn as sns
    import pandas as pd

    n_levels = 3  # 0 = rare, 1 = common, 2 = frequent

    df = pd.read_csv("q03.csv")
    # candidateShop holds the shop graphs as a "|"-separated string
    df["candidateShop"] = df["candidateShop"].str.split("|")
    # Cut the nbShop range into n_levels equal-width bins
    df["category"] = pd.cut(df["nbShop"], bins=n_levels, labels=range(n_levels))
    print(df["category"].value_counts())
Output:

    Group    |(ProductType, ProductFeature)|
    2        4094
    1        99
    0        7
    Name: category, dtype: int64
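To see where the group boundaries come from, `pd.cut` can return the bin edges it used through `retbins=True`. The following is a small sketch on made-up `nbShop` values (the numbers are hypothetical, not taken from `q03.csv`). With an integer `bins`, the edges are equally spaced over the observed range, so the groups are equal-width rather than equal-size.

```python
import pandas as pd

# Hypothetical nbShop values standing in for one column of the distribution CSV.
nb_shop = pd.Series([1, 2, 3, 10, 11, 19, 20], name="nbShop")

n_levels = 3  # 0 = rare, 1 = common, 2 = frequent

# retbins=True also returns the edges of the equal-width bins.
category, edges = pd.cut(
    nb_shop, bins=n_levels, labels=range(n_levels), retbins=True
)

print(edges)              # n_levels + 1 bin edges
print(category.tolist())  # one label per nbShop value
```

Because the bins are equal-width, group sizes can be very uneven. If equal-size groups were wanted instead, `pd.qcut` would be the usual alternative, though that is a different binning than the one used here.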
Extracting 50% of the current graph then amounts to grouping by category and sampling each group with a fraction of 0.5.
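The original snippet for this step is not included in the issue text, so the following is only a minimal sketch of what "grouping and sampling" could look like with pandas. The function name `sample_groups`, the `seed` parameter, and the toy data are assumptions; the column names `nbShop` and `category` follow the code above.

```python
import pandas as pd


def sample_groups(df, frac=0.5, n_levels=3, seed=0):
    """Bin rows by nbShop, then sample the same fraction from every bin."""
    binned = df.copy()
    binned["category"] = pd.cut(
        binned["nbShop"], bins=n_levels, labels=range(n_levels)
    )
    # observed=True skips bins that contain no rows.
    return binned.groupby("category", observed=True).sample(
        frac=frac, random_state=seed
    )


# Hypothetical distribution table; real data would come from q03.csv.
toy = pd.DataFrame({
    "ProductType": [f"type{i}" for i in range(12)],
    "nbShop": [1, 1, 2, 2, 10, 10, 11, 11, 12, 12, 20, 20],
})

half = sample_groups(toy, frac=0.5)
print(half["category"].value_counts().sort_index())
```

Sampling inside each category keeps the rare/common/frequent proportions of the rows roughly unchanged. How the sampled rows are then mapped back to a concrete set of shops is not covered by this sketch.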