RDFLib / prez

Prez is a data-configurable Linked Data API framework that delivers profiles of Knowledge Graph data according to the Content Negotiation by Profile standard.
BSD 3-Clause "New" or "Revised" License

Search MVP default search method with IDN data #153

Open jamiefeiss opened 10 months ago

jamiefeiss commented 10 months ago

Testing the "default" regex search method takes over 30s against the IDN triplestore for the following query:

http://localhost:8000/search?term=open&method=default&limit=10&focus-to-filter[rdf:type]=http%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23Concept&focus-to-filter[skos:inScheme]=https%3A%2F%2Flinked.data.gov.au%2Fdef%2Fdata-access-rights

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prez: <https://prez.dev/>
CONSTRUCT {
  ?hashID a prez:SearchResult ;
    prez:searchResultWeight ?weight ;
    prez:searchResultPredicate ?predicate ;
    prez:searchResultMatch ?match ;
    prez:searchResultURI ?search_result_uri .
  ?search_result_uri ?p ?o1 .
  ?o1 ?p2 ?o2 .
  ?o2 ?p3 ?o3 .
}
WHERE {
  {
    SELECT ?search_result_uri ?predicate ?match ?weight ?hashID
    WHERE {
      ?search_result_uri ?predicate ?match .
      ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
      ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .
      FILTER (
        LCASE(?match) = "open" ||
        REGEX(?match, "^open", "i") ||
        REGEX(?match, "\bopen\b", "i") ||
        REGEX(?match, "open", "i")
      )
      BIND(
        IF(LCASE(?match) = "open", 10,
          IF(REGEX(?match, "^open", "i"), 7,
            IF(REGEX(?match, "\bopen\b", "i"), 5,
              IF(REGEX(?match, "open", "i"), 3, 0)
            )
          )
        ) AS ?weight
      )
      BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match), STR(?weight))))) AS ?hashID)
    }
    LIMIT 10
  }
  {
    ?search_result_uri ?p ?o1 .
    OPTIONAL {
      FILTER(ISBLANK(?o1))
      ?o1 ?p2 ?o2 .
      OPTIONAL {
        FILTER(ISBLANK(?o2))
        ?o2 ?p3 ?o3 .
      }
    }
  }
  UNION {
  }
}

jamiefeiss commented 10 months ago

Testing just the inner SELECT query now, the main issue seems to be that it searches across all triples. The weighted regex is also significantly faster (0.035s vs 29.189s in Fuseki) if we implement something similar to the "skosWeighted" search method (https://github.com/RDFLib/prez/blob/main/prez/reference_data/search_methods/search_skos_weighted.ttl). See below:

SELECT ?search_result_uri ?predicate ?match (SUM(?w) AS ?weight) ?hashID
WHERE {
    ?search_result_uri ?predicate ?match .
    ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
    ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .

    BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match))))) AS ?hashID)

  {
    ?search_result_uri ?predicate ?match .
    BIND (50 AS ?w)
    FILTER (REGEX(?match, "^open$", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (20 AS ?w)
    FILTER (REGEX(?match, "^open", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (10 AS ?w)
    FILTER (REGEX(?match, "open", "i"))
  }
} GROUP BY ?search_result_uri ?predicate ?match ?hashID ORDER BY DESC(?weight) LIMIT 10

Since we'll probably only be searching across labels & descriptions, and returning objects that have endpoints in Prez, we could restrict the predicates that are matched and the base classes of the results to further optimise the query.
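
A minimal sketch of that restriction, not a tested implementation. The predicate list and the class list are assumptions chosen for illustration; the actual label/description predicates and the classes that have endpoints in Prez would need to be substituted. The class check uses FILTER EXISTS so that a resource with more than one listed class does not have its weight counted twice.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?search_result_uri ?predicate ?match (SUM(?w) AS ?weight)
WHERE {
  # Assumption: the label and description predicates worth searching.
  VALUES ?predicate {
    skos:prefLabel rdfs:label dcterms:title
    skos:definition dcterms:description
  }
  ?search_result_uri ?predicate ?match .

  # Assumption: an illustrative list of base classes with endpoints in Prez.
  FILTER EXISTS {
    VALUES ?class { skos:Concept skos:ConceptScheme }
    ?search_result_uri a ?class .
  }

  # Same weighted branches as the query above; each branch repeats the
  # triple pattern so that ?match is bound where its FILTER is evaluated.
  {
    ?search_result_uri ?predicate ?match .
    BIND (50 AS ?w)
    FILTER (REGEX(?match, "^open$", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (20 AS ?w)
    FILTER (REGEX(?match, "^open", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (10 AS ?w)
    FILTER (REGEX(?match, "open", "i"))
  }
}
GROUP BY ?search_result_uri ?predicate ?match
ORDER BY DESC(?weight)
LIMIT 10
```

Whether the VALUES restriction actually speeds things up will depend on how the triplestore orders the join, so it would need timing against the IDN data.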

recalcitrantsupplant commented 10 months ago

Looks like it's the query structure. Let's see if we can add the CONSTRUCT back in to your performant REGEX query above.

For context, the FTS query is below.

http://idn-fuseki-lb-155137521.ap-southeast-2.elb.amazonaws.com:3030/#/dataset/idn/query?query=PREFIX%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0APREFIX%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0APREFIX%20ex%3A%20%3Chttp%3A%2F%2Fwww.example.org%2Fresources%23%3E%0APREFIX%20text%3A%20%3Chttp%3A%2F%2Fjena.apache.org%2Ftext%23%3E%0APREFIX%20sdo%3A%20%3Chttps%3A%2F%2Fschema.org%2F%3E%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0A%0ASELECT%20%3FMatchURI%20%28COALESCE%28%3Fprop_label%2C%20%3FMatchProp%29%20AS%20%3FMatchProperty%29%20%3FMatchTerm%20%3FSearchTerm%0A%7B%0A%20%20VALUES%20%3FSearchTerm%20%7B%22%2Aopen%2A%22%0A%20%20%7D%0A%20%20%28%3FMatchURI%20%3FWeight%20%3FMatchTerm%20%3Fgraph%20%3FMatchProp%29%20text%3Aquery%20%28%20ex%3ANameProps%20%3FSearchTerm%29%20.%0A%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3FMatchURI%20skos%3AprefLabel%7Crdfs%3Alabel%7Cdcterms%3Atitle%7Csdo%3Aname%20%3Fmatch_label%20.%0A%20%20%7D%0A%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3FMatchProp%20skos%3AprefLabel%7Crdfs%3Alabel%7Cdcterms%3Atitle%7Csdo%3Aname%20%3Fprop_label%20.%0A%20%20%7D%0A%7D

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX ex: <http://www.example.org/resources#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX sdo: <https://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?MatchURI (COALESCE(?prop_label, ?MatchProp) AS ?MatchProperty) ?MatchTerm ?SearchTerm
{
  VALUES ?SearchTerm { "*open*" }
  (?MatchURI ?Weight ?MatchTerm ?graph ?MatchProp) text:query ( ex:NameProps ?SearchTerm) .

  OPTIONAL {
    ?MatchURI skos:prefLabel|rdfs:label|dcterms:title|sdo:name ?match_label .
  }

  OPTIONAL {
    ?MatchProp skos:prefLabel|rdfs:label|dcterms:title|sdo:name ?prop_label .
  }
}

recalcitrantsupplant commented 10 months ago

How does this look?

PREFIX prez: <https://prez.dev/>
CONSTRUCT {
  ?hashID a prez:SearchResult ;
    prez:searchResultWeight ?w ;
    prez:searchResultPredicate ?predicate ;
    prez:searchResultMatch ?match ;
    prez:searchResultURI ?search_result_uri . 
  ?search_result_uri ?p ?o1 .
  ?o1 ?p2 ?o2 .
  ?o2 ?p3 ?o3 .     
}
WHERE {
  {
    SELECT ?search_result_uri ?predicate ?match ?w ?hashID
    WHERE {
      ?search_result_uri ?predicate ?match .
      ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
      ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .
      BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match))))) AS ?hashID)
      {
        ?search_result_uri ?predicate ?match .
        BIND (50 AS ?w)
        FILTER (REGEX(?match, "^open$", "i"))
      } UNION {
        ?search_result_uri ?predicate ?match .
        BIND (20 AS ?w)
        FILTER (REGEX(?match, "^open", "i"))
      } UNION {
        ?search_result_uri ?predicate ?match .
        BIND (10 AS ?w)
        FILTER (REGEX(?match, "open", "i"))
      }
    }
    GROUP BY ?search_result_uri ?predicate ?match ?hashID ?w 
    LIMIT 10
  }
  ?search_result_uri ?p ?o1 .
  OPTIONAL {
    FILTER(ISBLANK(?o1))
    ?o1 ?p2 ?o2 .
    OPTIONAL {
      FILTER(ISBLANK(?o2))
      ?o2 ?p3 ?o3 .
    }
  }  
}

jamiefeiss commented 10 months ago

Looks good, nice and fast at about 0.035s.

Not aggregating just means you'll get duplicate results when a result satisfies multiple match patterns.

What do you think of restricting the matched predicate to labels & descriptions? Description matching could be worth less too. Also what do you think of restricting the base class to classes Prez supports?
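
One possible shape for the inner SELECT that combines both ideas, as a sketch only. The predicates and the factor values (1.0 for labels, 0.5 for descriptions) are assumptions for illustration. Summing the weights per result collapses the duplicates mentioned above into a single row.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>

SELECT ?search_result_uri ?predicate ?match (SUM(?w * ?factor) AS ?weight)
WHERE {
  # Assumption: illustrative predicates and factors; descriptions count for half.
  VALUES (?predicate ?factor) {
    (skos:prefLabel      1.0)
    (rdfs:label          1.0)
    (skos:definition     0.5)
    (dcterms:description 0.5)
  }
  ?search_result_uri ?predicate ?match .

  # One row per satisfied pattern; the SUM above merges them per result.
  {
    ?search_result_uri ?predicate ?match .
    BIND (50 AS ?w)
    FILTER (REGEX(?match, "^open$", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (20 AS ?w)
    FILTER (REGEX(?match, "^open", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (10 AS ?w)
    FILTER (REGEX(?match, "open", "i"))
  }
}
GROUP BY ?search_result_uri ?predicate ?match
ORDER BY DESC(?weight)
LIMIT 10
```

With these assumed factors, an exact match on a label satisfies all three patterns and scores 80 (50 + 20 + 10), while the same match on a description scores 40.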

recalcitrantsupplant commented 10 months ago

> What do you think of restricting the matched predicate to labels & descriptions?

This would be a closed profile with no properties defined. You'll then get labels/descriptions when the annotations are added. Profile changes are coming soon.

> Description matching could be worth less too.

Sounds good - any issue adding LCASE back in too for "exact" match?

      {
        ?search_result_uri ?predicate ?match .
        BIND (100 AS ?w)
        FILTER (LCASE(?match) = "open")
      } 
      UNION
...

> Also what do you think of restricting the base class to classes Prez supports?

Ideally, I think Prez could display whatever information is available about whatever object is found, perhaps on a generic page if there isn't a suitable endpoint.

recalcitrantsupplant commented 10 months ago

David to:

recalcitrantsupplant commented 9 months ago

Resolved in #149