RDFLib / prez

Prez is a data-configurable Linked Data API framework that delivers profiles of Knowledge Graph data according to the Content Negotiation by Profile standard.

Search MVP default search method with IDN data #153

Open jamiefeiss opened 1 year ago

jamiefeiss commented 1 year ago

Testing the "default" regex search method takes over 30s against the IDN triplestore for the following query:

http://localhost:8000/search?term=open&method=default&limit=10&focus-to-filter[rdf:type]=http%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23Concept&focus-to-filter[skos:inScheme]=https%3A%2F%2Flinked.data.gov.au%2Fdef%2Fdata-access-rights
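For readability, the request above can be decoded into its individual parameters. This is a minimal sketch using only the Python standard library; it is a decoding aid and makes no claim about how Prez itself parses the request. The two `focus-to-filter[...]` values are the ones that appear as the `rdf:type` and `skos:inScheme` triple patterns in the query below.

```python
from urllib.parse import parse_qs, urlsplit

url = (
    "http://localhost:8000/search?term=open&method=default&limit=10"
    "&focus-to-filter[rdf:type]=http%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23Concept"
    "&focus-to-filter[skos:inScheme]=https%3A%2F%2Flinked.data.gov.au%2Fdef%2Fdata-access-rights"
)

# parse_qs percent-decodes each value and returns a list per key;
# every key occurs once here, so keep only the first value.
params = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}

for key, value in params.items():
    print(f"{key} = {value}")
```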

PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX prez: <https://prez.dev/>
CONSTRUCT {
  ?hashID a prez:SearchResult ;
    prez:searchResultWeight ?weight ;
    prez:searchResultPredicate ?predicate ;
    prez:searchResultMatch ?match ;
    prez:searchResultURI ?search_result_uri .
  ?search_result_uri ?p ?o1 .
  ?o1 ?p2 ?o2 .
  ?o2 ?p3 ?o3 .
}
WHERE {
  {
    SELECT ?search_result_uri ?predicate ?match ?weight ?hashID
    WHERE {
      ?search_result_uri ?predicate ?match .
      ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
      ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .

      FILTER (
        LCASE(?match) = "open" ||
        REGEX(?match, "^open", "i") ||
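        # "\\b" is the regex word boundary; a bare "\b" in a SPARQL string literal is a backspace character
        REGEX(?match, "\\bopen\\b", "i") ||
        REGEX(?match, "open", "i")
      )
      BIND(
        IF(LCASE(?match) = "open", 10,
          IF(REGEX(?match, "^open", "i"), 7,
            IF(REGEX(?match, "\\bopen\\b", "i"), 5,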
              IF(REGEX(?match, "open", "i"), 3, 0)
            )
          )
        ) AS ?weight
      )
      BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match), STR(?weight))))) AS ?hashID)
    }
    LIMIT 10
  }
  {
    ?search_result_uri ?p ?o1 .
    OPTIONAL {
      FILTER(ISBLANK(?o1))
      ?o1 ?p2 ?o2 .
      OPTIONAL {
        FILTER(ISBLANK(?o2))
        ?o2 ?p3 ?o3 .
      }
    }
  }
  UNION {
  }
}

jamiefeiss commented 1 year ago

Testing just the inner SELECT query now, the main issue seems to be that the query searches across all triples. The weighted regex is also significantly faster (0.035s vs 29.189s in Fuseki) if we implement something similar to the "skosWeighted" search method - https://github.com/RDFLib/prez/blob/main/prez/reference_data/search_methods/search_skos_weighted.ttl . See below:

SELECT ?search_result_uri ?predicate ?match (SUM(?w) AS ?weight) ?hashID
WHERE {
    ?search_result_uri ?predicate ?match .
    ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
    ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .

    BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match))))) AS ?hashID)

  {
    ?search_result_uri ?predicate ?match .
    BIND (50 AS ?w)
    FILTER (REGEX(?match, "^open$", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (20 AS ?w)
    FILTER (REGEX(?match, "^open", "i"))
  } UNION {
    ?search_result_uri ?predicate ?match .
    BIND (10 AS ?w)
    FILTER (REGEX(?match, "open", "i"))
  }
} GROUP BY ?search_result_uri ?predicate ?match ?hashID ORDER BY DESC(?weight) LIMIT 10

Since we'll probably only be searching across labels & descriptions, and returning objects that have endpoints in Prez, we could restrict the predicates that are matched and the base classes of the results to further optimise the query.

recalcitrantsupplant commented 1 year ago

Looks like it's the query structure. Let's see if we can add the CONSTRUCT back in to your performant REGEX query above.

For context, the FTS (full-text search) query is below.

http://idn-fuseki-lb-155137521.ap-southeast-2.elb.amazonaws.com:3030/#/dataset/idn/query?query=PREFIX%20skos%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2004%2F02%2Fskos%2Fcore%23%3E%0APREFIX%20dcterms%3A%20%3Chttp%3A%2F%2Fpurl.org%2Fdc%2Fterms%2F%3E%0APREFIX%20ex%3A%20%3Chttp%3A%2F%2Fwww.example.org%2Fresources%23%3E%0APREFIX%20text%3A%20%3Chttp%3A%2F%2Fjena.apache.org%2Ftext%23%3E%0APREFIX%20sdo%3A%20%3Chttps%3A%2F%2Fschema.org%2F%3E%0APREFIX%20rdfs%3A%20%3Chttp%3A%2F%2Fwww.w3.org%2F2000%2F01%2Frdf-schema%23%3E%0A%0ASELECT%20%3FMatchURI%20%28COALESCE%28%3Fprop_label%2C%20%3FMatchProp%29%20AS%20%3FMatchProperty%29%20%3FMatchTerm%20%3FSearchTerm%0A%7B%0A%20%20VALUES%20%3FSearchTerm%20%7B%22%2Aopen%2A%22%0A%20%20%7D%0A%20%20%28%3FMatchURI%20%3FWeight%20%3FMatchTerm%20%3Fgraph%20%3FMatchProp%29%20text%3Aquery%20%28%20ex%3ANameProps%20%3FSearchTerm%29%20.%0A%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3FMatchURI%20skos%3AprefLabel%7Crdfs%3Alabel%7Cdcterms%3Atitle%7Csdo%3Aname%20%3Fmatch_label%20.%0A%20%20%7D%0A%0A%20%20OPTIONAL%20%7B%0A%20%20%20%20%3FMatchProp%20skos%3AprefLabel%7Crdfs%3Alabel%7Cdcterms%3Atitle%7Csdo%3Aname%20%3Fprop_label%20.%0A%20%20%7D%0A%7D

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX ex: <http://www.example.org/resources#>
PREFIX text: <http://jena.apache.org/text#>
PREFIX sdo: <https://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT ?MatchURI (COALESCE(?prop_label, ?MatchProp) AS ?MatchProperty) ?MatchTerm ?SearchTerm
{
  VALUES ?SearchTerm { "*open*" }
  (?MatchURI ?Weight ?MatchTerm ?graph ?MatchProp) text:query ( ex:NameProps ?SearchTerm) .

  OPTIONAL {
    ?MatchURI skos:prefLabel|rdfs:label|dcterms:title|sdo:name ?match_label .
  }

  OPTIONAL {
    ?MatchProp skos:prefLabel|rdfs:label|dcterms:title|sdo:name ?prop_label .
  }
}

recalcitrantsupplant commented 1 year ago

How does this look?

- Uses the UNION structure from Jamie's query above to improve performance.
- Adds the properties/blank nodes for objects back in.
- Provides different matches rather than aggregating as per original query. I'm not too fussed either way - @hjohns @jamiefeiss any opinions on adding weights vs providing multiple search results, one per weight?

PREFIX prez: <https://prez.dev/>
CONSTRUCT {
  ?hashID a prez:SearchResult ;
    prez:searchResultWeight ?w ;
    prez:searchResultPredicate ?predicate ;
    prez:searchResultMatch ?match ;
    prez:searchResultURI ?search_result_uri . 
  ?search_result_uri ?p ?o1 .
  ?o1 ?p2 ?o2 .
  ?o2 ?p3 ?o3 .     
}
WHERE {
  {
    SELECT ?search_result_uri ?predicate ?match ?w ?hashID
    WHERE {
      ?search_result_uri ?predicate ?match .
      ?search_result_uri <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> <http://www.w3.org/2004/02/skos/core#Concept> .
      ?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> <https://linked.data.gov.au/def/data-access-rights> .
      BIND(URI(CONCAT("urn:hash:", SHA256(CONCAT(STR(?search_result_uri), STR(?predicate), STR(?match))))) AS ?hashID)
      {
        ?search_result_uri ?predicate ?match .
        BIND (50 AS ?w)
        FILTER (REGEX(?match, "^open$", "i"))
      } UNION {
        ?search_result_uri ?predicate ?match .
        BIND (20 AS ?w)
        FILTER (REGEX(?match, "^open", "i"))
      } UNION {
        ?search_result_uri ?predicate ?match .
        BIND (10 AS ?w)
        FILTER (REGEX(?match, "open", "i"))
      }
    }
    GROUP BY ?search_result_uri ?predicate ?match ?hashID ?w 
    LIMIT 10
  }
  ?search_result_uri ?p ?o1 .
  OPTIONAL {
    FILTER(ISBLANK(?o1))
    ?o1 ?p2 ?o2 .
    OPTIONAL {
      FILTER(ISBLANK(?o2))
      ?o2 ?p3 ?o3 .
    }
  }  
}

jamiefeiss commented 1 year ago

Looks good, nice and fast at about 0.035s.

Not aggregating just means you'll get duplicate results in the case where a result satisfies multiple matches.
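To make the difference concrete, here is a small Python sketch that mimics the three UNION branches with the same patterns and weights as the query above. It only illustrates the scoring logic outside the triplestore; the function name and the sample literals are made up for the example.

```python
import re

# Patterns and weights copied from the UNION branches in the query above.
BRANCHES = [
    (50, re.compile(r"^open$", re.IGNORECASE)),  # exact match
    (20, re.compile(r"^open", re.IGNORECASE)),   # starts with the term
    (10, re.compile(r"open", re.IGNORECASE)),    # contains the term
]


def branch_weights(literal: str) -> list[int]:
    """Return one weight per branch whose pattern matches, like the UNION does."""
    return [weight for weight, pattern in BRANCHES if pattern.search(literal)]


# Hypothetical literals, chosen to hit three, two and one branch respectively.
for literal in ["Open", "Open access", "Reopened"]:
    weights = branch_weights(literal)
    print(f"{literal!r}: separate rows {weights}, aggregated {sum(weights)}")
```

A literal that is exactly "Open" matches all three branches, so without aggregation it comes back three times (weights 50, 20 and 10), whereas SUM(?w) with GROUP BY collapses it into a single row with weight 80.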

What do you think of restricting the matched predicate to labels & descriptions? Description matching could be worth less too. Also what do you think of restricting the base class to classes Prez supports?

recalcitrantsupplant commented 1 year ago

> What do you think of restricting the matched predicate to labels & descriptions?

This would be a closed profile with no properties defined. You'll then get labels/descriptions when the annotations are added. Profile changes are coming soon.

> Description matching could be worth less too.

Sounds good - any issue adding LCASE back in too for "exact" match?

      {
        ?search_result_uri ?predicate ?match .
        BIND (100 AS ?w)
        FILTER (LCASE(?match) = "open")
      } 
      UNION
...

> Also what do you think of restricting the base class to classes Prez supports?

Ideally, I think Prez could display whatever information about whatever object is found, perhaps on a generic page if there isn't a suitable endpoint.

recalcitrantsupplant commented 1 year ago

David to:

- Allow a list of filter values in the API and treat these as a VALUES clause, e.g. for search across multiple vocabularies.
- Add predicates to the query - REGEX performance is bad even on smaller datasets when there isn't some filtering.
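As a rough illustration of the two items above, the sketch below renders SPARQL VALUES blocks from Python lists and splices them into a triple pattern. It assumes nothing about the actual Prez implementation: `values_clause` is a hypothetical helper, the second vocabulary IRI is a placeholder, and the choice of skos:prefLabel and skos:definition as search predicates is only an example.

```python
def values_clause(variable: str, iris: list[str]) -> str:
    """Render a SPARQL VALUES block binding ?variable to each IRI in turn."""
    if not iris:
        raise ValueError("at least one IRI is required")
    for iri in iris:
        # Reject characters that may not appear inside a SPARQL IRIREF (<...>).
        if any(ch in '<>"{}|^`\\' or ch <= " " for ch in iri):
            raise ValueError(f"invalid IRI: {iri!r}")
    rows = " ".join(f"<{iri}>" for iri in iris)
    return f"VALUES ?{variable} {{ {rows} }}"


schemes = values_clause("scheme", [
    "https://linked.data.gov.au/def/data-access-rights",
    "https://example.org/def/another-vocab",  # hypothetical second vocabulary
])
predicates = values_clause("predicate", [
    "http://www.w3.org/2004/02/skos/core#prefLabel",
    "http://www.w3.org/2004/02/skos/core#definition",
])

# Both restrictions are applied before the REGEX filters would run.
pattern = f"""{schemes}
{predicates}
?search_result_uri <http://www.w3.org/2004/02/skos/core#inScheme> ?scheme ;
    ?predicate ?match ."""
print(pattern)
```

Validating each IRI before interpolation keeps a filter value supplied through the API from breaking out of the `<...>` delimiters in the generated query.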

recalcitrantsupplant commented 1 year ago

Resolved in #149