Querying by CTID via regex is slow

siuc-nate commented 4 years ago

I am seeing query times between 10 and 15 seconds for a simple query like this (the current regex being output by the search):

PREFIX credreg: <https://credreg.net/> PREFIX ceterms: <https://purl.org/ctdl/terms/> SELECT ?totalResults ?searchResultPayload WHERE { { SELECT (COUNT(DISTINCT ?id) AS ?totalResults) WHERE { ?id ceterms:ctid ?anyValue . {  { ?id ( ceterms:ctid ) ?485010f7299640dcb4918d2a6a8afcec . FILTER ( regex(?485010f7299640dcb4918d2a6a8afcec, '(.*ce-ff2ea499-7559-4ce0-a573-701dd1849076.*)|(.*ce-ff2ea499-7559-4ce0-a573-701dd184?9?0?7?6?.*)', 'i') ) . } } } } UNION { SELECT DISTINCT ?searchResultPayload WHERE { ?id ceterms:ctid ?anyValue . {  { ?id ( ceterms:ctid ) ?485010f7299640dcb4918d2a6a8afcec . FILTER ( regex(?485010f7299640dcb4918d2a6a8afcec, '(.*ce-ff2ea499-7559-4ce0-a573-701dd1849076.*)|(.*ce-ff2ea499-7559-4ce0-a573-701dd184?9?0?7?6?.*)', 'i') ) . } } ?id credreg:__payload ?searchResultPayload . } ORDER BY DESC(?id) OFFSET 0 LIMIT 1 } }

If I simplify the regex quite a bit, it gets down to around 9 seconds:

PREFIX credreg: <https://credreg.net/> PREFIX ceterms: <https://purl.org/ctdl/terms/> SELECT ?totalResults ?searchResultPayload WHERE { { SELECT (COUNT(DISTINCT ?id) AS ?totalResults) WHERE { ?id ceterms:ctid ?anyValue . {  { ?id ( ceterms:ctid ) ?485010f7299640dcb4918d2a6a8afcec . FILTER ( regex(?485010f7299640dcb4918d2a6a8afcec, 'ce-ff2ea499-7559-4ce0-a573-701dd1849076', 'i') ) . } } } } UNION { SELECT DISTINCT ?searchResultPayload WHERE { ?id ceterms:ctid ?anyValue . {  { ?id ( ceterms:ctid ) ?485010f7299640dcb4918d2a6a8afcec . FILTER ( regex(?485010f7299640dcb4918d2a6a8afcec, 'ce-ff2ea499-7559-4ce0-a573-701dd1849076', 'i') ) . } } ?id credreg:__payload ?searchResultPayload . } ORDER BY DESC(?id) OFFSET 0 LIMIT 1 } }

And if I remove the regex all together, it's very fast (<1 second):

PREFIX credreg: <https://credreg.net/> PREFIX ceterms: <https://purl.org/ctdl/terms/> SELECT ?totalResults ?searchResultPayload WHERE { { SELECT (COUNT(DISTINCT ?id) AS ?totalResults) WHERE { ?id ceterms:ctid ?anyValue . {  { ?id ( ceterms:ctid ) 'ce-ff2ea499-7559-4ce0-a573-701dd1849076' . } } } } UNION { SELECT DISTINCT ?searchResultPayload WHERE { ?id ceterms:ctid ?anyValue . {  { ?id ( ceterms:ctid ) 'ce-ff2ea499-7559-4ce0-a573-701dd1849076' . } } ?id credreg:__payload ?searchResultPayload . } ORDER BY DESC(?id) OFFSET 0 LIMIT 1 } }

However, I need a fairly deep regex in order to make sure that string matching has enough "fuzziness" to it to find reasonable result sets. I'm not sure there's a solution to this (particularly since CTID is already a literal, so we don't have the langString overhead) but I figured I'd raise it here to see if there are any ideas.

excelsior commented 4 years ago

I couldn't find any way of speeding such queries up so far. But this is just an illustration, isn't it? I mean you don't actually filter CTIDs that way, but regular expressions are used as the means of performing full-text searches, right? If the latter is true, Neptune integrated Elasticsearch into its latest engine version, so we can research these new capabilities for solving this problem.

siuc-nate commented 4 years ago

I am currently using regex to query both langString and string fields. It's unlikely that CTID would be used this way in a real world scenario, but it demonstrates the problem of "this string field takes an unusually long time to query with any kind of fuzziness afforded by regex". The same would apply to any other string field.

Looking at the example queries, it seems that the service would be dependent on some existing elasticsearch data store rather than something actually baked into the SPARQL engine itself. It seems like a lot of additional overhead, but maybe it would be worth looking into down the road.

excelsior commented 4 years ago

Yes, the integration is between Neptune and an AWS-hosted ES service in a sense that it allows running federated queries in order to perform full-text searches. I haven't dug too deep yet as to how the ES store need to be set up to achieve that.

Question: Do equivalent queries perform noticeably better in Gremlin?

siuc-nate commented 4 years ago

That depends mostly on your definition of "equivalent", since the gremlin query brings back the count and the results in a single query. It runs in about 3 seconds. Running the equivalent (in the sense of what it brings back) SPARQL query takes 10-11 seconds. However, if I cut the SPARQL query down to just getting the results, it takes about 3 seconds as well. It would be really nice if there were a way to avoid having to run the same query an extra time just to get a total results count. I could try running those as separate queries asynchronously and simultaneously to see if that gets the total time down lower I suppose.

siuc-nate commented 4 years ago

I tried updating our code to query for the total results and the payload data as separate HTTP requests that run simultaneously - this cut the result time down to about 9 seconds for the CTID in the original post. A small improvement, but I'll take it. It may help other, more complex queries return results a little bit faster as well.

edgarf commented 4 years ago

@siuc-nate does it mean we can close this issue for now?

siuc-nate commented 4 years ago

Closing this, since we're pretty much stuck with regex when it comes to handling text searches of the sort that we need to handle.

CredentialEngine / CredentialRegistry

Querying by CTID via regex is slow #309