KorAP / Krill

:mag: A Corpus Data Retrieval Index using Lucene for Look-Ups
BSD 2-Clause "Simplified" License
16 stars 3 forks source link

access-rewrite-disabled slows page retrieval down by factor ~6 #75

Closed kupietz closed 3 years ago

kupietz commented 3 years ago
#!/bin/bash

function testQuery() {
    echo "access-rewrite-disabled=$1"
    curl -w 'total time: %{time_total}\n' -o /dev/null -s "https://korap.ids-mannheim.de/api/v1.0/search?access-rewrite-disabled=$1&count=50&q=Sockenpuppe&cq=corpusSigle%3D%2FW.D.%2A%2F&ql=poliqarp&fields=corpusSigle,textSigle,pubDate,pubPlace,availability,textClass&offset=0&cutoff=true"
}

testQuery true
testQuery false

Prints something like:

access-rewrite-disabled=true
total time: 3.211718
access-rewrite-disabled=false
total time: 0.431929
Akron commented 3 years ago

I expect that's the weird metric I thought was wrong. I guess it's a bug in Krill - could you give me a hint on how the request looks like, @margaretha ?

Akron commented 3 years ago

Direct benchmark shows a performance improvement - as expected: https://korap.ids-mannheim.de/gerrit/c/KorAP/Krill/+/4540 , so I guess there is a problem with the caller.

kupietz commented 3 years ago

Direct benchmark shows a performance improvement - as expected: https://korap.ids-mannheim.de/gerrit/c/KorAP/Krill/+/4540 , so I guess there is a problem with the caller.

The caller in Kustvakt?

Akron commented 3 years ago

We were able to golf it down to the VC - it's not really the function that is slower, it's the difference in the query. The scope including the rewrite is way smaller. But we have no real idea, why it is so slow, as the cutoff still limits the search to a few hits. Maybe it's the regex in the dictionary that's so slow ....

kupietz commented 3 years ago

I'd give this a high priority. Even if this was only factor 1.6 it'd be worth a couple of k€ in hardware.

An interim solution, however, could be to turn off access-rewrite-disabled by default in the client libraries, at least if the user has a token.

Akron commented 3 years ago

But when do you set this option? Maybe there is a misunderstanding what this option does? Maybe there is a more appropriate option to use then disabling the access rewrite?

Akron commented 3 years ago

However - I agree we should find out the origin of the slow down. I guess it's either the dictionary lookup or the way we build the VCs.

margaretha commented 3 years ago

But when do you set this option? Maybe there is a misunderstanding what this option does? Maybe there is a more appropriate option to use then disabling the access rewrite?

I think it is most useful when users do not have tokens, so they can get the metadata instead.

An interim solution, however, could be to turn off access-rewrite-disabled by default in the client libraries, at least if the user has a token.

This solution makes much sense.

Akron commented 3 years ago

But it alters the scope of the query. It is no "ignore the metadata" option. And in case the user has a token, the scope would be identical, right? Can you benchmark this? I would expect it would have a similar timing.

kupietz commented 3 years ago

We introduced this option for the R client, I think. I also suspect the pre reduced scope without token is the source of the speed up. I thought I had a token when I discovered this, but in fact I had not, due to a bug in the R library.

If it is like this we can't circumvent the problem in the client and it might be not so easy to fix.

Akron commented 3 years ago

Eliza expanded the query based on the assumption there was no token involved, that's what we tested.

There were two features introduced here: The option to bypass access rewrite and the option to not fetch snippets. Not fetching snippets is always faster, but bypassing access rewrite can mean that the search scope is larger - which is in principal slower, but shouldn't be that much slower due to the cutoff. I think there are very specific use cases for these options, but I don't think that bypassing the access rewrite is a good default ...

Akron commented 3 years ago

After further investigation (thanks @margaretha !), we can assume that the problem is in the way the VC is created before iterating through the result. There are two code changes helpful here:

  1. Merge the regex queries in or-groups prior to VC building (then Lucene takes care of the query building).
  2. Make the VC iterable.

Regarding 2: The truth is - I thought it is, but I guess I was wrong. I'll have to investigate further in how to accomplish that.

kupietz commented 3 years ago

Sorry, with a valid access token the gain shrinks to 0. I wrongly assumed, I had one, because of a bug in RKorAPClient (here: https://github.com/KorAP/RKorAPClient/commit/01c2477e36af2a9b91fd6092606ff5ccdb23bba2 ).