ad-freiburg / qlever

Very fast SPARQL Engine, which can handle very large knowledge graphs like the complete Wikidata, offers context-sensitive autocompletion for SPARQL queries, and allows combination with text search. It's faster than engines like Blazegraph or Virtuoso, especially for queries involving large result sets.
Apache License 2.0

High-frequency queries from https://wikidocumentaries-demo.wmcloud.org/ #1548

Open hannahbast opened 1 week ago

hannahbast commented 1 week ago

@tuukka For some time now, we have been receiving a very high volume of queries (ten or more queries per second, around the clock) from https://wikidocumentaries-demo.wmcloud.org. This looks like either disrespectful crawlers or bots, or a script gone astray. Can you please check?

Also, are you using any caching mechanism to avoid issuing too many queries?

tuukka commented 5 days ago

I haven't made any changes lately, but now that I check, Google seems to have suddenly resumed crawling the site (using the user agent GoogleOther) at a rate of some 100K pages per day, i.e. about one per second.

There's one quick way to reduce the number of requests I send in your direction: I've now disabled the retry logic (exponential backoff) that I had for the error responses 400 and 429 (non-deterministic out-of-memory errors etc.). Does that help?
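
For illustration, a minimal sketch of the kind of retry wrapper described above (now disabled), assuming a browser-style fetch; the function name, retry count, and delays are assumptions rather than the actual Wikidocumentaries code:

```typescript
// Sketch of a fetch wrapper that retries with exponential backoff on 400/429.
// Hypothetical: function name, retry count, and base delay are assumptions.
async function fetchWithBackoff(
  url: string,
  maxRetries = 3,
  baseDelayMs = 1000
): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const response = await fetch(url);
    // Only 400 and 429 are treated as retryable (matching the comment above).
    const retryable = response.status === 400 || response.status === 429;
    if (!retryable || attempt >= maxRetries) {
      return response;
    }
    // Exponential backoff: wait 1s, 2s, 4s, ... before the next attempt.
    const delayMs = baseDelayMs * 2 ** attempt;
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}
```

Disabling the retries means each failing query triggers at most one request instead of up to maxRetries + 1.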

I'm not using any caching mechanism, and I don't think it would help: each crawled page refers to a different Wikidata item, so the SPARQL query also differs in the Wikidata item it queries.

I'm currently sending as many queries as there are facets in the UI (currently three). I don't know whether joining these queries into one would work or whether it would cause more out-of-memory errors.
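
To make the request pattern concrete, here is a hypothetical sketch of one SPARQL request per facet, all parameterized by the same Wikidata item; the endpoint URL, facet names, and query templates are assumptions, not the actual Wikidocumentaries queries:

```typescript
// Hypothetical sketch: one SPARQL request per UI facet, all parameterized by
// the same Wikidata item QID. Endpoint, facet names, and queries are assumptions.
const SPARQL_ENDPOINT = "https://qlever.cs.uni-freiburg.de/api/wikidata"; // assumed

const FACET_QUERIES: Record<string, (qid: string) => string> = {
  // Each builder returns a complete SPARQL query for one facet; the real
  // queries differ, this only shows the shape.
  instanceOf: (qid) => `
    PREFIX wd: <http://www.wikidata.org/entity/>
    PREFIX wdt: <http://www.wikidata.org/prop/direct/>
    SELECT ?value (COUNT(?img) AS ?count) WHERE {
      ?img wdt:P180 wd:${qid} .
      ?img wdt:P31 ?value .
    } GROUP BY ?value`,
  // ...two more facet builders in the real UI
};

async function loadFacets(qid: string) {
  // One HTTP request per facet, issued in parallel.
  const entries = await Promise.all(
    Object.entries(FACET_QUERIES).map(async ([facet, buildQuery]) => {
      const url = `${SPARQL_ENDPOINT}?query=${encodeURIComponent(buildQuery(qid))}`;
      const res = await fetch(url, {
        headers: { Accept: "application/sparql-results+json" },
      });
      return [facet, await res.json()] as const;
    })
  );
  return Object.fromEntries(entries);
}
```

Because the QID changes on every crawled page, a response cache keyed by the full query string would almost never hit, which is why caching is unlikely to help here.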

hannahbast commented 3 days ago

@tuukka Thank you for your reply! It's now back to one query every 1-2 seconds, which is reasonable.

But I am curious: can you tell from your logs how many queries per day come from actual users and how many come from bots?

tuukka commented 2 days ago

> But I am curious: can you tell from your logs how many queries per day come from actual users and how many come from bots?

I don't see the requests that the clients make directly to QLever, so I have to use page loads as an approximation.

I ran a simple analysis of yesterday's page-load logs: 52,162 loads (55.47%) by GoogleOther, 38,120 loads (40.55%) by other bots, and 3,746 loads (3.98%) by actual users.

Regarding actual users, I had a look at the referer data: 311 loads (8.30%) came directly from the wikis (mainly Commons), 960 loads (25.63%) had no referer data, and 2,441 loads (65.16%) came from navigation within the service.
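
For reference, a rough sketch of how such a breakdown could be computed from an access log, assuming a combined-log-style format where the referer and user agent are the last two quoted fields; the file name, bot patterns, and referer rules are assumptions:

```typescript
// Hypothetical sketch: bucket page loads by user agent, then bucket the
// "actual user" loads by referer. Log format and patterns are assumptions.
import { readFileSync } from "node:fs";

type AgentBucket = "GoogleOther" | "other bot" | "actual user";

function classifyAgent(userAgent: string): AgentBucket {
  if (userAgent.includes("GoogleOther")) return "GoogleOther";
  if (/bot|crawl|spider/i.test(userAgent)) return "other bot";
  return "actual user";
}

function classifyReferer(referer: string): string {
  if (referer === "" || referer === "-") return "no referer";
  if (referer.includes("wikidocumentaries-demo.wmcloud.org"))
    return "navigation within the service";
  return "from the wikis"; // mainly commons.wikimedia.org in practice
}

const agentCounts: Record<AgentBucket, number> = {
  GoogleOther: 0,
  "other bot": 0,
  "actual user": 0,
};
const refererCounts: Record<string, number> = {};

for (const line of readFileSync("access.log", "utf8").split("\n")) {
  // Combined log format ends in "<referer>" "<user agent>".
  const match = line.match(/"([^"]*)"\s+"([^"]*)"\s*$/);
  if (!match) continue;
  const [, referer, userAgent] = match;
  const bucket = classifyAgent(userAgent);
  agentCounts[bucket] += 1;
  if (bucket === "actual user") {
    const key = classifyReferer(referer);
    refererCounts[key] = (refererCounts[key] ?? 0) + 1;
  }
}

console.log(agentCounts, refererCounts);
```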

The numbers you see should be at least three times higher (based on the number of facets). Now that I think of it, if I fetched the facets only after fetching the images, I could skip the facet queries whenever I get zero images. Further, whenever the number of images is small, I could perhaps compute the facets fully client-side.
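
A hedged sketch of that ordering, fetching the images first, skipping the facet queries entirely when there are zero images, and computing facets client-side for small result sets; the helpers, threshold, and data shapes are hypothetical:

```typescript
// Hypothetical sketch of the suggested ordering. All helpers, types, and the
// threshold below are placeholders, not the actual Wikidocumentaries code.
interface Image {
  id: string;
  facetValues: Record<string, string>; // facet name -> value for this image
}

const SMALL_RESULT_THRESHOLD = 50; // assumed cutoff for client-side faceting

declare function fetchImages(qid: string): Promise<Image[]>; // one SPARQL query
declare function loadFacets(qid: string): Promise<Record<string, Record<string, number>>>;

// Count facet values locally from images that were already fetched.
function computeFacetsLocally(images: Image[]): Record<string, Record<string, number>> {
  const facets: Record<string, Record<string, number>> = {};
  for (const image of images) {
    for (const [facet, value] of Object.entries(image.facetValues)) {
      facets[facet] ??= {};
      facets[facet][value] = (facets[facet][value] ?? 0) + 1;
    }
  }
  return facets;
}

async function loadPage(qid: string) {
  const images = await fetchImages(qid);
  if (images.length === 0) {
    return { images, facets: {} }; // zero images: skip the facet queries entirely
  }
  if (images.length <= SMALL_RESULT_THRESHOLD) {
    return { images, facets: computeFacetsLocally(images) }; // facet client-side
  }
  return { images, facets: await loadFacets(qid) }; // otherwise query per facet as before
}
```

For crawled pages with no images, this would drop the per-page request count to a single query.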