I implemented a simple query split-up for a defined threshold (currently 1000). For example a query with 998 results is split into the following queries:
[[modification_date::+]] | offset=0 | limit=100
[[modification_date::+]] | offset=100 | limit=100
[[modification_date::+]] | offset=200 | limit=100
[[modification_date::+]] | offset=300 | limit=100
[[modification_date::+]] | offset=400 | limit=100
[[modification_date::+]] | offset=500 | limit=100
[[modification_date::+]] | offset=600 | limit=100
[[modification_date::+]] | offset=700 | limit=100
[[modification_date::+]] | offset=800 | limit=100
[[modification_date::+]] | offset=900 | limit=100
Is that the behavior that you are looking for?
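For reference, a minimal sketch of how such a split-up could be computed; the function name, signature and the fixed batch size are assumptions for illustration, not the actual implementation:

```python
def split_query(query, total, batch=100):
    """Split an ask query into offset/limit sub-queries.

    query: the base ask query, e.g. "[[modification_date::+]]"
    total: the expected number of results (998 in the example above)
    batch: the number of results each sub-query should return
    """
    sub_queries = []
    offset = 0
    while offset < total:
        sub_queries.append(f"{query} | offset={offset} | limit={batch}")
        offset += batch
    return sub_queries

# yields the ten sub-queries listed above
for sub_query in split_query("[[modification_date::+]]", total=998, batch=100):
    print(sub_query)
```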
That looks good. All we'd need is some batching control like https://github.com/WolfgangFahl/ProceedingsTitleParser/blob/4284bc33a29479eba6332e02c7108176425dadc1/ptp/openresearch.py#L65 has:
```python
def cacheEvents(self,em,limit=500,batch=5000):
    offset=0
    if self.profile:
        print("retrieving events for openresearch.org cache from SMW")
    while True:
        found,event=self.cacheEventBatch(em,limit=batch,offset=offset)
        if self.profile:
            em.showProgress("retrieved events %5d-%5d" % (offset+1,offset+found))
            em.showProgress(event.asJson())
        offset=offset+batch
        if found<batch or len(em.events)>=limit:
            break
    return em
```
So basically, instead of having a fixed THRESHOLD.limit, it should be possible to specify things in the call (or in the constructor, as you implemented it ...).
It would be good if we could find out the batch size automatically, but for a start it would be o.k. to specify it via the command line.
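A minimal sketch of what such command line control could look like, assuming a hypothetical argparse-based entry point; the option names and defaults are assumptions, not existing wikibackup options:

```python
from argparse import ArgumentParser

# hypothetical command line handling; option names and defaults are assumptions
parser = ArgumentParser(description="backup wiki pages selected by an ask query")
parser.add_argument("-q", "--query", default="[[modification_date::+]]",
                    help="ask query selecting the pages to back up")
parser.add_argument("--limit", type=int, default=None,
                    help="maximum total number of results to retrieve")
parser.add_argument("--batch", type=int, default=1000,
                    help="number of results to fetch per sub-query")
args = parser.parse_args()

# the batch size would then be handed to the query splitting code instead of
# a hard-coded threshold, e.g.:
# sub_queries = split_query(args.query, total=expected, batch=args.batch)
```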
https://github.com/WolfgangFahl/pyLoDStorage/blob/53582c643773242c08bf3c4ee986d22d11b0b1cb/lodstorage/sparql.py#L108 also has batch handling - there it is not so much because of a limit, but to better support a progress bar and the like.
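A small sketch of that style of batch handling with progress reporting; fetch_batch and show_progress are hypothetical callables, not part of pyLoDStorage:

```python
def fetch_in_batches(fetch_batch, batch=5000, show_progress=None):
    """Iterate over all results in batches of the given size.

    fetch_batch(offset, limit) is a hypothetical callable returning one batch
    of results; show_progress is an optional callback for progress output.
    """
    offset = 0
    while True:
        rows = fetch_batch(offset, batch)
        if show_progress and rows:
            show_progress("retrieved results %d-%d" % (offset + 1, offset + len(rows)))
        yield from rows
        if len(rows) < batch:
            break
        offset += batch
```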
Please also add a test case for the new behavior if you intend to create a pull request. Good work!
During the implementation of the tests I noticed that $smwgQMaxLimit cannot be bypassed by splitting the query into multiple sub-queries with limit and offset.
For example, the query [[modification date::+]] | offset=0 | limit=10000 returns 10000 results as specified by the limit, and the result also contains the parameter query-continue-offset=10000, indicating that the query has more results that could be retrieved. But when executing the query [[modification date::+]] | offset=10000 | limit=10000 to get the next 10000 results, an empty result is returned, because we are asking for results beyond $smwgQMaxLimit.
We can tell whether a query has more results from the presence or absence of "query-continue-offset":
For example, the result of the query [[Category:Person]] | offset=0 | limit=10000 does not contain the "query-continue-offset" parameter, indicating that its 5170 results are the complete answer to the query. The definition of offset states that:
If the limit specified with this parameter exceeds the limit set with configuration parameter $smwgQUpperbound, the offset will fallback to the limit of "0"
This means we cannot query more results than the configured limit.
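For illustration, a sketch of how the truncation could be detected via the Semantic MediaWiki action=ask API; the helper name, the example URL and the exact shape of the JSON response are assumptions based on the observations above:

```python
import requests

def ask(api_url, query):
    """Run an ask query and report whether the result set is complete.

    Returns (results, continue_offset); continue_offset is None when the
    server did not signal further results via query-continue-offset.
    """
    params = {"action": "ask", "query": query, "format": "json"}
    data = requests.get(api_url, params=params).json()
    results = data.get("query", {}).get("results", {})
    # as observed above, a truncated answer carries query-continue-offset
    continue_offset = data.get("query-continue-offset")
    return results, continue_offset

results, cont = ask("https://example.org/wiki/api.php",
                    "[[modification date::+]]|limit=10000")
if cont is not None:
    print("result is incomplete, server stopped at offset %s" % cont)
```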
The only option I see to bypass this limitation is to restrict the query results, e.g. by adding [[modification date::<2020]], and thereby break the query down into small enough date intervals, as you mentioned above. Since count is not working, one way to create the intervals could be to query with $smwgQMaxLimit as the limit and check whether "query-continue-offset" is set; if it is set, choose a smaller interval. But this split-up into multiple intervals also makes the querying more complex if the query itself queries for the modification date or requires an ordering. @WolfgangFahl do you prefer the split-up into intervals, or should the user just be informed that the results are incomplete if they exceed the limitation?
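A rough sketch of such an interval split-up, building on the hypothetical ask() helper above; the property name, comparators, ISO date granularity and the halving strategy are all assumptions:

```python
from datetime import date, timedelta

def query_interval(api_url, start, end, max_limit=10000):
    """Collect results for the date range [start, end) by splitting the range
    in half whenever the server signals further results via query-continue-offset."""
    query = ("[[modification date::>%s]][[modification date::<%s]]|limit=%d"
             % (start.isoformat(), end.isoformat(), max_limit))
    results, cont = ask(api_url, query)
    if cont is None or (end - start) <= timedelta(days=1):
        # complete answer, or the interval can not be split any further
        return dict(results)
    # otherwise split the interval in half and recurse
    mid = start + (end - start) / 2
    merged = query_interval(api_url, start, mid, max_limit)
    merged.update(query_interval(api_url, mid, end, max_limit))
    return merged
```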
@tholzheim Thank you for your excellent analysis. I think we should offer an automatic modification date split. It seems that we can get at the number of results to be expected in some way, either with format=count or the query-continue-offset method. Again I'd propose an extra parameter, e.g. --avoidLimit or the like, to enable the functionality.
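A sketch of how such an --avoidLimit switch could be wired up, building on the hypothetical ask() and query_interval() helpers above; the option name follows the proposal here, everything else is an assumption (in particular, the split here simply walks all pages by modification date rather than combining the date range with an arbitrary query):

```python
from datetime import date

def backup_pages(api_url, query, batch, avoid_limit=False):
    """Fetch the pages selected by the query; with avoid_limit the query is
    split by modification date instead of being truncated by the server."""
    if avoid_limit:
        # the start date is an arbitrary assumption; anything before the
        # wiki's first edit will do
        return query_interval(api_url, date(2000, 1, 1), date.today())
    results, cont = ask(api_url, "%s|limit=%d" % (query, batch))
    if cont is not None:
        print("warning: results are incomplete (server limit reached)")
    return results
```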
The acceptance criterion would be that a
```
wikibackup -s or
```
would work for openresearch.org - with a sensible -q parameter.
Ask results are limited, e.g. to 1000/10000 results, based on the server configuration and the rights of the account being used.
When doing a query like the ones shown above, the result size is the number of pages in the wiki, which might well be far more than 10000.
There should be an approach to work around the limitation, e.g. by batching or splitting a query into multiple queries. For instance, the above query can be modified to only select a range of dates; by then going range by range, the query for the backup can be split down to a point where it works, e.g. by trying it out with a count result first.
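For illustration, such a range restriction could look like the following; the exact comparators and date granularity are assumptions based on the examples above:

```python
# restrict the backup query to one date range at a time; the comparators
# follow the [[modification date::<2020]] example above
def range_query(start, end):
    return "[[modification date::>%s]] [[modification date::<%s]]" % (start, end)

print(range_query("2019-01-01", "2020-01-01"))
# -> [[modification date::>2019-01-01]] [[modification date::<2020-01-01]]
```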