WolfgangFahl / py-3rdparty-mediawiki

Wrapper for the pywikibot and mwclient MediaWiki API libraries with improvements for 3rd party wikis
Apache License 2.0

Overcome upper limit of SMW ask queries #42

Closed WolfgangFahl closed 3 years ago

WolfgangFahl commented 3 years ago

ask results are limited, e.g. to 1,000 or 10,000 results, depending on the server configuration and the rights of the account being used.

When doing a query like

[[modification date::+]]

the result size is the number of pages in the wiki, which might well be far more than 10,000.

There should be an approach to work around the limitation, e.g. by batching or splitting a query into multiple queries. E.g. the above query can be modified to

[[modification date::<2020]][[modification date::>2018]]

to select only a range of dates, and then go range by range. E.g. the query for the backup can be split down to a point where it works, for instance by trying it out with a count result first.
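The date-range idea above could be sketched like this. Note that `split_by_year` is a hypothetical helper, not part of the library, and the exact `>`/`<` comparison semantics for dates depend on the SMW version:

```python
def split_by_year(condition, start_year, end_year):
    """Split an ask condition into per-year sub-queries on the
    modification date (illustrative sketch, not library code)."""
    return [
        f"{condition}[[modification date::>{year}]][[modification date::<{year + 1}]]"
        for year in range(start_year, end_year)
    ]

for q in split_by_year("", 2018, 2020):
    print(q)
```

Each sub-query then stays under the server limit as long as the yearly page count does; smaller intervals (months, weeks) would follow the same pattern.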

tholzheim commented 3 years ago

I implemented a simple query split-up for a defined threshold (currently 1000). For example, a query with 998 results is split into the following queries:

[[modification_date::+]] | offset=0 | limit=100
[[modification_date::+]] | offset=100 | limit=100
[[modification_date::+]] | offset=200 | limit=100
[[modification_date::+]] | offset=300 | limit=100
[[modification_date::+]] | offset=400 | limit=100
[[modification_date::+]] | offset=500 | limit=100
[[modification_date::+]] | offset=600 | limit=100
[[modification_date::+]] | offset=700 | limit=100
[[modification_date::+]] | offset=800 | limit=100
[[modification_date::+]] | offset=900 | limit=100

Is that the behavior that you are looking for?
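The split-up listed above can be reproduced with a small sketch (the `split_query` helper and its default batch size are illustrative, not the actual implementation):

```python
def split_query(query, result_count, batch_size=100):
    """Split one ask query into offset/limit sub-queries that
    together cover result_count results (illustrative sketch)."""
    return [
        f"{query} | offset={offset} | limit={batch_size}"
        for offset in range(0, result_count, batch_size)
    ]

for q in split_query("[[modification_date::+]]", 998):
    print(q)
```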

WolfgangFahl commented 3 years ago

That looks good. All we need now is some batching control like https://github.com/WolfgangFahl/ProceedingsTitleParser/blob/4284bc33a29479eba6332e02c7108176425dadc1/ptp/openresearch.py#L65 has:

def cacheEvents(self, em, limit=500, batch=5000):
    offset = 0
    if self.profile:
        print("retrieving events for openresearch.org cache from SMW")
    while True:
        found, event = self.cacheEventBatch(em, limit=batch, offset=offset)
        if self.profile:
            em.showProgress("retrieved events %5d-%5d" % (offset + 1, offset + found))
            em.showProgress(event.asJson())
        offset = offset + batch
        if found < batch or len(em.events) >= limit:
            break

    return em

So basically, instead of having a fixed THRESHOLD, the limit should be specifiable in the call (or, as you implemented it, in the constructor ...).

It would be good if we could determine the batch size automatically, but for a start it would be o.k. to specify it via the command line.
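The batching control could follow the same pattern as cacheEvents, with limit and batch exposed as call parameters. A minimal sketch, where `fetch_batch` stands in for the actual SMW ask call:

```python
def fetch_all(fetch_batch, limit=500, batch=100):
    """Fetch records batch by batch until a batch comes back short
    or the overall limit is reached (sketch; fetch_batch(limit=..., offset=...)
    is a placeholder for the real query call)."""
    records = []
    offset = 0
    while True:
        found = fetch_batch(limit=batch, offset=offset)
        records.extend(found)
        offset += batch
        if len(found) < batch or len(records) >= limit:
            break
    return records[:limit]

# usage with a stub standing in for an SMW ask call
data = list(range(237))
result = fetch_all(lambda limit, offset: data[offset:offset + limit],
                   limit=500, batch=100)
print(len(result))  # 237
```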

https://github.com/WolfgangFahl/pyLoDStorage/blob/53582c643773242c08bf3c4ee986d22d11b0b1cb/lodstorage/sparql.py#L108 also has batch handling - there it is not so much because of a limit, but to better support a progress bar and the like.

Please also add a test case for the new behavior if you intend to create a pull request. Good work!

tholzheim commented 3 years ago

During the implementation of the tests I noticed that the $smwgQMaxLimit cannot be bypassed by splitting the query into multiple sub-queries with limit and offset.

For example, the query [[modification date::+]] | offset=0 | limit=10000 returns 10000 results as specified by the limit, and the result also contains the parameter query-continue-offset=10000, indicating that the query has more results. But executing the query [[modification date::+]] | offset=10000 | limit=10000 to get the next 10000 results returns an empty result, because we ask for results beyond the $smwgQMaxLimit.

We know that there are more results to the query from the definition of "query-continue-offset":

"The API result contains a "query-continue-offset" key, which can be used to fetch additional results: &parameters=offset%3D10|limit%3D10. If there is no "query-continue-offset" key in the result, the end of the result set was reached. "
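A continuation loop based on that definition might look like this (a sketch: `run_query` stands in for the raw API call, and, as noted above, the server still stops honoring offsets beyond $smwgQMaxLimit):

```python
def collect_results(run_query, limit=50):
    """Fetch pages batch by batch, following "query-continue-offset"
    until the key disappears from the response (sketch)."""
    results = {}
    offset = 0
    while True:
        response = run_query(offset=offset, limit=limit)
        results.update(response.get("results", {}))
        if "query-continue-offset" not in response:
            break  # end of the result set reached
        offset = response["query-continue-offset"]
    return results
```

With a stub API that serves 120 pages in slices, the loop collects all 120 and then stops once the continuation key is absent.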

For example, the result of the query [[Category:Person]] | offset=0 | limit=10000 does not contain the "query-continue-offset" parameter, indicating that the 5170 results are the complete answer to the query. The definition of offset states that:

If the limit specified with this parameter exceeds the limit set with configuration parameter $smwgQUpperbound, the offset will fallback to the limit of "0"

This means we cannot query more results than the defined limit.

The only option I see to bypass this limitation is to restrict the query results, e.g. by adding [[modification date::<2020]], to break the query down into small enough date intervals, as you mentioned above. Since count is not working, one way to create the intervals could be to query with $smwgQMaxLimit as the limit and check whether "query-continue-offset" is set; if it is set, choose a smaller interval.

But this split-up into multiple intervals also makes the querying more complex if the query itself queries for the modification date or requires an ordering. @WolfgangFahl do you prefer the split-up into intervals, or should the user just be informed that the results are incomplete if they exceed the limitation?
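The interval split-up could be sketched as a recursive bisection, where `fits(start, end)` is a hypothetical probe that runs the query for the interval with $smwgQMaxLimit as limit and returns True when no "query-continue-offset" comes back:

```python
from datetime import date

def split_intervals(start, end, fits, min_days=1):
    """Bisect [start, end) until each sub-interval "fits" below the
    server limit as reported by the probe function (sketch; fits is
    a placeholder for a real probe query)."""
    if fits(start, end) or (end - start).days <= min_days:
        return [(start, end)]
    mid = start + (end - start) / 2
    return (split_intervals(start, mid, fits, min_days)
            + split_intervals(mid, end, fits, min_days))

# usage with a stub probe: pretend intervals of at most a year fit
fits = lambda s, e: (e - s).days <= 365
intervals = split_intervals(date(2018, 1, 1), date(2020, 1, 1), fits)
print(len(intervals))  # 2
```

The min_days floor keeps the recursion from looping forever on a single day that still exceeds the limit; in that case the results for that day would remain incomplete and the user should be warned.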

WolfgangFahl commented 3 years ago

@tholzheim Thank you for your excellent analysis. I think we should offer an automatic modification date split. It seems that we can get at the number of expected results in some way, either with format=count or with the query-continue-offset method. Again I'd propose an extra parameter, e.g. --avoidLimit or the like, to enable the functionality.

The acceptance criterion would be that a

wikibackup -s or 

would work for openresearch.org - with a sensible -q parameter