Closed Torvin closed 4 years ago
Uhm... This is a little bit tricky. But if we restrict the scope of issue into the erroneous presence of clcontinue
, I can try to clear the continuation parameters upon pagination and see whether there are any regressions.
Shouldn't it be like this (pseudocode)?
async IAsyncEnumerable<JObject> QueryWithContinuation(JObject queryParams)
{
var currentParams = queryParams;
for (; ; )
{
var result = await DoRequest(currentParams);
yield return result["query"];
var cont = result["continue"];
if (cont == null)
break;
currentParams = new JObject(queryParams);
currentParams.Merge(cont);
}
}
What's tricky about that? Or am I missing something?
The tricky part, I think, is you are using RequestHelper.QueryWithContinuation
(via your CategoryPropertyProvider
) with CategoryMembersGenerator
, and you want CategoryPropertyProvider
to support pagination.
WikiPagePropertyProvider<T>
is not designed to handle pagination. On the other hand, as I've mentioned in #67, while WikiPagePropertyList<T>
handles pagination, it is not designed to fetch page properties for multiple page in 1 batch. And that's why links
, backlinks
, categories
are designed as derived class of WikiPagePropertyList
instead of WikiPagePropertyProvider
.
I don't know why but the original CategoryPropertyProvider is marked as internal
There is no CategoryPropertyProvider
in the library. What is designed for your scenario is CategoriesGenerator
. Unfortunately, by design you can only feed pages one by one to CategoriesGenerator.PageTitle
property.
The cause of the current situation is that years gao I haven't investigate thoroughly how the query conituation works when both generator=
and prop=
needs pagination.
When pagination happens, the ParseContinuationParameters
call in QueryWithContinuation
won't be able to tell whether MW API response is continue listing the category of the first page, or is continue listing the pages. The current implementation in ParseContinuationParameters
assumes we are always continuing listing the pages, but it's not correct
Like what you have written in your pseudocode, This part does not check what (generator? or one, some, or all of the props?) we are continuing.
currentParams = new JObject(queryParams);
currentParams.Merge(cont);
However, it seems that we can assume MW API response continues listing props
content first, then continues listing generator
pages. I think we just need a reliable indicator for "props listing has completed". It seems that batchcomplete
is a very promising candidate, but it's not supported on MW 1.25-...
Anyway, before your requirement of dual pagination can be implemeted, for now, at least I can fix the issue of leftover continuation
parameters.
You can compare the response of the following two requests https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=categories&generator=allpages&cllimit=2&gapfrom=Australian_cuisine&gaplimit=2
{
"continue": {
"clcontinue": "62264|Australian_cuisine",
"continue": "||"
},
"query": {
"pages": {
"62264": {
"pageid": 62264,
"ns": 0,
"title": "Australian cuisine",
"categories": [
{
"ns": 14,
"title": "Category:All articles with unsourced statements"
},
{
"ns": 14,
"title": "Category:Articles with unsourced statements from October 2019"
}
]
},
"55714871": {
"pageid": 55714871,
"ns": 0,
"title": "Australian cultural identity"
}
}
}
}
{
"batchcomplete": "",
"continue": {
"gapcontinue": "Australian_culture",
"continue": "gapcontinue||"
},
"query": {
"pages": {
"62264": {
"pageid": 62264,
"ns": 0,
"title": "Australian cuisine",
"categories": [
{
"ns": 14,
"title": "Category:Wikipedia articles with LCCN identifiers"
},
{
"ns": 14,
"title": "Category:Wikipedia articles with SUDOC identifiers"
}
]
},
"55714871": {
"pageid": 55714871,
"ns": 0,
"title": "Australian cultural identity"
}
}
}
}
Thank you for a thorough write-up!
And I suddenly realized years ago, my consideration for your scenario is that you can actually construct a batch of WikiPage
instance and refresh it. This is also the approch used to be taken by pywikibot. Unlike QueryWithContinuation
, RefreshAync
can automatically merge paginated props of the same page together
https://github.com/CXuesong/WikiClientLibrary/blob/9f12cbd95bfb75fd3e0f22a62e1f76b78cf5ae25/WikiClientLibrary/RequestHelper.cs#L236
As long as you don't exceed the restriction on the maximum count of pages support in a batch, you can fetch for the complete content of your page properties, such as categories
. The only problem is that, there is no such a class like CategoryPropertyProvider
.
Since you have already implement your own PropertyProvider, the current suggestion I have for you is to
CategoryMembersGenerator.ListItemAsync
to get a sequence of WikiPageStub
s,WikiPageStub
s into WikiPage
s,WikiPage
sequence by batches (e.g., pywikibot batches pages by 30 items)var client = new WikiClient();
var site = new WikiSite(client, "https://en.wikipedia.org/w/api.php");
await site.Initialization;
var result = await new CategoryMembersGenerator(site)
{
CategoryTitle = "Category:Oceanian_cuisine", MemberTypes = CategoryMemberTypes.Page, PaginationSize = 2,
}.EnumItemsAsync().Select(stub => new WikiPage(site, stub.Title)).Take(10).ToListAsync();
// TODO: You can use Batch extension method from Ix.Async, and await foreach, to iterate through batches.
await result.RefreshAsync(new WikiPageQueryProvider { Properties = { new CategoryPropertyProvider { PaginationSize = 2 } } });
foreach (var page in result)
{
Output.WriteLine(page.Title);
Output.WriteLine("Category:" + string.Join(",", page.GetPropertyGroup<CategoryPropertyGroup>().Categories));
}
There is no CategoryPropertyProvider in the library.
There is, I just got the name wrong: https://github.com/CXuesong/WikiClientLibrary/blob/master/WikiClientLibrary/Pages/Queries/Properties/CategoriesPropertyProvider.cs
It's still internal
, unfortunately
I suppose it was not ready to be exposed to public at that time and then lost in my memory 🌚
Then perhaps I can make this class public now, but you need to use it with caution: do not let dual continuation happen.
Released v0.7.1
to fix the continuation
parameter issue and exposed CategoriesPropertyProvider
.
However, note that the code you have provided in the issue description will still break as I've discussed above.
If you feel the comment above is too overwhelming, perhaps we can chat further about this, on GitHub or not.
RequestHelper.QueryWithContinuation()
doesn't correctly implement pagination logic. It retains the continuation parameters from the previous request. E.g. if the API returns these continuation objects in 2 consecutive requests:The next request sent by the library will have
I.e.
"clcontinue": "aaa"
will still be sent, even though it shouldn't be there. This results in incomplete data.I wasn't able to reproduce this issue with standard providers easily, but it's reproducible with this
CategoryPropertyProvider
I wrote to retrieve categories of the page (I don't know why but the originalCategoryPropertyProvider
is marked asinternal
):Here
test
will be0
even though the real article has 2 categories.