CXuesong / WikiClientLibrary

/*🌻*/ Wiki Client Library is an asynchronous MediaWiki API client library targeting modern .NET platforms
https://github.com/CXuesong/WikiClientLibrary/wiki
Apache License 2.0
80 stars 16 forks source link

Pagination is broken in presence of generators or multiple providers #68

Closed Torvin closed 4 years ago

Torvin commented 4 years ago

RequestHelper.QueryWithContinuation() doesn't correctly implement pagination logic. It retains the continuation parameters from the previous request. E.g. if the API returns these continuation objects in 2 consecutive requests:

"continue": {
    "clcontinue": "aaa",
    "continue": "||"
}
"continue": {
    "gcmcontinue": "bbb",
    "continue": "gcmcontinue||"
}

The next request sent by the library will have

"clcontinue": "aaa"
"gcmcontinue": "bbb",
"continue": "gcmcontinue||"

I.e. "clcontinue": "aaa" will still be sent, even though it shouldn't be there. This results in incomplete data.

I wasn't able to reproduce this issue with standard providers easily, but it's reproducible with this CategoryPropertyProvider I wrote to retrieve categories of the page (I don't know why but the original CategoryPropertyProvider is marked as internal):

class CategoryPropertyProvider : WikiPagePropertyProvider<CategoryPropertyGroup>
{
    public override string PropertyName => "categories";
    public int PaginationSize { get; set; }

    public override IEnumerable<KeyValuePair<string, object>> EnumParameters(MediaWikiVersion version)
    {
        yield return KeyValuePair.Create("clshow", "!hidden" as object);
        yield return KeyValuePair.Create("cllimit", PaginationSize as object);
    }

    public override CategoryPropertyGroup ParsePropertyGroup(JObject json)
    {
        return new CategoryPropertyGroup(json[PropertyName]?.Select(x => x.Value<string>("title")).ToArray());
    }
}

class CategoryPropertyGroup : WikiPagePropertyGroup
{
    public CategoryPropertyGroup(IReadOnlyList<string> categories)
    {
        Categories = categories ?? Array.Empty<string>();
    }

    public IReadOnlyList<string> Categories { get; }
}
static async Task Main()
{
    using var client = new WikiClient();
    var site = new WikiSite(client, "https://en.wikipedia.org/w/api.php");
    await site.Initialization;

    // set both limits to 1 to allow the bug manifest more easily
    var result = await new CategoryMembersGenerator(site)
    {
        CategoryTitle = "Category:Oceanian_cuisine",
        MemberTypes = CategoryMemberTypes.Page,
        PaginationSize = 1,
    }.EnumPagesAsync(new WikiPageQueryProvider
    {
        Properties =
        {
            new CategoryPropertyProvider() { PaginationSize = 1 }
        }
    }).Select(item => Tuple.Create(item.Title, item.GetPropertyGroup<CategoryPropertyGroup>())).ToArrayAsync();

   var test = result.Where(x => x.Item1 == "Australian cuisine").SelectMany(x => x.Item2.Categories).Count();
}

Here test will be 0 even though the real article has 2 categories.

CXuesong commented 4 years ago

Uhm... This is a little bit tricky. But if we restrict the scope of issue into the erroneous presence of clcontinue, I can try to clear the continuation parameters upon pagination and see whether there are any regressions.

Torvin commented 4 years ago

Shouldn't it be like this (pseudocode)?

async IAsyncEnumerable<JObject> QueryWithContinuation(JObject queryParams)
{
    var currentParams = queryParams;

    for (; ; )
    {
        var result = await DoRequest(currentParams);
        yield return result["query"];

        var cont = result["continue"];
        if (cont == null)
            break;

        currentParams = new JObject(queryParams);
        currentParams.Merge(cont);
    }
}

What's tricky about that? Or am I missing something?

CXuesong commented 4 years ago

The tricky part, I think, is you are using RequestHelper.QueryWithContinuation (via your CategoryPropertyProvider) with CategoryMembersGenerator, and you want CategoryPropertyProvider to support pagination.

WikiPagePropertyProvider<T> is not designed to handle pagination. On the other hand, as I've mentioned in #67, while WikiPagePropertyList<T> handles pagination, it is not designed to fetch page properties for multiple page in 1 batch. And that's why links, backlinks, categories are designed as derived class of WikiPagePropertyList instead of WikiPagePropertyProvider.

I don't know why but the original CategoryPropertyProvider is marked as internal

There is no CategoryPropertyProvider in the library. What is designed for your scenario is CategoriesGenerator. Unfortunately, by design you can only feed pages one by one to CategoriesGenerator.PageTitle property.


The cause of the current situation is that years gao I haven't investigate thoroughly how the query conituation works when both generator= and prop= needs pagination.

When pagination happens, the ParseContinuationParameters call in QueryWithContinuation won't be able to tell whether MW API response is continue listing the category of the first page, or is continue listing the pages. The current implementation in ParseContinuationParameters assumes we are always continuing listing the pages, but it's not correct

https://github.com/CXuesong/WikiClientLibrary/blob/ccbd04d4dab91ff109b6d80aad4fe0fd7d1ae92f/WikiClientLibrary/RequestHelper.cs#L136

Like what you have written in your pseudocode, This part does not check what (generator? or one, some, or all of the props?) we are continuing.

        currentParams = new JObject(queryParams);
        currentParams.Merge(cont);

However, it seems that we can assume MW API response continues listing props content first, then continues listing generator pages. I think we just need a reliable indicator for "props listing has completed". It seems that batchcomplete is a very promising candidate, but it's not supported on MW 1.25-...

Anyway, before your requirement of dual pagination can be implemeted, for now, at least I can fix the issue of leftover continuation parameters.

You can compare the response of the following two requests https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=categories&generator=allpages&cllimit=2&gapfrom=Australian_cuisine&gaplimit=2

{
    "continue": {
        "clcontinue": "62264|Australian_cuisine",
        "continue": "||"
    },
    "query": {
        "pages": {
            "62264": {
                "pageid": 62264,
                "ns": 0,
                "title": "Australian cuisine",
                "categories": [
                    {
                        "ns": 14,
                        "title": "Category:All articles with unsourced statements"
                    },
                    {
                        "ns": 14,
                        "title": "Category:Articles with unsourced statements from October 2019"
                    }
                ]
            },
            "55714871": {
                "pageid": 55714871,
                "ns": 0,
                "title": "Australian cultural identity"
            }
        }
    }
}

https://en.wikipedia.org/wiki/Special:ApiSandbox#action=query&format=json&prop=categories&continue=%7C%7C&generator=allpages&cllimit=2&clcontinue=62264%7CWikipedia_articles_with_LCCN_identifiers&gapfrom=Australian_cuisine&gaplimit=2

{
    "batchcomplete": "",
    "continue": {
        "gapcontinue": "Australian_culture",
        "continue": "gapcontinue||"
    },
    "query": {
        "pages": {
            "62264": {
                "pageid": 62264,
                "ns": 0,
                "title": "Australian cuisine",
                "categories": [
                    {
                        "ns": 14,
                        "title": "Category:Wikipedia articles with LCCN identifiers"
                    },
                    {
                        "ns": 14,
                        "title": "Category:Wikipedia articles with SUDOC identifiers"
                    }
                ]
            },
            "55714871": {
                "pageid": 55714871,
                "ns": 0,
                "title": "Australian cultural identity"
            }
        }
    }
}
Torvin commented 4 years ago

Thank you for a thorough write-up!

CXuesong commented 4 years ago

And I suddenly realized years ago, my consideration for your scenario is that you can actually construct a batch of WikiPage instance and refresh it. This is also the approch used to be taken by pywikibot. Unlike QueryWithContinuation, RefreshAync can automatically merge paginated props of the same page together https://github.com/CXuesong/WikiClientLibrary/blob/9f12cbd95bfb75fd3e0f22a62e1f76b78cf5ae25/WikiClientLibrary/RequestHelper.cs#L236

As long as you don't exceed the restriction on the maximum count of pages support in a batch, you can fetch for the complete content of your page properties, such as categories. The only problem is that, there is no such a class like CategoryPropertyProvider.

Since you have already implement your own PropertyProvider, the current suggestion I have for you is to

var client = new WikiClient();
var site = new WikiSite(client, "https://en.wikipedia.org/w/api.php");
await site.Initialization;

var result = await new CategoryMembersGenerator(site)
{
    CategoryTitle = "Category:Oceanian_cuisine", MemberTypes = CategoryMemberTypes.Page, PaginationSize = 2,
}.EnumItemsAsync().Select(stub => new WikiPage(site, stub.Title)).Take(10).ToListAsync();
// TODO: You can use Batch extension method from Ix.Async, and await foreach, to iterate through batches.
await result.RefreshAsync(new WikiPageQueryProvider { Properties = { new CategoryPropertyProvider { PaginationSize = 2 } } });
foreach (var page in result)
{
    Output.WriteLine(page.Title);
    Output.WriteLine("Category:" + string.Join(",", page.GetPropertyGroup<CategoryPropertyGroup>().Categories));
}
Torvin commented 4 years ago

There is no CategoryPropertyProvider in the library.

There is, I just got the name wrong: https://github.com/CXuesong/WikiClientLibrary/blob/master/WikiClientLibrary/Pages/Queries/Properties/CategoriesPropertyProvider.cs

It's still internal, unfortunately

CXuesong commented 4 years ago

I suppose it was not ready to be exposed to public at that time and then lost in my memory 🌚

Then perhaps I can make this class public now, but you need to use it with caution: do not let dual continuation happen.

CXuesong commented 4 years ago

Released v0.7.1 to fix the continuation parameter issue and exposed CategoriesPropertyProvider.

However, note that the code you have provided in the issue description will still break as I've discussed above.

If you feel the comment above is too overwhelming, perhaps we can chat further about this, on GitHub or not.