DawnbrandBots / yaml-yugipedia

An automatically-updated collection of wikitexts from Yugipedia. Part of YAML Yugi.
GNU Lesser General Public License v3.0

Query for set information #2

Open kevinlul opened 2 years ago

kevinlul commented 2 years ago

Collecting set information is the last piece for YAML Yugi to exceed parity with other solutions. Unlike the other data collected so far, which are contained in flat categories, sets are indexed on Yugipedia in hierarchical categories. This means that instead of a target category for sets directly containing an article about a set, categories may be nested. When querying the MediaWiki API, only the immediate members of a category are returned, including the names of child categories, but the members of those child categories are not returned. Therefore, new code is required in order to download entire category hierarchies and subscribe to updates on them. Category hierarchies are allowed to contain cycles, and while this is not expected of the categories for sets, our code should be correct even if cycles are encountered and not fall into an infinite loop.

Design

Either create or extend the current full download script to recursively download a targeted category without falling into infinite loops. For example, after fetching https://yugipedia.com/api.php?action=query&redirects=true&generator=categorymembers&prop=revisions&rvprop=content&format=json&formatversion=2&gcmlimit=50&gcmtitle=Category:Yu-Gi-Oh!_Master_Duel_sets, the ns=14 category items in the response should be stored in an ordered set for additional follow-up requests once the current category is completely downloaded.
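A minimal sketch of the first step, separating articles from ns=14 subcategories in a `categorymembers` response (`split_members` is a hypothetical helper name, not existing repo code; the page dicts mirror the shape of the API's `query.pages` entries):

```python
def split_members(pages):
    """Split a categorymembers response's page list into articles to save
    and subcategory titles to queue for follow-up requests.
    In the MediaWiki API, ns=14 marks the Category namespace."""
    articles = [p for p in pages if p.get("ns") != 14]
    subcategories = [p["title"] for p in pages if p.get("ns") == 14]
    return articles, subcategories
```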

To subscribe to incremental updates, the existing script can be used, but each time, it should be called with all the known descendant categories cached from the last full download, in addition to the top-level category itself. This is because the MediaWiki API only provides the immediate parent categories of an article, not all ancestor categories.
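One way to build the category list for the incremental script, assuming the descendants were cached by the last full download (`watch_list` is a hypothetical helper, not existing repo code):

```python
def watch_list(root, cached_descendants):
    """Combine the top-level category with the descendant categories cached
    from the last full download, deduplicating while preserving order.
    Every category must be listed explicitly because the MediaWiki API
    only reports a page's immediate parent categories, not all ancestors."""
    return list(dict.fromkeys([root, *cached_descendants]))
```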

Subtasks

kevinlul commented 2 years ago

These are fairly scattered due to the varying types of product that exist. Examples:

kevinlul commented 1 year ago

Recursive on

Top level for exploration: https://yugipedia.com/wiki/Category:Sets

kevinlul commented 8 months ago

Recursive full download logic is okay. Capture the ns=14 category items in the response https://yugipedia.com/api.php?action=query&redirects=true&generator=categorymembers&prop=revisions&rvprop=content&format=json&formatversion=2&gcmlimit=50&gcmtitle=Category:Yu-Gi-Oh!_Master_Duel_sets and store them for additional follow-up requests. It's less clear whether just the top-level category is effective for recent changes, or if all target categories need to be queried.

kevinlul commented 8 months ago

Must list all immediate containing categories to be effective.

kevinlul commented 5 months ago

Notes:

- Recursive full download needs to return a list of found categories (ns=14) after each page downloaded.
- In the main loop, this list is appended to an OrderedSet.
- There's an additional outer loop iterating over the OrderedSet, thus fetching all categories without infinite recursion in the case of cycles, because a repeated category will already be in the set and have been iterated.
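The loop described above could look roughly like this. It's a sketch, not the repo's actual code: `fetch` stands in for the real MediaWiki request in src/utils.py and returns `(downloaded_pages, child_category_titles)`; a plain `dict` (insertion-ordered since Python 3.7) plays the role of the OrderedSet:

```python
def download_category_tree(root, fetch):
    """Traverse a category hierarchy without looping forever on cycles.
    The dict doubles as an ordered set: a category already seen is never
    re-queued, so a cycle simply terminates the walk."""
    queue = {root: None}  # ordered set of categories still to fetch
    done = {}             # ordered set of categories already fetched
    results = []
    while queue:
        title = next(iter(queue))
        del queue[title]
        done[title] = None
        pages, children = fetch(title)
        results.extend(pages)
        for child in children:
            if child not in done and child not in queue:
                queue[child] = None
    return results, list(done)
```

This also answers the "second return value or OOP" question implicitly: the per-category fetch returns the found categories, and only the outer loop owns the ordered set.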

Should I just add the second return value for the list of categories or switch to OOP?

xyj-3 commented 1 month ago

Is OrderedSet a specific thing? Also what do you mean by "add the second return value for the list of categories or switch to OOP"?

Also do you want the new downloaded files to be flat in the top level category folder or nested?

xyj-3 commented 1 month ago

How do gcmcontinue and grccontinue work? When are you using them, and what value do you give them?

kevinlul commented 1 month ago

I'm describing the changes that need to happen to the main logic in the download function in https://github.com/DawnbrandBots/yaml-yugipedia/blob/master/src/utils.py

Currently the category is specified to the MediaWiki API by the gcmtitle URL parameter in main.py. However, this only retrieves direct members of the category, so the download logic needs to keep track of child category pages that were retrieved, to be downloaded by another request to the MediaWiki API. I mentioned an OrderedSet because that is one way to keep track of the categories already downloaded and newly discovered in order to avoid infinite looping.

gcmcontinue and grccontinue are pagination tokens returned by the MediaWiki API when a generator's results don't fit in a single response. In the download scripts, the token is populated from the previous request so all pages are downloaded, but it can also be provided on the command line to restart a previous set of downloads from the middle. The parameter varies by script. For main.py, the generator CategoryMembers is used, so the parameter is gcmcontinue. For incremental.py, the generator RecentChanges is used, so the parameter is grccontinue.
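The continuation handling can be sketched as a generic loop (`api_get` is a hypothetical stand-in for an HTTP GET returning parsed JSON, not repo code). Merging the response's whole `continue` object into the next request covers both `gcmcontinue` and `grccontinue` without special-casing either:

```python
def fetch_all(api_get, params):
    """Yield response pages, following MediaWiki `continue` tokens until
    the API stops returning one. The API echoes back whichever token its
    generator uses (gcmcontinue, grccontinue, ...), so copying the entire
    `continue` object into the next request is generator-agnostic."""
    request = dict(params)
    while True:
        response = api_get(request)
        yield response
        if "continue" not in response:
            break
        request.update(response["continue"])
```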

https://yugipedia.com/api.php https://www.mediawiki.org/wiki/API:Query#query:generator https://www.mediawiki.org/w/api.php?action=help&modules=query%2Bcategorymembers https://www.mediawiki.org/w/api.php?action=help&modules=query%2Brecentchanges

xyj-3 commented 1 month ago

I got it to work with recursion and a second return value but it doesn't look that nice right now so I'm considering restructuring it.

The biggest issue so far is actually getting an identifier for the top category for preventing loops. The generator=categorymembers doesn't return any info about the category itself.

I figure you can use pageid or title to track if there is looping. So in that case it looks like you either have to

What do you think? Do you have any preferences because otherwise I'm probably picking making another request for every category.

kevinlul commented 1 month ago

Feel free to restructure. I already anticipated it would be necessary and there's actually very little code in this repository. The only interface that needs to be respected for full downloads is the command-line interface. Everything else is an implementation detail.