Recent change in Wikipedias Layout breaks the table of contents

Builditluc commented 1 year ago

TLDR: The Table of contents won't be visible until the issue with the parser has been resolved. All other features should be working correctly. Check out this issue frequently, as all news about the progress will be posted here

Issue Status

A checklist of what needs to be done can be found here. Currently, the progress in development can be tracked by looking at the following branches related to the issue:

143-rework-backend: This is the main branch of the issue and all finished items on the checklist get merged into this one
(merged #147) 143-create-api-handler: The new API handler is created in this branch
(merged #149) 143-create-parser-system: The new parser system is created in this branch

Recently, the Wikimedia Foundation announced its first change to the design of its site (including the articles) in over a decade, in which they "Put Usability at the Forefront"

One of the changes was a renewal of the table of contents, now on the left of the article. This changed its appearance in the HTML code considerably, revealing some significant flaws in the parsing design (We've always assumed that the table of contents would be at a specific location in a specific format. This is now no longer the case).

I've already come up with a solution to this problem. We can get the table of contents easily from the Wikipedia JSON API, already formatted with all of the required data. A quick example of the returned table of contents (internally called sections) for the article Meaning

{
    "parse": {
        "title": "Meaning",
        "pageid": 18916,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "Arts and entertainment",
                "number": "1",
                "index": "1",
                "fromtitle": "Meaning",
                "byteoffset": 785,
                "anchor": "Arts_and_entertainment",
                "linkAnchor": "Arts_and_entertainment"
            },
            {
                "toclevel": 1,
                "level": "2",
                "line": "See also",
                "number": "2",
                "index": "2",
                "fromtitle": "Meaning",
                "byteoffset": 1238,
                "anchor": "See_also",
                "linkAnchor": "See_also"
            }
        ],
        "showtoc": ""
    }
}

Response of the API call: https://en.wikipedia.org/w/api.php?format=json&action=parse&page=Meaning&prop=sections

As we can see, the API gives us all of the necessary data (level, number, linkAnchor). To implement this change we would, however, have to rewrite the whole table of contents section in the parser. Because we only get the id of the span that contains the header, we would also have to rewrite the header handling system. This requires tapping into the code of the UI.

These are some pretty big changes, but it'll be a high priority for me to fix this as soon as possible. I'll start with designing the interfaces for the new parser and once the design is fleshed out, I'll start implementing the changes.

Builditluc commented 1 year ago

Table of Contents redesign

Data Structure and Representation

First, we need to know exactly what information we want in the end. We need to ask ourselves, what information do we need for our current functionality to work (Don't want to overcomplicate this)? Well, we need the following things:

Some sort of identifier (maybe the id of the page the toc refers to)
The title, drawn at the top of the list (usually the page title)
Well, a list of all of the actual items

For every item, we would need the following things:

Some identifier (used internally for things like jumping to the header with a specific id)
The level (can be used to identify the type of header, maybe even represented as an enum)
The Text or title of the item
Some way of linking to a specific position in the article

When we look at what the API gives us, we can see that everything we need is given to us Sample API result:

{
    "parse": {
        "title": "GitHub",
        "pageid": 18545292,
        "sections": [
            {
                "toclevel": 1,
                "level": "2",
                "line": "History",
                "number": "1",
                "index": "1",
                "fromtitle": "GitHub",
                "byteoffset": 5788,
                "anchor": "History",
                "linkAnchor": "History"
            },
            ...
        ],
        ...
    }
}

When now try to put these requirements into a rust data type, we come out with the following (this is just a first draft. variable names and types are not fixed)

struct TableOfContents {
    id: usize,
    title: String,
    items: Vec<TableOfContentsItem>
}

struct TableOfContentsItem {
    id: usize,
    level: usize, // or an Enum, don't know yet
    text: String,
    anchor: String,
}

The anchor would be the id of the header that this item is linked to. When parsing the article, we can then just look at what TableOfContentsItem has that id and then save the id in the ArticleElement. We've just found out the first API requirement we need. Some sort of lookup that returns the correct item id when given an anchor.

Integration with the Article

The Problem

The current implementation separates the table of contents from the article. The UI has to fetch the table of contents from the article and then do its actions there manually (matching the items against headers, fetching the title, formatting the item title, etc.). This adds a lot of overhead in the UI that's not necessary (we can add functions for everything the UI needs from the article to Article itself, so we don't need to interact with the underlying data directly (this also allows us to change the structure and data representation, like adding more fields, without having to rewrite the UI.

When we look at the things we currently do with the table of contents, we can see that it's all over the place. For example, look at what happens when you click on an item in the toc:

We iterate through every item in the SelectView (not the actual toc data) and we then compare the text of each item with the text of the item we clicked.
We then fetch the index of said item in the SelectView
We then call ArticleView::select_header with this index

There are quite a few things wrong and just dangerous about this. First, the whole thing depends on the headers being in the same order everywhere (in the Article and the SelectView). We don't use identifiers. No. We use the index.... That's just a disaster waiting to happen. Also, we don't work with the actual data here, just the items stored in the SelectView and some coordinates in an array on the Articles side.

Redesigned Solution

The solution for this would be quite simple actually. We use identifiers to refer to an item instead of its index. Currently, we need the following functionality:

Get the level and text from an item given its id
Get the anchor for an item given its id (needed for the linking of an ArticleElement to a TableOfContentsItem)
Get the coordinates for an item given its id (this functionality would be in the ArticleContent and not the Article itself, because these depend on the size of the view (can change)

As we can see, that's not too complicated.

One thing to note about the text of these items. Because the displayed text can be changed in the config, we format the text right in the parsing stage. This configuration change only configures the text displayed in the SelectView. So why would we need to complicate the parser when we can just let the UI handle formatting?

Naming

I believe this is the right time to change the naming of some of the structs and variables of the table of contents. A better name for it would be Section or Sections. This is also the name the Wikipedia API uses for the toc. This would mean that the Article struct would now have an attribute called sections that stores the individual sections of the article, called Section. This would remove the need for another struct called TableOfContents entirely because all of the logic is going to be moved into the main Article struct.

Builditluc commented 1 year ago

Making API Calls

One Struct handles it all

Currently, interacting with the Wikipedia API happens directly inside of the ArticleBuilder and SearchBuilder respectively. This means that we need two implementations of validating parameters, checking for errors in the responses, and making the requests. One way to solve this would be to just create helper functions for these things but I believe it would be best to move the direct API interactions into a separate struct that handles it all (parameter validation, making requests, maybe in the future even caching). Because this struct interacts with the Wikimedia API (basically Wikimedia is the database Wikipedia is built on. That means we can handle all kinds of sites that use Wikimedia and have its API endpoint), a good name for it would be WikimediaAPI. I don't want to overcomplicate things so the API Handler would only handle the things we need for now, easily expanded once needed.

Structure and current Requirements

Currently, we only need four different main things for the handler to do:

Search for an Article
Continue Searching for an Article
Fetch an Article given its title (used by the links)
Fetch an Article given its id (used by the search results)

When we only implement these four things, we not only keep the struct clean and simple, but we can expand it in the future. The handler would then have private helper functions that validate arguments and check for wikimedia errors. The return types for each of these functions would be json, keeping it simple (we don't have to create a struct for each API response and don't have to deal with many many Option<T> attributes).

For the error handling, we would create an WikimediaError enum with all of the different errors in it. That way we can simplify the error handling in the UI. That means each of the endpoints returns a Result<Json, WikimediaError> (not sure yet how the raw json type is named).

Testing

Creating one struct for handling all of the wikimedia API interactions enables us to test it better. We can mock the Wikipedia API for example and then test all the different things that can happen for each of the requirements.

Builditluc commented 1 year ago

Things that need to be done:

[ ] write tests for the API handler
[x] create an API handler
[x] rework the parser system
[ ] implement the reworked parser into the UI
[ ] remove the old code

Builditluc / wiki-tui