Issues with ToC parsing and importing pages.

a-velasco commented 4 weeks ago

Issues

The action fails to parse this index.md table of contents.
~~Running the action on a repo with no /docs folder creates a complete table of contents in index.md, but only imports some of those pages into the repo.~~ Solved: see this comment.

Steps taken

The repository has an existing /docs folder with all the pages published to Charmhub. This migration was done by an old fork of the discourse-gatekeeper action.

Issue 1: Table of contents nesting

We updated our workflow from the fork to canonical/discourse-gatekeeper@stable. Then, we re-triggered the workflow with no changes to discourse or GitHub.

The action failed due to an index/ToC related error (link to logs):

src.exceptions.InputError: A nested item is a reference to a path that is not within the directory of its parent. item=_ParsedListItem(whitespace_count=0, reference_title='Reference', reference_value='reference', rank=19, hidden=False), expected parent path: PosixPath('how-to')

It fails on this item of the contents table.

The same error happens with canonical/discourse-gatekeeper@main.

Issue 2 (solved): Action does not import all documentation pages

See this comment

We deleted the entire `/docs` directory to allow the action to re-generate all docs from scratch with the expected structure. However, the action only downloaded a select few pages. [This older revision](https://github.com/canonical/mysql-router-k8s-operator/tree/1a28465fcd3993d69aaae8999945f1a7085350a8/docs) shows the expected pages. [This newer revision](https://github.com/canonical/mysql-router-k8s-operator/tree/d5b588b0489ae38b0f8a391051f88dec7f028ffe/docs) shows the result of deleting `/docs` and running the action. This might be an entirely different issue altogether. It happens with both `main` and `stable` branches of discourse-gatekeeper. After this, subsequent runs of the action ([example run](https://github.com/canonical/mysql-router-k8s-operator/actions/runs/9545958549/job/26307833786)) yield the following error: ```shell gatekeeper.exceptions.InputError: An item is not a file, directory or external HTTP resource. item=_ParsedListItem(whitespace_count=2, reference_title='1. Introduction', reference_value='tutorial/t-overview.md', rank=1, hidden=False) ``` The item referenced in the error above is the first page that exists in the table of contents but was not correctly imported to GitHub, so I guess that is why it could not be parsed.

a-velasco commented 3 weeks ago

Issue 2 resolved: We had not deleted the /base-content tag when resetting the docs. We made sure to delete it completely before running the action, and it worked as expected (imported all documentation pages)

a-velasco commented 3 weeks ago

RE: Issue 1

I think this might be due to the logic in the _calculate_contents_hierarchy function.

When checking if the current item has a lower whitespace count than expected, it seems to assume the item is just one level away, so it always corrects the hierarchy by subtracting 1.

In a case where it is two or more levels away, its hierarchy value will be incorrect, so the _check_contents_item function will complain that this item isn't part of the group that its hierarchy indicates (i.e. raises that InputError I quoted in the OP)

I've had some success by modifying _calculate_contents_hierarchy like so:

while item:
        # All items in the current directory have been processed
        if item.whitespace_count < whitespace_expectation_per_level[hierarchy]:
+           if item.whitespace_count == 0:
+               hierarchy = 0
+               aggregate_dir = Path('.')
+           else:
+               hierarchy = hierarchy - 1
+               parent = parents.pop()
+               aggregate_dir = Path(parent.reference_value).parent
-           hierarchy = hierarchy - 1
-           parent = parents.pop()
-           aggregate_dir = Path(parent.reference_value).parent

This passes the unit tests in tests/unit/test_index_contents_hierarchy.py, and parses the contents as expected when I test it on the repository's CI.

I also added a new test case to simulate this scenario:

        pytest.param(
            (
                item_1 := factories.IndexParsedListItemFactory(
                    whitespace_count=0, reference_value=(value_1 := "dir1")
                ),
                item_2 := factories.IndexParsedListItemFactory(
                    whitespace_count=1, reference_value=(value_2 := f"{value_1}/dir2")
                ),
                item_3 := factories.IndexParsedListItemFactory(
                    whitespace_count=2, reference_value=(value_3 := f"{value_2}/file3.md")
                ),
                item_4 := factories.IndexParsedListItemFactory(
                    whitespace_count=0, reference_value=(value_4 := "dir4")
                ),
                item_5 := factories.IndexParsedListItemFactory(
                    whitespace_count=1, reference_value=(value_5 :=f"{value_4}/file5.md")
                ),    
            ),
            ("dir", "dir", "file", "dir", "file"),
            (
                factories.IndexContentsListItemFactory(
                    hierarchy=1,
                    reference_title=item_1.reference_title,
                    reference_value=value_1,
                    rank=item_1.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=2,
                    reference_title=item_2.reference_title,
                    reference_value=value_2,
                    rank=item_2.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=3,
                    reference_title=item_3.reference_title,
                    reference_value=value_3,
                    rank=item_3.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=1,
                    reference_title=item_4.reference_title,
                    reference_value=value_4,
                    rank=item_4.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=2,
                    reference_title=item_5.reference_title,
                    reference_value=value_5,
                    rank=item_5.rank,
                ),
            ),
            id="directory following a file nested twice",
        )

@jdkandersson I would love to hear your thoughts about this and get some more insights into how the hierarchy calculation is designed! I haven't raised a PR because I'm not sure if this is the best approach, and I feel like there are probably more considerations to take into account when modifying this logic.