canonical / discourse-gatekeeper

Experimental GitHub Action to upload charm documentation to charmhub
Apache License 2.0
7 stars 7 forks source link

Issues with ToC parsing and importing pages. #249

Open a-velasco opened 4 weeks ago

a-velasco commented 4 weeks ago

Issues

  1. The action fails to parse this index.md table of contents.
  2. Running the action on a repo with no /docs folder creates a complete table of contents in index.md, but only imports some of those pages into the repo. Solved: see this comment.

Steps taken

The repository has an existing /docs folder with all the pages published to Charmhub. This migration was done by an old fork of the discourse-gatekeeper action.

Issue 1: Table of contents nesting

We updated our workflow from the fork to canonical/discourse-gatekeeper@stable. Then, we re-triggered the workflow with no changes to discourse or GitHub.

The action failed due to an index/ToC related error (link to logs):

src.exceptions.InputError: A nested item is a reference to a path that is not within the directory of its parent. item=_ParsedListItem(whitespace_count=0, reference_title='Reference', reference_value='reference', rank=19, hidden=False), expected parent path: PosixPath('how-to')

It fails on this item of the contents table.

The same error happens with canonical/discourse-gatekeeper@main.

Issue 2 (solved): Action does not import all documentation pages

See this comment

We deleted the entire `/docs` directory to allow the action to re-generate all docs from scratch with the expected structure. However, the action only downloaded a select few pages. [This older revision](https://github.com/canonical/mysql-router-k8s-operator/tree/1a28465fcd3993d69aaae8999945f1a7085350a8/docs) shows the expected pages. [This newer revision](https://github.com/canonical/mysql-router-k8s-operator/tree/d5b588b0489ae38b0f8a391051f88dec7f028ffe/docs) shows the result of deleting `/docs` and running the action. This might be an entirely different issue altogether. It happens with both `main` and `stable` branches of discourse-gatekeeper. After this, subsequent runs of the action ([example run](https://github.com/canonical/mysql-router-k8s-operator/actions/runs/9545958549/job/26307833786)) yield the following error: ```shell gatekeeper.exceptions.InputError: An item is not a file, directory or external HTTP resource. item=_ParsedListItem(whitespace_count=2, reference_title='1. Introduction', reference_value='tutorial/t-overview.md', rank=1, hidden=False) ``` The item referenced in the error above is the first page that exists in the table of contents but was not correctly imported to GitHub, so I guess that is why it could not be parsed.
a-velasco commented 3 weeks ago

Issue 2 resolved: We had not deleted the /base-content tag when resetting the docs. We made sure to delete it completely before running the action, and it worked as expected (imported all documentation pages)

a-velasco commented 3 weeks ago

RE: Issue 1

I think this might be due to the logic in the _calculate_contents_hierarchy function.

When checking if the current item has a lower whitespace count than expected, it seems to assume the item is just one level away, so it always corrects the hierarchy by subtracting 1.

In a case where it is two or more levels away, its hierarchy value will be incorrect, so the _check_contents_item function will complain that this item isn't part of the group that its hierarchy indicates (i.e. raises that InputError I quoted in the OP)

I've had some success by modifying _calculate_contents_hierarchy like so:

while item:
        # All items in the current directory have been processed
        if item.whitespace_count < whitespace_expectation_per_level[hierarchy]:
+           if item.whitespace_count == 0:
+               hierarchy = 0
+               aggregate_dir = Path('.')
+           else:
+               hierarchy = hierarchy - 1
+               parent = parents.pop()
+               aggregate_dir = Path(parent.reference_value).parent
-           hierarchy = hierarchy - 1
-           parent = parents.pop()
-           aggregate_dir = Path(parent.reference_value).parent

This passes the unit tests in tests/unit/test_index_contents_hierarchy.py, and parses the contents as expected when I test it on the repository's CI.

I also added a new test case to simulate this scenario:

        pytest.param(
            (
                item_1 := factories.IndexParsedListItemFactory(
                    whitespace_count=0, reference_value=(value_1 := "dir1")
                ),
                item_2 := factories.IndexParsedListItemFactory(
                    whitespace_count=1, reference_value=(value_2 := f"{value_1}/dir2")
                ),
                item_3 := factories.IndexParsedListItemFactory(
                    whitespace_count=2, reference_value=(value_3 := f"{value_2}/file3.md")
                ),
                item_4 := factories.IndexParsedListItemFactory(
                    whitespace_count=0, reference_value=(value_4 := "dir4")
                ),
                item_5 := factories.IndexParsedListItemFactory(
                    whitespace_count=1, reference_value=(value_5 :=f"{value_4}/file5.md")
                ),    
            ),
            ("dir", "dir", "file", "dir", "file"),
            (
                factories.IndexContentsListItemFactory(
                    hierarchy=1,
                    reference_title=item_1.reference_title,
                    reference_value=value_1,
                    rank=item_1.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=2,
                    reference_title=item_2.reference_title,
                    reference_value=value_2,
                    rank=item_2.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=3,
                    reference_title=item_3.reference_title,
                    reference_value=value_3,
                    rank=item_3.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=1,
                    reference_title=item_4.reference_title,
                    reference_value=value_4,
                    rank=item_4.rank,
                ),
                factories.IndexContentsListItemFactory(
                    hierarchy=2,
                    reference_title=item_5.reference_title,
                    reference_value=value_5,
                    rank=item_5.rank,
                ),
            ),
            id="directory following a file nested twice",
        )

@jdkandersson I would love to hear your thoughts about this and get some more insights into how the hierarchy calculation is designed! I haven't raised a PR because I'm not sure if this is the best approach, and I feel like there are probably more considerations to take into account when modifying this logic.

jdkandersson commented 2 weeks ago

I'm taking a look now

jdkandersson commented 2 weeks ago

I think the reduction of level by 1 shouldn't be the problem since, if the whitespace count is still lower than expected, on the next loop it should be popped again. I haven't had the time yet, I would write a test potentially here: https://github.com/canonical/discourse-gatekeeper/blob/50d13cbb5123dd25f87896975044f2e3b3bdf5ca/tests/unit/test_index_contents_parse.py#L522 with the case and see if it fails and check why

jdkandersson commented 2 weeks ago

actually, you already wrote the test, I'll use that as a starting point and see if I can fix the issue

jdkandersson commented 2 weeks ago

Ok, also got the test failure, now need to check why

jdkandersson commented 2 weeks ago

Part of the reason why I'm not sure we should use the proposed code is that it only takes care of the case where the whitespace count is 0, it should also be solved for cases where there is a 2 or more drop in hierarchy when the parent isn't at the root level

a-velasco commented 2 weeks ago

@jdkandersson Thanks a lot for taking a look!

I wrote that code to check whether this was the part of the logic that was failing for multiple level jumps, but if you agree that this is where the hierarchy mis-calculation is happening, then I definitely agree that it should be generalized for other cases.

I'll try to come up with a more clean and general solution, and submit a PR.

jdkandersson commented 2 weeks ago

The solution is quite simple:

if item.whitespace_count < whitespace_expectation_per_level[hierarchy]:
    hierarchy = hierarchy - 1
    parent = parents.pop()
    aggregate_dir = Path(parent.reference_value).parent
    continue

Just needed to add the continue so that the if statement is evaluated again rather than processing continues after only going up one hierarchy

jdkandersson commented 2 weeks ago

Would you like to raise a PR with this and the new test case you have written?

jdkandersson commented 2 weeks ago

Thanks for finding, reporting and even creating a test case for this!

a-velasco commented 2 weeks ago

Ahh no way, I reached almost the same solution but thought the aggregate directory calculation had to be different. Thanks! I'll be happy to raise the PR with the solution + test case :)