QubesOS / qubes-issues

The Qubes OS Project issue tracker
https://www.qubes-os.org/doc/issue-tracking/

Automatically generate as much of the documentation's "metadata" as possible #6701

Closed andrewdavidwong closed 3 years ago

andrewdavidwong commented 3 years ago

The problem you're addressing (if any)

Doc maintenance is a painful burden, and it's only getting worse. There are many places where things have to be kept in sync manually, and there's no good reason for this when it could be done automatically.

Describe the solution you'd like

Some specific examples:

Where is the value to a user, and who might that user be?

This is mainly of value to people who work on the docs, including contributors. It would greatly simplify things to handle this stuff automatically. Readers would also benefit from a more robust doc system with fewer errors (due to less room for human error to introduce problems).

Describe alternatives you've considered

There are existing doc management systems that have already solved these problems, but we're probably beyond that point.

Relevant documentation you've consulted

N/A

Related, non-duplicate issues

https://github.com/QubesOS/qubes-issues/issues/5308

andrewdavidwong commented 3 years ago

@tokideveloper, @maiska: Would this interfere with your website localization work?

andrewdavidwong commented 3 years ago

Ideally, we could just de-slugify the filename to form the title automatically, for example.

This may not be possible, since slugifying is, by its nature, a lossy conversion. If not feasible, perhaps we could have a script that automatically renames every file with its slugified page title. (In cases in which the name is already correct, the rename would be a NOP.)

I'm still going to try it, though, since we might get sufficiently good results from simply replacing hyphens with spaces and capitalizing the first letter, for example.

Update: Tested it out locally. I think we need more control over capitalization than this would allow. For example, "XSA" in a title needs to be in all caps, which this would not support. Scrapping this specific idea, but it's still better to reduce the duplication from three things down to two. Filenames are ineliminable, and having the title in the YAML frontmatter is useful, so let's get rid of h1 headings in the body text.
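To illustrate the capitalization problem described above, here is a minimal Python sketch of the "replace hyphens with spaces and capitalize the first letter" approach. It is not the code that was actually tested, and the slug xsa-summary is a hypothetical filename used only for illustration.

```python
def deslugify(slug: str) -> str:
    """Turn a filename slug into a title: hyphens become spaces,
    and only the first letter is capitalized."""
    words = slug.replace("-", " ")
    return words[:1].upper() + words[1:]


# Works acceptably for ordinary titles.
print(deslugify("copy-paste"))

# Hypothetical slug: the acronym cannot be recovered, because the
# slug no longer records that "XSA" was written in all caps.
print(deslugify("xsa-summary"))
```

The second call yields "Xsa summary" rather than "XSA summary", which is the loss of information that makes slugifying a one-way conversion.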

tokideveloper commented 3 years ago

@tokideveloper, @maiska: Would this interfere with your website localization work?

Here are my IMHO thoughts:

Redundancy

Permalink

Breadcrumb link trail

Title

YAML front matter in general

Doc index

Having every link and title for the doc index in a YAML file is redundant. We should just generate this from the directory structure and filenames.

Summary

Ideas

marmarek commented 3 years ago

I agree with @tokideveloper regarding permalinks - while they may look like a duplication, they are useful. Similarly for the title - slugifying is lossy. Keeping them in sync in the other way (automatically rename based on the title) may make reading git history a bit harder, especially with the web interface (not a huge issue, but still an issue). But having both title: in YAML and h1 should not be needed - the latter should be generated based on the former.

I think the biggest issue is the doc index - it duplicates page titles (in a separate file, so very easy to desynchronize), needs to be updated each time some page is added/moved/removed and generally sounds like something that should be generated, not hand-crafted.

andrewdavidwong commented 3 years ago

Permalink

  • Currently, our localization scripts heavily rely on explicit permalinks. Making them implicit means that we have to re-calc them which is not so nice and error-prone.

It shouldn't be error-prone, since the re-calc is deterministic and allows no room for human error.

  • An explicit permalink is good for translators since they can find the page they are translating relatively easily by entering the permalink in their browser's address bar. This is in a sense necessary since the directory path is lost (in a sense) when uploading the related MD file to Transifex.

Oof. That sounds like a shortcoming of the translation workflow. I don't think it's a good idea for translators to rely on something like a permalink in a YAML header. Most other projects don't have that. How do they keep things organized?

  • A file with an explicit permalink can be sent via email without losing the context. Note that the permalink (and also the directory path) is in a sense a context hint.

Except the permalink almost never provides any context. Most of our permalinks are just /doc/<title>.

  • Note that it's a little bit harder to move a page to another URL with implicit permalinks. The old implicit permalink must be generated (or manually copied from the address bar) and put to redirect_from.

True.

  • Does the implicit permalink contain the .md file ending? If yes then it's not so nice.

No.

Breadcrumb link trail

  • Breadcrumbs are possible to implement even with permalink being explicitly specified. Independently from the permalink, we could use the page.relative_path Jekyll variable and split it up to generate breadcrumbs. (*)

But then it's yet another thing to keep manually in sync, or else it will make no sense. It makes no sense for the breadcrumb trail not to match the URL path, and it's definitely not worth it if I have to keep both in sync manually.

page.relative_path Jekyll variable

No such variable found. Maybe you meant page.path? But then I don't understand the suggestion.

YAML front matter in general

  • How many times has someone to touch the YAML front matter of an MD file during its life cycle so that it could get out of sync? Has this happened in the past?

It happens all the time, and there is a lot of desync in qubes-doc right now. It's not just about touching the YAML frontmatter. That's just one part that can get desynced from other parts.

Doc index

Having every link and title for the doc index in a YAML file is redundant. We should just generate this from the directory structure and filenames.

  • This should be feasible. See (*) above.

See above. I don't understand this suggestion.

Ideas

  • Concerning the permalink: Would it be feasible to move all *.md files to their own directories? Like moving /doc/user/common-tasks/copy-paste.md to /doc/user/common-tasks/copy-paste/index.md such that the directory path equals the URL path? Maybe this could help making the permalink redundant. (For translation purposes, index.md is not a good name since all files uploaded to Transifex would look the same.)

Aren't you contradicting your own earlier points? I don't see how this would help.

andrewdavidwong commented 3 years ago

I agree with @tokideveloper regarding permalinks - while they may look like a duplication, they are useful.

I don't deny that they are useful. I'm just pointing out that they make maintenance and contributing burdensome.

How do you propose we keep permalinks in sync with the directory structure?

Or, if we can't/won't do that, then what if we just have two directories in qubes-doc -- user and developer -- and dump all *md files into those two directories without any further subdirectories?

(I think we can simply delete external after a while. There is no need to maintain those redirects forever. They were only intended to be temporary anyway.)

andrewdavidwong commented 3 years ago

I think the biggest issue is the doc index - it duplicates page titles (in a separate file, so very easy to desynchronize), needs to be updated each time some page is added/moved/removed and generally sounds like something that should be generated, not hand-crafted.

I've spent many, many hours trying to figure out a way to do this, and I haven't been able to come up with anything that works well enough. The main problem is that expressions like {% if page.path contains {{ section.dir }} %} simply don't work. They generate nothing, presumably because contains requires a string and won't accept a variable there. So we'd have to hard-code each section/directory (and then we could at least auto-gen the list of pages under it), but that's still too much hardcoding, unless we radically flatten the doc structure (as mentioned above).

It also doesn't help that I don't understand most of the new localization code.

andrewdavidwong commented 3 years ago

Another idea would be to have only links in the YAML file.

Good idea. It's only halfway but would still be an improvement. I'll see if I can get it to work.

tokideveloper commented 3 years ago

Permalink

  • Currently, our localization scripts heavily rely on explicit permalinks. Making them implicit means that we have to re-calc them which is not so nice and error-prone.

It shouldn't be error-prone, since the re-calc is deterministic and allows no room for human error.

It's deterministic but I thought of the function "slugify". Slugifying the title would mean that we had to know how exactly the "slugify" function works. It's not clear to me yet. But if we don't need to slugify anything then it should be feasible and not error-prone.

  • An explicit permalink is good for translators since they can find the page they are translating relatively easily by entering the permalink in their browser's address bar. This is in a sense necessary since the directory path is lost (in a sense) when uploading the related MD file to Transifex.

Oof. That sounds like a shortcoming of the translation workflow. I don't think it's a good idea for translators to rely on something like a permalink in a YAML header. Most other projects don't have that. How do they keep things organized?

Okay, the Transifex config file exists where the mapping is listed. But here, I thought of an easy way for translators to get the website they are translating. It's not mandatory to have such a nice feature (and there may be other ways) but I like it.

  • A file with an explicit permalink can be sent via email without losing the context. Note that the permalink (and also the directory path) is in a sense a context hint.

Except the permalink almost never provides any context. Most of our permalinks are just /doc/<title>.

I agree.

  • Note that it's a little bit harder to move a page to another URL with implicit permalinks. The old implicit permalink must be generated (or manually copied from the address bar) and put to redirect_from.

True.

  • Does the implicit permalink contain the .md file ending? If yes then it's not so nice.

No.

Breadcrumb link trail

  • Breadcrumbs are possible to implement even with permalink being explicitly specified. Independently from the permalink, we could use the page.relative_path Jekyll variable and split it up to generate breadcrumbs. (*)

But then it's yet another thing to keep manually in sync, or else it will make no sense. It makes no sense for the breadcrumb trail not to match the URL path, and it's definitely not worth it if I have to keep both in sync manually.

Oh, sorry, I thought of the case when the permalink matches the file path but I didn't write it. Sorry!

Actually, the suggestion is trivial: If the permalinks match the file paths then the variables page.url and page.path are almost the same and thus, can be used interchangeably. This is what I wanted to say.

page.relative_path Jekyll variable

No such variable found. Maybe you meant page.path? But then I don't understand the suggestion.

I took it from here.

YAML front matter in general

  • How many times has someone to touch the YAML front matter of an MD file during its life cycle so that it could get out of sync? Has this happened in the past?

It happens all the time, and there is a lot of desync in qubes-doc right now. It's not just about touching the YAML frontmatter. That's just one part that can get desynced from other parts.

Okay. This should be solved.

Doc index

Having every link and title for the doc index in a YAML file is redundant. We should just generate this from the directory structure and filenames.

  • This should be feasible. See (*) above.

See above. I don't understand this suggestion.

See my sorry above.

Ideas

  • Concerning the permalink: Would it be feasible to move all *.md files to their own directories? Like moving /doc/user/common-tasks/copy-paste.md to /doc/user/common-tasks/copy-paste/index.md such that the directory path equals the URL path? Maybe this could help making the permalink redundant. (For translation purposes, index.md is not a good name since all files uploaded to Transifex would look the same.)

Aren't you contradicting your own earlier points? I don't see how this would help.

In a sense, I'm contradicting my earlier points, yes. Actually, I tried to make a step towards you and find a solution.

tokideveloper commented 3 years ago

Describe the solution you'd like

Some specific examples:

  • Using permalink: in the YAML frontmatter is worse than just letting Jekyll generate the permalink based on the directory path and filename. Our Jekyll config is already set to do this. We just need to delete all the permalink: lines and set up redirects from the existing URLs. This would also allow us to auto-generate a breadcrumb link trail at the top of each doc page, which would make navigation easier.

Instead of letting Jekyll generate the implicit permalinks, we could write an extern (Python?) script that does the work and produces explicit permalinks based on the directory path and filename (and puts obsolete permalinks into redirect_from). This way, we could combine the best of both worlds. The script should then be run by Travis.
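A rough sketch of what such an external script might do for a single page is shown below. It is only an illustration under several assumptions: front matter is represented as an already-parsed dict, paths are relative to the doc repo root, URLs get a /doc prefix and a trailing slash, and redirect_from holds a list. The function names permalink_for and update_front_matter are hypothetical.

```python
from pathlib import PurePosixPath


def permalink_for(md_path: str, prefix: str = "/doc") -> str:
    """Derive an explicit permalink from a repo-relative Markdown path.

    Assumption: the URL is the prefix plus the path without its .md suffix.
    """
    rel = PurePosixPath(md_path).with_suffix("")
    return f"{prefix}/{rel.as_posix()}/"


def update_front_matter(front: dict, md_path: str) -> dict:
    """Return a copy of the front matter with the permalink recomputed.

    If the permalink changes, the obsolete one is appended to
    redirect_from so that existing links keep working.
    """
    new = permalink_for(md_path)
    old = front.get("permalink")
    updated = dict(front)
    if old and old != new:
        redirects = list(front.get("redirect_from", []))
        if old not in redirects:
            redirects.append(old)
        updated["redirect_from"] = redirects
    updated["permalink"] = new
    return updated


front = {"title": "Copy and paste", "permalink": "/doc/copy-paste/"}
print(update_front_matter(front, "user/common-tasks/copy-paste.md"))
```

Running the function a second time on its own output leaves the front matter unchanged, so the script could be re-run safely on every build.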

andrewdavidwong commented 3 years ago

It's deterministic but I thought of the function "slugify". Slugifying the title would mean that we had to know how exactly the "slugify" function works. It's not clear to me yet. But if we don't need to slugify anything then it should be feasible and not error-prone.

Oh, slugifying is a pretty common and straightforward thing. Have a look here: https://jekyllrb.com/docs/liquid/filters/

But then it's yet another thing to keep manually in sync, or else it will make no sense. It makes no sense for the breadcrumb trail not to match the URL path, and it's definitely not worth it if I have to keep both in sync manually.

Oh, sorry, I thought of the case when the permalink matches the file path but I didn't write it. Sorry!

Actually, the suggestion is trivial: If the permalinks match the file paths then the variables page.url and page.path are almost the same and thus, can be used interchangeably. This is what I wanted to say.

The problem is that they currently almost never match, but it would be more organized and consistent if they did.

page.relative_path Jekyll variable

No such variable found. Maybe you meant page.path? But then I don't understand the suggestion.

I took it from here.

FWIW, page.path and page.relative_path return the same result for our doc pages (just tested).

In any case, this only works well when the URL and directory path are the same.

Ideas

  • Concerning the permalink: Would it be feasible to move all *.md files to their own directories? Like moving /doc/user/common-tasks/copy-paste.md to /doc/user/common-tasks/copy-paste/index.md such that the directory path equals the URL path? Maybe this could help making the permalink redundant. (For translation purposes, index.md is not a good name since all files uploaded to Transifex would look the same.)

Aren't you contradicting your own earlier points? I don't see how this would help.

In a sense, I'm contradicting my earlier points, yes. Actually, I tried to make a step towards you and find a solution.

Thank you, but I don't think this would be necessary, because auto-generating the URL from the path works without doing this.


Instead of letting Jekyll generate the implicit permalinks, we could write an extern (Python?) script that does the work and produces explicit permalinks based on the directory path and filename (and puts obsolete permalinks into redirect_from). This way, we could combine the best of both worlds. The script should then be run by Travis.

Yes, I think something like this would be very helpful.

tokideveloper commented 3 years ago

I think the biggest issue is the doc index - it duplicates page titles (in a separate file, so very easy to desynchronize), needs to be updated each time some page is added/moved/removed and generally sounds like something that should be generated, not hand-crafted.

I've spent many, many hours trying to figure out a way to do this, and I haven't been able to come up with anything that works well enough. The main problem is that expressions like {% if page.path contains {{ section.dir }} %} simply don't work. They generate nothing, presumably because contains requires a string and won't accept a variable there. So we'd have to hard-code each section/directory (and then we could at least auto-gen the list of pages under it), but that's still too much hardcoding, unless we radically flatten the doc structure (as mentioned above).

Sadly, the Liquid Template Language is very limited. Maybe, the doc index could be produced via an extern script (run by Travis)?
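As a sketch of what an external index generator could look like, the following Python groups pages by their directory and emits plain Markdown, sidestepping the Liquid contains limitation. The page records, the heading style, and the alphabetical ordering are all assumptions made for illustration; a real script would first have to read each file's front matter.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def render_index(pages):
    """Render a plain-Markdown index, one section per directory.

    pages: list of dicts with "path", "title" and "permalink" keys
    (assumed to have been extracted from each file beforehand).
    """
    sections = defaultdict(list)
    for page in pages:
        section = PurePosixPath(page["path"]).parent.as_posix()
        sections[section].append(page)

    lines = []
    for section in sorted(sections):
        lines.append(f"## {section}")
        for page in sorted(sections[section], key=lambda p: p["title"]):
            lines.append(f"- [{page['title']}]({page['permalink']})")
        lines.append("")
    return "\n".join(lines)


pages = [
    {"path": "user/common-tasks/copy-paste.md",
     "title": "Copy and paste", "permalink": "/doc/copy-paste/"},
]
print(render_index(pages))
```

Because sections and links are simply sorted, this approach gives no manual control over their order.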

andrewdavidwong commented 3 years ago

Sadly, the Liquid Template Language is very limited. Maybe, the doc index could be produced via an extern script (run by Travis)?

I was able to make some decent progress using your suggestion (remove the titles; use only the URLs). It's partial automation at least.

tokideveloper commented 3 years ago

Instead of letting Jekyll generate the implicit permalinks, we could write an extern (Python?) script that does the work and produces explicit permalinks based on the directory path and filename (and puts obsolete permalinks into redirect_from). This way, we could combine the best of both worlds. The script should then be run by Travis.

Yes, I think something like this would be very helpful.

Then let's write one! But first, let's ask @marmarek if it's fine.


Sadly, the Liquid Template Language is very limited. Maybe, the doc index could be produced via an extern script (run by Travis)?

I was able to make some decent progress using your suggestion (remove the titles; use only the URLs). It's partial automation at least.

If you are pleased with the result then we could use it. If not then: Let's write an external script! ;-)

andrewdavidwong commented 3 years ago

I was able to make some decent progress using your suggestion (remove the titles; use only the URLs). It's partial automation at least.

If you are pleased with the result then we could use it. If not then: Let's write an external script! ;-)

They would be complementary, not exclusive. ;)

tokideveloper commented 3 years ago

I was able to make some decent progress using your suggestion (remove the titles; use only the URLs). It's partial automation at least.

If you are pleased with the result then we could use it. If not then: Let's write an external script! ;-)

They would be complementary, not exclusive. ;)

Ah, okay. I had an external script in mind which produces a plain Markdown file showing the index without magic in it. I didn't think of the YAML file containing only the URLs. Sorry, I should be more expressive.

andrewdavidwong commented 3 years ago

I was able to make some decent progress using your suggestion (remove the titles; use only the URLs). It's partial automation at least.

If you are pleased with the result then we could use it. If not then: Let's write an external script! ;-)

They would be complementary, not exclusive. ;)

Ah, okay. I had an external script in mind which produces a plain Markdown file showing the index without magic in it.

Not sure what you mean. What magic?

(BTW, the current index is a YAML file, not a Markdown file.)

I didn't think of the YAML file containing only the URLs.

The idea is that the index contains the URLs of doc pages but not the titles. Title of each page is grabbed from the title: in the YAML frontmatter of each doc file.

However, this means we have to manually edit the permalink of every single doc file. If we ever wanted to change the permalinks to match the directory structure, for example, this means we would have to edit every single one. This is where your python script would come in handy!

However, now that I think about it, I'm not sure if we really want to change all the permalinks to match the directory structure, since we often do not have a page for each intermediate step. For example, consider this hypothetical URL:

/doc/user/troubleshooting/disk-troubleshooting/

A visitor might expect something at each of these URLs:

/doc/user/
/doc/user/troubleshooting/

But there's nothing at either of those, because they're simply sections on the /doc/ page, and we probably don't want to bother to make them. So maybe the breadcrumb and change-permalinks-to-match-directory-structure ideas aren't worth it.
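The gap described above can be checked mechanically. The sketch below lists every intermediate URL that a breadcrumb trail would link to and reports those with no page behind them. The function names and the set of existing URLs are hypothetical.

```python
from typing import List, Set


def ancestor_urls(url: str) -> List[str]:
    """Return every intermediate URL above the given page URL."""
    parts = [p for p in url.strip("/").split("/") if p]
    return ["/" + "/".join(parts[:i]) + "/" for i in range(1, len(parts))]


def missing_ancestors(url: str, existing: Set[str]) -> List[str]:
    """Return the intermediate URLs for which no page exists."""
    return [u for u in ancestor_urls(url) if u not in existing]


# Assumption for illustration: only the /doc/ index page exists.
existing_pages = {"/doc/"}
print(missing_ancestors("/doc/user/troubleshooting/disk-troubleshooting/",
                        existing_pages))
```

With only /doc/ present, the two URLs reported as missing are /doc/user/ and /doc/user/troubleshooting/, matching the example above.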

marmarek commented 3 years ago

Note that it's a little bit harder to move a page to another URL with implicit permalinks. The old implicit permalink must be generated (or manually copied from the address bar) and put to redirect_from.

That actually I worry about quite a lot. If renaming file or changing its title (depending on how title will be related to the file name) will invalidate its original URL, it will be very easy to break links (our CI will detect internal issues, but we can't possibly find all the links from outside of our website).

The idea of having explicit permalink, that is kept in sync with directory structure with a script (and that script also cares about adding relevant redirect_from) may work indeed. One issue with that is Travis^WGitlab-CI can't commit things. But I think we can make it post a review with required changes as "suggestions". This may be not entirely trivial, for example it needs to avoid posting the same suggestion over and over... Alternative is just to complain when things are broken, suggesting a change in just job log.

As a general direction of permalinks matching directory structure, I'm not really sure. On one hand, they will ease finding actual source page by just looking at the URL (something that I personally find hard with the current state), and also will make it clearer which documentation section is it (most useful for "user" / "developer" distinction, IMO less about more detailed categories). On the other hand, they will be longer, and as Andrew just pointed out, not always intermediate parts makes sense. Plus, attempting to keep them in sync with the directory structure (with whatever method we choose) will be some effort, even if just one time writing a script.

As for the index, per-section generated index, like you did in https://github.com/QubesOS/qubes-doc/commit/68f6f96220038cd6330f4fdb7608b78ce8f2cb51 would be a massive improvement already. The only remaining manual work would be adjusting sections themselves (reordering them, changing their titles etc). A lot less frequent work. I would call it good enough. Anyway, you can try {% if page.path contains section.dir %} (without inner {{ ... }}).
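The simpler alternative mentioned above, where CI only complains in the job log, might look roughly like this. It is a sketch, not an existing CI job: the mapping of paths to permalinks is assumed to be collected elsewhere, the /doc prefix and trailing slash are assumptions, and the function names are hypothetical.

```python
from pathlib import PurePosixPath


def expected_permalink(md_path: str, prefix: str = "/doc") -> str:
    """Permalink implied by the directory structure (assumed convention)."""
    return f"{prefix}/{PurePosixPath(md_path).with_suffix('').as_posix()}/"


def find_mismatches(pages: dict) -> list:
    """pages maps repo-relative paths to their explicit permalinks.

    Returns one human-readable complaint per page whose permalink
    differs from the one implied by its path.
    """
    complaints = []
    for path, permalink in sorted(pages.items()):
        want = expected_permalink(path)
        if permalink != want:
            complaints.append(
                f"{path}: permalink is {permalink}, expected {want}")
    return complaints


def main(pages: dict) -> int:
    """Print complaints to the job log and return a CI exit code."""
    problems = find_mismatches(pages)
    for line in problems:
        print(line)
    return 1 if problems else 0
```

A CI wrapper would pass the return value of main to sys.exit so that the job fails whenever a complaint is printed.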

andrewdavidwong commented 3 years ago

Anyway, you can try {% if page.path contains section.dir %} (without inner {{ ... }}).

I also tried that. Didn't work, presumably because it only accepts a string.

But it's fine. I have a good-enough solution for the time being.

tokideveloper commented 3 years ago

I was able to make some decent progress using your suggestion (remove the titles; use only the URLs). It's partial automation at least.

If you are pleased with the result then we could use it. If not then: Let's write an external script! ;-)

They would be complementary, not exclusive. ;)

Ah, okay. I had an external script in mind which produces a plain Markdown file showing the index without magic in it.

Not sure what you mean. What magic?

Liquid code.


Note that it's a little bit harder to move a page to another URL with implicit permalinks. The old implicit permalink must be generated (or manually copied from the address bar) and put to redirect_from.

That actually I worry about quite a lot. If renaming file or changing its title (depending on how title will be related to the file name) will invalidate its original URL, it will be very easy to break links (our CI will detect internal issues, but we can't possibly find all the links from outside of our website).

Now, I also think of our translated *.md files. Their permalinks look like /de/doc/... and /pl/doc/.... But the files don't reside at /de/doc/... or /pl/doc/... in the doc repo. So, permalinks can't reflect the directory structure of our translated files which would lead to inconsistency with the general idea of permalinks matching directory structure.

The idea of having explicit permalink, that is kept in sync with directory structure with a script (and that script also cares about adding relevant redirect_from) may work indeed. One issue with that is Travis^WGitlab-CI can't commit things. But I think we can make it post a review with required changes as "suggestions". This may be not entirely trivial, for example it needs to avoid posting the same suggestion over and over... Alternative is just to complain when things are broken, suggesting a change in just job log.

Sounds complicated and like another downside.

As for the index, per-section generated index, like you did in QubesOS/qubes-doc@68f6f96 would be a massive improvement already. The only remaining manual work would be adjusting sections themselves (reordering them, changing their titles etc). A lot less frequent work. I would call it good enough.

This is something I still didn't get yet. If we have Liquid code that generates the index items then the result exists only at Jekyll runtime and thus we can't change the ordering manually afterwards. Or am I wrong? Can someone explain it, please?

andrewdavidwong commented 3 years ago

Liquid code.

Not allowed to use liquid inside of YAML anyway.

This is something I still didn't get yet. If we have Liquid code that generates the index items then the result exists only at Jekyll runtime and thus we can't change the ordering manually afterwards. Or am I wrong? Can someone explain it, please?

Correct. This is a downside of a fully-auto-generated doc index: no control over the order of the links. My current approach (inspired by you) solves this by only partially automating it (hand-crafted list of links, in order, with the rest automated).
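The partially automated approach described above can be sketched in a few lines of Python. The real implementation lives in the site's Liquid layout; this is only an analogy, and the page data, titles, and the name build_index are hypothetical.

```python
def build_index(ordered_urls, pages):
    """Resolve a hand-ordered list of URLs to (title, url) pairs.

    The order comes from the hand-crafted list; each title is looked up
    from the page's own front matter, so it is stored in only one place.
    """
    titles = {page["permalink"]: page["title"] for page in pages}
    return [(titles[url], url) for url in ordered_urls]


# Hypothetical front matter gathered from two doc files.
pages = [
    {"permalink": "/doc/disk-troubleshooting/", "title": "Disk troubleshooting"},
    {"permalink": "/doc/copy-paste/", "title": "Copy and paste"},
]

# Hand-crafted order, URLs only.
index = build_index(["/doc/copy-paste/", "/doc/disk-troubleshooting/"], pages)
for title, url in index:
    print(f"{title}: {url}")
```

A URL in the list with no matching page raises a KeyError, which would surface a stale index entry instead of silently rendering a broken link.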

tokideveloper commented 3 years ago

Liquid code.

Not allowed to use liquid inside of YAML anyway.

No, no, no, you got me wrong. I thought of an external script that produces the doc index page in plain Markdown without Liquid code. Currently, the doc index page is generated via the layout doc-index which contains Liquid code. I, instead, thought of moving that Liquid code to an external (Python?) script because of the contains issue you mentioned. But since you found a solution to that issue, it's no longer necessary.

This is something I still didn't get yet. If we have Liquid code that generates the index items then the result exists only at Jekyll runtime and thus we can't change the ordering manually afterwards. Or am I wrong? Can someone explain it, please?

Correct. This is a downside of a fully-auto-generated doc index: no control over the order of the links. My current approach (inspired by you) solves this by only partially automating it (hand-crafted list of links, in order, with the rest automated).

Thank you! :+1:

andrewdavidwong commented 3 years ago

I thought of a better way to create breadcrumb navigation trails: https://github.com/QubesOS/qubesos.github.io/commit/07dbc013e425ff2fafd26a1a51fbcd0a45f6b8f2.

ninavizz commented 3 years ago

Having title: in the YAML frontmatter and having an h1 heading in the body and having the title in the filename is triply redundant and often gets out of sync. Better to use just one. Ideally, we could just de-slugify the filename to form the title automatically, for example.

It's not redundant. I totally respect the need for easy maintenance. Styleguides exist, to guide how people should name things. There is no writing styleguide to guide the docs, that I know of. One should really exist, to contain the chaos you're trying to avoid.

andrewdavidwong commented 3 years ago

Having title: in the YAML frontmatter and having an h1 heading in the body and having the title in the filename is triply redundant and often gets out of sync. Better to use just one. Ideally, we could just de-slugify the filename to form the title automatically, for example.

It's not redundant. I totally respect the need for easy maintenance. Styleguides exist, to guide how people should name things. There is no writing styleguide to guide the docs, that I know of. One should really exist, to contain the chaos you're trying to avoid.

Of course there is. It's right here:

https://www.qubes-os.org/doc/doc-guidelines/

Unfortunately, many people don't read it, which is why it's not sufficient. For better or worse, we don't have the time or workforce to write all of the documentation ourselves, so it's a community volunteer effort. Hence, our general policy regarding doc PRs is: "If accepting a PR would have a net positive effect, then accept it, even if it doesn't follow all the rules or is flawed." If we required PRs to follow all the rules before accepting them, many contributors would not be able or willing to ever fix them correctly, and we'd just lose out on those contributions, meaning we'd forgo a net benefit each time. This is just another example of not allowing the perfect to be the enemy of the good. Since there are countless little things that can go wrong with a doc PR that aren't enough to merit rejection, the cumulative benefit of automating maintenance is enormous.

andrewdavidwong commented 3 years ago

There is no writing styleguide to guide the docs, that I know of.

Of course there is. It's right here:

https://www.qubes-os.org/doc/doc-guidelines/

Reorganized doc guidelines to address this (https://github.com/QubesOS/qubes-doc/commit/c29cf40910ff68a2c6c9585ca39d73584e48b31b, https://github.com/QubesOS/qubesos.github.io/commit/a540575f0e1df720254ab1951530864a77676b29).

Old URL above now redirects to this new URL:

https://www.qubes-os.org/doc/documentation-style-guide/