Fix inter- and intra-Document links

ctargett commented 7 years ago

@hossman filed asciidoctor/asciidoctor#1865 and asciidoctor/asciidoctor#1866 as issues around linking between different .adoc files when they've been converted to HTML or PDF.

Depending on comments or resolution to those issues, we'll need to either implement the suggested workarounds, or come up with another solution.

hossman commented 7 years ago

AFAICT asciidoctor/asciidoctor#1866 can be worked around by keeping a fairly flat structure for all included files -- it's currently not causing any problems for us with how the docs are structured (and how the PDF is built) in the repo today.

The problem with the potential workarounds for asciidoctor/asciidoctor#1865 is that they require a lot of careful work to ensure:

we never link to a doc w/o an explicit anchor in the link
we never declare 2 explicit anchors with the same name -- even if they are in diff files
- currently not a problem because of how confluence exported anchors all include the confluence page name in them
we never use the same section name in 2 diff files, unless we also define a unique sectionId for them

...that's a lot of nuance for people to have to keep in mind / pay attention to when writing docs. Especially considering that breaking any of these rules won't cause any obvious problems with the jekyll built copy of the docs -- they will only break intra-doc links in the PDF.

My current thinking is that our "build" process (or a "precommit" process) for the ref guide could build a "single page html doc" in the same way that we would normally build a "single page pdf" doc, and then do some analysis of the HTML (either an off the shelf link checker, or some custom jsoup based code) to validate that:

no "id" is declared more than once in the final HTML
every "intra-doc" link in the final HTML points to a valid and declared "id"

with those 2 checks, we should at least be able to flag any problematic situations, and hopefully give enough context in the error msg so the person currently editing the doc can either:

edit an anchor declaration they just added so it becomes unique
add a new anchor declaration to some header they just added so it gets a unique section id
add an explicit anchor to some link they just added
add a new anchor declaration to some header they just added

(FWIW: I did some brief experimenting with trying to do this type of analysis directly on the adoc files using asciidoctorj and didn't have much success -- but it may be possible)
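
The two checks described above can be sketched roughly as follows. This is only an illustration: it uses Python's standard-library HTML parser as a stand-in for an off-the-shelf link checker or jsoup, and the names `AnchorCollector` and `check_single_page_html` are hypothetical, not part of any existing build.

```python
from collections import Counter
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects every id attribute and every same-page (#fragment) link target."""

    def __init__(self):
        super().__init__()
        self.ids = Counter()        # id value -> number of declarations
        self.fragment_links = []    # fragments referenced by <a href="#...">

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value is None:
                continue
            if name == "id":
                self.ids[value] += 1
            elif name == "href" and tag == "a" and value.startswith("#"):
                self.fragment_links.append(value[1:])


def check_single_page_html(html):
    """Return (duplicate_ids, dangling_fragments) for one single-page HTML document."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    # Check 1: no id may be declared more than once.
    duplicates = sorted(i for i, count in collector.ids.items() if count > 1)
    # Check 2: every intra-doc link must point at a declared id.
    # A bare "#" (empty fragment) is skipped here, which is an assumption of this sketch.
    dangling = sorted({f for f in collector.fragment_links
                       if f and f not in collector.ids})
    return duplicates, dangling
```

A build step could fail whenever either returned list is non-empty and print the offending ids so the person editing the doc knows which anchor or link to fix.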

ctargett commented 7 years ago

I found a couple issues with our settings which were causing some of the linking issues.

When HTML pages or a PDF are created, anchor links are automatically created for every header. Since the page title is an h1, an anchor is created for that too. Even though Asciidoctor cannot yet support empty anchor refs (like page-title.adoc#), this already worked fine for HTML files because it just took you to the top of each separate HTML file, but it was 100% broken in the PDF. If we instead insert the page title as an anchor ID, page-level references within the PDF will work.

I used a couple of attributes in the asciidoctor-pdf section of build.xml to make this work right:

idprefix: this sets the prefix of the auto-generated heading anchor. By default it's an underscore, so anchor refs by default were set to _page_name. I set this to empty so there is no prefix on anchors anymore.
idseparator: this sets the separator between words of a multi-word header. This also defaults to underscore. I set this to a hyphen.

So all auto-generated anchors were like _anchor_id, but now they are anchor-id.
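
To make the effect of the two attributes concrete, here is a rough approximation of how a heading turns into an anchor ID. It is a simplified sketch, not Asciidoctor's actual implementation, and the function name `section_id` is hypothetical.

```python
import re


def section_id(title, idprefix="_", idseparator="_"):
    """Approximate an auto-generated section ID (simplified; not Asciidoctor's exact rules)."""
    text = title.lower()
    # Drop anything that is not a word character, whitespace, period, or hyphen.
    text = re.sub(r"[^\w\s.-]", "", text)
    # Collapse runs of whitespace, periods, and hyphens into one separator.
    text = re.sub(r"[\s.-]+", lambda _match: idseparator, text)
    # Trim separators left dangling at either end.
    if idseparator:
        text = text.strip(idseparator)
    return idprefix + text


# Default settings: underscore prefix and underscore separator.
print(section_id("Getting Started"))            # _getting_started
# Settings described above: empty prefix, hyphen separator.
print(section_id("Getting Started", "", "-"))   # getting-started
```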

This would allow us to change page refs that currently look like this:

<<getting-started.adoc#,Getting Started>>

to look like this:

<<getting-started.adoc#getting-started,Getting Started>>

A simple find-replace across all files could fix this everywhere during the conversion process. In my testing locally, I couldn't find a single case of broken links in the PDF that were caused by conversion - all the bad links I found were also bad in Confluence.

It does not, however, solve any issues with unique IDs across pages, although I also did not see any examples of that being a problem as things are today (IOW, maybe we kick the can down the road on that part of the issue).

ctargett commented 7 years ago

> A simple find-replace across all files could fix this everywhere during the conversion process

Actually, I really meant that a non-basic regular expression could fix this everywhere during the post-conversion manual cleanup.
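
One possible shape for such a regular expression, as a sketch only. It assumes the page-level anchor always equals the file's base name (as in the getting-started example), and `PAGE_REF` and `add_page_anchor` are made-up names.

```python
import re

# Matches "<<path/to/page.adoc#" or "<<path/to/page.adoc" only when no fragment
# follows, i.e. the next characters are "," or ">>".
PAGE_REF = re.compile(r"<<([^<>,#]+)\.adoc#?(?=,|>>)")


def add_page_anchor(text):
    """Give every fragment-less .adoc reference an explicit page-level anchor."""
    def fix(match):
        path = match.group(1)
        # Assumption: the page anchor is the file's base name without extension.
        anchor = path.rsplit("/", 1)[-1]
        return f"<<{path}.adoc#{anchor}"
    return PAGE_REF.sub(fix, text)


print(add_page_anchor("<<getting-started.adoc#,Getting Started>>"))
# <<getting-started.adoc#getting-started,Getting Started>>
```

References that already carry a fragment are left untouched, so the rewrite can be run repeatedly without changing the result.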

hossman commented 7 years ago

it's been a while since i looked at this (and i don't have any of it in front of me at the moment) but i believe you -- in which case the 2 remaining problems i can think of could be solved by some ant tricks to scan the files and fail the build if...

it finds a link that doesn't mention an anchor
the same anchor name is used in more than one file

...both of which would be things the user who just added the link/anchor could fix.
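
A minimal sketch of those two source-level checks, written in Python rather than ant for brevity. It only recognizes explicit `[[anchor]]` declarations (not `[#id]` shorthand or auto-generated section IDs), and `find_problems` is a hypothetical name.

```python
import re
from collections import defaultdict

# An inter-document reference with no fragment after the file name.
LINK_WITHOUT_FRAGMENT = re.compile(r"<<[^<>,#]+\.adoc#?(?=,|>>)")
# An explicit anchor declaration such as [[my-anchor]] or [[my-anchor,Label]].
EXPLICIT_ANCHOR = re.compile(r"\[\[([^\],]+)(?:,[^\]]*)?\]\]")


def find_problems(sources):
    """sources maps file name -> AsciiDoc text; returns a list of error messages."""
    errors = []
    anchor_files = defaultdict(set)   # anchor name -> files that declare it
    for name, text in sorted(sources.items()):
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in LINK_WITHOUT_FRAGMENT.finditer(line):
                errors.append(f"{name}:{lineno}: link has no anchor: {match.group(0)}")
            for match in EXPLICIT_ANCHOR.finditer(line):
                anchor_files[match.group(1)].add(name)
    for anchor, files in sorted(anchor_files.items()):
        if len(files) > 1:
            errors.append(f"anchor '{anchor}' is declared in: {', '.join(sorted(files))}")
    return errors
```

A build wrapper could read every .adoc file into the `sources` mapping and exit non-zero when the returned list is non-empty.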

> Actually, I really meant that a non-basic regular expression could fix this everywhere during the post-conversion manual cleanup.

(or i can fix up the java conversion code to do .. probably easier honestly)

ctargett commented 7 years ago

For the first item "it finds a link that doesn't mention an anchor", we could WARN and automatically replace it with the file name. So, for example, the user makes a change and adds a link like <<new-file.adoc,New File Name>>. We could print a warning and automatically replace it with <<new-file.adoc#new-file,New File Name>>.

I'd be OK if it failed, though. It's just an idea.

The 2nd case should fail.

hossman commented 7 years ago

> We could print a warning and automatically replace it with...

I don't like the idea of the build script editing the *.adoc files that are checked into git ... you might start a build of the site in your terminal, go back to your editor to tweak some things, and then the build file overwrites your changes w/o you realizing it ... etc.

safer just to fail.

alternatively: we allow/encourage links to empty anchors in the source files, and the build script copies every file into an intermediate dir and fixes them for you ... might make it easier to maintain the files if we expect most links to not need an anchor, but maybe makes it harder for editors that handle rendering asciidoc (for previewing) to work properly if you try to follow a link?
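
The intermediate-directory idea could look something like the sketch below. It assumes a flat directory of .adoc files and that a page's anchor equals its base file name; `stage_sources` and `EMPTY_REF` are hypothetical names, and the checked-in sources are only ever read, never written.

```python
import re
from pathlib import Path

# An .adoc reference with an empty or missing fragment.
EMPTY_REF = re.compile(r"<<([^<>,#]+)\.adoc#?(?=,|>>)")


def _fill_anchor(match):
    path = match.group(1)
    # Assumption: the page-level anchor is the file's base name.
    return f"<<{path}.adoc#{path.rsplit('/', 1)[-1]}"


def stage_sources(src_dir, staging_dir):
    """Copy *.adoc files into staging_dir, fixing empty anchors on the copies only."""
    src_dir, staging_dir = Path(src_dir), Path(staging_dir)
    staging_dir.mkdir(parents=True, exist_ok=True)
    for source in sorted(src_dir.glob("*.adoc")):
        text = source.read_text(encoding="utf-8")
        fixed = EMPTY_REF.sub(_fill_anchor, text)
        (staging_dir / source.name).write_text(fixed, encoding="utf-8")
```

The site and PDF builds would then run against the staging directory, so an editor's in-progress changes to the original files are never overwritten.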

hossman commented 7 years ago

> It does not, however, solve any issues with unique IDs across pages, although I also did not see any examples of that being a problem as things are today (IOW, maybe we kick the can down the road on that part of the issue).

I think i remember now ... you won't see any of the anchors migrated from cwiki have this problem, because exporting them from cwiki causes it to put the page title in the anchor text -- so they are all inherently unique. but you would see this problem a lot if we were using the automatically generated asciidoctor anchors anytime multiple adoc pages had the same section name ("Examples" and "Parameters" are ones i think get used on more than a few pages)

hossman commented 7 years ago

I just pushed fixes for #37 and #38 which should address both:

the immediate conversion task
- give every existing converted page link a #frag
longer term task of being able to fail the build if someone creates/edits pages in a way that results in a link w/o a #frag or the same anchor in multiple pages.
- run ant check-links-and-anchors after ant build-site to see failures/verification
- (we can always tweak the build.xml so this happens automatically)

I think this issue can now be resolved?

ctargett commented 7 years ago

Did some spot-checks with HTML pages and the PDF and it looks good.

ctargett / refguide-asciidoc-poc

Fix inter- and intra-Document links #6