ctargett / refguide-asciidoc-poc

Proof of concept of Solr Ref Guide converted to asciidoc format & using Asciidoctor for publishing


better conversion and doc metadata navigation #4

Closed hossman closed 7 years ago

hossman commented 8 years ago

This PR builds on and supersedes PR #3

The continued theme here is improving the things that can be automated about building the refguide with asciidoctor. All improvements fall into 2 main categories:

Ongoing Usage: automating the building of navigation files used by jekyll and the PDF
One Time Usage: improving the amount of "hands free" automation involved in the one-time conversion from confluence to adoc files to minimize the amount of hand editing needed

As part of this work, I've updated the files in confluence-export/converted-asciidoc/ to show off the automated conversion code against a recent snapshot of the ref guide -- No hand editing of any kind has been done to these files

Ongoing Usage

The automated Navigation code (that we would in theory run on every site build/publish) can be seen by running ant hoss from the top level of the project. Behind the scenes this will call ant build-nav creating 2 files (which, for the sake of this POC are currently committed into GIT so they can be easily reviewed/compared between revisions to the code)...

confluence-export/converted-asciidoc/_data/pdf-main-body.adoc
confluence-export/converted-asciidoc/_data/sidebar.json

Both these files are built by some new java code that does a hierarchical walk of every *.adoc file in the ref guide, starting with apache-solr-reference-guide.adoc and looking at the metadata about which pages are "children" of each page. pdf-main-body.adoc is then created with a list of every page in the ref guide, in the correct order. sidebar.json is likewise created to automatically list every page in the guide, and preserves 2 levels of depth of the hierarchy (since that's all the jekyll theme currently supports). (Note: this same java code can/should/could in the future help validate our document structure: failing if there are orphan pages or inconsistent header levels, adding "next" links, etc...)
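As a rough illustration of the hierarchical walk described above, here is a minimal Python sketch. It is not the PR's Java code: the function names, the in-memory `children` mapping (page name to ordered child page names), and the sidebar entry shape are all assumptions made for the example.

```python
from typing import Dict, List, Tuple


def walk_pages(children: Dict[str, List[str]], root: str) -> List[Tuple[str, int]]:
    """Depth-first, pre-order walk; returns (page, depth) pairs in reading order."""
    ordered: List[Tuple[str, int]] = []
    seen = set()

    def visit(page: str, depth: int) -> None:
        # A page reachable twice would appear twice in the PDF, so fail loudly.
        if page in seen:
            raise ValueError(f"page listed more than once: {page}")
        seen.add(page)
        ordered.append((page, depth))
        for child in children.get(page, []):
            visit(child, depth + 1)

    visit(root, 0)
    return ordered


def find_orphans(children: Dict[str, List[str]], root: str) -> List[str]:
    """Pages that declare metadata but are never reached from the root."""
    reachable = {page for page, _ in walk_pages(children, root)}
    return sorted(set(children) - reachable)


def build_sidebar(children: Dict[str, List[str]], root: str, max_depth: int = 2) -> list:
    """Nested entries for the pages below root, truncated to max_depth levels."""

    def node(page: str, depth: int) -> dict:
        entry = {"page": page}
        if depth < max_depth:
            kids = [node(child, depth + 1) for child in children.get(page, [])]
            if kids:
                entry["children"] = kids
        return entry

    return [node(child, 1) for child in children.get(root, [])]
```

The pre-order list corresponds to what pdf-main-body.adoc needs (every page, in order), while `build_sidebar` drops anything deeper than two levels, mirroring the theme limitation mentioned above.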

Once these navigation files are built, pdf-main-body.adoc is included by pdf/SolrRefGuide-all.adoc (a new file I created so the existing hand edited pdf/SolrRefGuide.adoc could be preserved for comparison -- ant pdf2 still builds exactly what it used to) and build/SolrRefGuide-all.pdf is generated.

Likewise, all existing configuration, themes, and templating in jekylltest are copied into build/jekyll/, and confluence-export/converted-asciidoc/ is overlaid on top of it, to build build/jekyll/_site/

One Time Usage

A summary of some of the improvements to the confluence->asciidoc conversion included in this PR:

Better permalink filenames and fixed intra document links to use correct syntax
Preserved metadata about page children
Better image filenames, and corrected image include syntax
Better Section Headers
Improved detection/handling of Confluence TOC macro - TOC metadata now preserved at the page level
Cleaned up more empty tags
preserved "code" class for syntax highlighting

Although a snapshot of all the converted adoc files currently exists in confluence-export/converted-asciidoc/, if you wish to regenerate them with a new snapshot from confluence, a convenience target exists in the top level build.xml: ant convert-raw-confluence-exports

NOTE: This will fail if the following 2 directories do not exist:

confluence-export/raw-export (An unzipped copy of Confluence's HTML export)
confluence-export/raw-xml-export (An unzipped copy of Confluence's XML export)
- This is currently needed to capture the page hierarchy information to add to the page metadata
- We could remove the need for this by instead parsing the data from the index.html file of the HTML export (I didn't realize it was there when i started) but since it's a One Time only conversion, optimizing the human effort to run this step doesn't seem worthwhile.

ctargett commented 7 years ago

First set of notes from my current state of review. I have to look into more how to format inter/intra-document links to make sure we can do it correctly, and may have more come out of that.

I think page titles should be formatted with a single equal sign (=) before the title instead of the Markdown style of multiple equal signs in the line after the title. I think this is cleaner and more consistent with how headings are defined.
Admonitions (NOTE, TIP, WARNING, INFO) are getting dropped. Using https://github.com/hossman/refguide-asciidoc-poc/blob/b06a3439bf1a94f036d74e37eb3125ac7c24cd67/confluence-export/converted-asciidoc/realtime-get.adoc as an example, the "Note" at the bottom of the page could be converted to an Asciidoctor-style NOTE by converting the text to uppercase and adding a space after the colon (i.e., NOTE:). The text can either start on the same line, or the very next line.
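The label rewrite suggested above could look something like the following Python sketch. It is only an illustration: `to_admonition` and the label table are hypothetical names, and mapping Confluence's "Info" onto NOTE is an assumption, since Asciidoctor's built-in admonition labels are NOTE, TIP, IMPORTANT, CAUTION and WARNING.

```python
import re

# Confluence label -> Asciidoctor admonition label.
# Mapping "info" to NOTE is an assumption made for this sketch.
LABELS = {"note": "NOTE", "tip": "TIP", "warning": "WARNING", "info": "NOTE"}

_LABEL_PREFIX = re.compile(r"^(note|tip|warning|info):[ \t]*", re.IGNORECASE)


def to_admonition(paragraph: str) -> str:
    """Uppercase a leading 'Note:'-style label and ensure one space after the colon."""
    match = _LABEL_PREFIX.match(paragraph)
    if not match:
        return paragraph  # not an admonition-looking paragraph; leave untouched
    label = LABELS[match.group(1).lower()]
    body = paragraph[match.end():].lstrip()
    return f"{label}: {body}"
```

This handles only the single-paragraph case; titled or multi-paragraph admonitions would need block syntax instead.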

hossman commented 7 years ago

...I have to look into more how to format inter/intra-document links to make sure we can do it correctly, and may have more come out of that....

I've been looking into that as well -- part of the problem is that asciidoctor seems to be really finicky/confusing about how it resolves paths in includes and links and what not relative to diff things depending on diff usage ... i think I see a light at the end of the tunnel on fixing that with our PDF generation for very explicit links like <<foo.adoc#anchor_name_that_exists_in_foo,Foo>>. But a distinct HUGE problem is asciidoctor/asciidoctor#1865 -- if we don't see a positive outcome on that issue, then the only recourse i think we would have is some sort of precommit tool that complains if:

a link to another page doesn't point to a named anchor
multiple anchors exist with the same name (even in diff docs)
a link points to an implicit (section header) anchor instead of an explicitly defined anchor (since implicit section header anchors might be renamed silently when included in other files, but the links aren't renamed as well)
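To make the three checks above concrete, here is a hedged Python sketch of such a precommit check. No such tool exists in the PR; `check_links`, the regular expressions, and the `docs` mapping (file name to file text) are assumptions, and the third rule is approximated as "the fragment is not an explicitly declared `[[anchor]]` anywhere".

```python
import re
from collections import defaultdict

# Explicit block/inline anchors such as [[some-id]] or [[some-id,label]].
ANCHOR = re.compile(r"\[\[([^\],]+)(?:,[^\]]*)?\]\]")
# Inter-document xrefs such as <<foo.adoc#anchor,Text>> or <<foo.adoc,Text>>.
XREF = re.compile(r"<<([^>,#]+\.adoc)(?:#([^>,]*))?(?:,[^>]*)?>>")


def check_links(docs):
    """Return a list of human-readable problems found across all docs."""
    problems = []
    defined = defaultdict(list)
    for name, text in sorted(docs.items()):
        for anchor in ANCHOR.findall(text):
            defined[anchor].append(name)

    # Rule 2: the same anchor name defined more than once (even in different docs).
    for anchor, names in sorted(defined.items()):
        if len(names) > 1:
            problems.append(f"duplicate anchor '{anchor}' in: {', '.join(names)}")

    for name, text in sorted(docs.items()):
        for target, fragment in XREF.findall(text):
            if not fragment:
                # Rule 1: link to another page without a named anchor.
                problems.append(f"{name}: link to {target} has no named anchor")
            elif fragment not in defined:
                # Rule 3: fragment is not an explicit anchor (likely an implicit header id).
                problems.append(f"{name}: link to {target}#{fragment} is not an explicit anchor")
    return problems
```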

I think page titles should be formatted with a single equal sign (=) before the title instead of the Markdown style of multiple equal signs ...

Ah, ok ... i assumed since pandoc was still doing that for doc titles even though it's configured not to for headers that it was the "correct" way to do doctitles ... that should be easy to fix in post-processing.
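A post-processing step along those lines might be sketched as follows. This is an assumption-laden illustration, not the PR's code: `fix_doc_title` is a hypothetical name, and the sketch assumes the underlined title is the first non-blank line of the converted file.

```python
import re


def fix_doc_title(adoc: str) -> str:
    """Rewrite a leading 'TITLE' + '=====' underline pair as '= TITLE'."""
    lines = adoc.split("\n")
    # Index of the first non-blank line, where the doc title is assumed to be.
    first = next((i for i, line in enumerate(lines) if line.strip()), None)
    if first is None or first + 1 >= len(lines):
        return adoc
    title = lines[first].strip()
    underline = lines[first + 1].strip()
    # Only touch titles that are followed by a run of '=' and are not already '= Title'.
    if re.fullmatch(r"={2,}", underline) and not title.startswith("="):
        lines[first:first + 2] = ["= " + title]
    return "\n".join(lines)
```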

Admonitions (NOTE, TIP, WARNING, INFO) are getting dropped.

yeah ... it's on my list of stuff to look into more ... one of the things i wasn't sure how to deal with is that when we use these in confluence, sometimes they have "titles" and the bodies can span multiple lines (paragraphs) ... i wanted to read up on the equivalent asciidoc syntax to see how much of that could be preserved.

hossman commented 7 years ago

NOTE: I updated the PDF generation configs in 7f24fc3 to work around asciidoctor/asciidoctor#1866 and pushed to this PR branch.

This fixes some of the intra-doc links in the final PDF, in the specific cases where an explicit anchor was used.

For example, with the current adoc files: PDF page 340, in "Velocity Search UI" section, this link from velocity-search-ui.adoc...

For more information about the Velocity Response Writer, see the <<response-writers.adoc#ResponseWriters-VelocityResponseWriter,Response Writer page>>.

...now correctly takes you to page #560 where that (explicitly & uniquely) named anchor is defined.

ctargett commented 7 years ago

Some more stuff to try to fix, with document names to be able to see what I mean:

tables have huge === blocks (collections-api.adoc), but asciidoctor standard should only be 3 (===) at top & bottom. This isn't a huge deal, and doesn't cause them to break, but it's more standardized and cleaner.
Some inline code is getting spaces around backticks (collections-api.adoc). This causes them not to be monospaced properly.
In some cases, an anchor is being inserted between the heading indicator (==) and the heading. This causes it to not be linked properly from the on-page TOC (collections-api.adoc, Add a Replica section).
TOC added 2x in some cases. By default it's on every page, but some converted pages from Confluence already had TOCs and that's carried through conversion (uploading-data-with-index-handlers.adoc). I think we can just remove all Confluence TOCs, they were never very helpful in the PDF so probably won't be missed. WDYT?
[source,js] leads to weird formatting of those code blocks (basic-authentication.adoc). Replace all instances with [source,json] since there aren't any JavaScript code blocks in the ref guide and the source highlighter has a JSON lexer. Some are curl commands, but will look fine with JSON formatting (I checked it).
anchors with parentheses () in them (common-query-parameters.adoc, The fl (Field List) Parameter) don't work and make the heading not appear as a heading. This might happen only on one page, and we could just fix the headings, but it might be a general issue with other types of special characters.

ctargett commented 7 years ago

For the issue with the parentheses, I found another page other-schema-elements.adoc where there is an & in the anchor and it is breaking in the same way. So, it's not just (), but also &, and I would expect most of the other special characters.

edit: the-standard-query-parser.adoc has another example with a caret ^.

ctargett commented 7 years ago

Ordered lists (numbered), appear to be converting incorrectly.

https://cwiki.apache.org/confluence/display/solr/Schema+Factory+Definition+in+SolrConfig has a numbered list at the bottom of the page. Item 2 has 2 sub-items a & b. This is being converted as this in schema-factory-definition-in-solrconfig.adoc:

1 2 1 2 3

And in the HTML/PDF, it's coming out as 1-5 (and an error is thrown about list item out of sequence). Those 2 sub-items are lost as sub-items.

The correct syntax for these lists is to start the line with a dot (.), and nested items with multiple dots (.. for two, etc.).

cross-data-center-replication.adoc (around lines 705, 718, 739) also gives a good demonstration of the same problem.
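For reference, the dot syntax described above can be generated from a nested structure with a few lines of Python. This is a hypothetical sketch (the `render_ordered` name and the nested-list input shape are assumptions), not something the conversion code currently does.

```python
def render_ordered(items, depth=1):
    """Render a nested Python list as AsciiDoc ordered-list lines.

    A str is a list item; a nested list holds the sub-items of the item before it.
    """
    lines = []
    for item in items:
        if isinstance(item, list):
            # Sub-items get one more leading dot than their parent level.
            lines.extend(render_ordered(item, depth + 1))
        else:
            lines.append("." * depth + " " + item)
    return lines
```

For a list shaped like the one on the Schema Factory page (two top-level items, two sub-items under the second, then more top-level items), the sub-items come out prefixed with `..` and stay nested under item 2.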

hossman commented 7 years ago

(I haven't forgotten the other stuff i said i'd work on, i promise, but here's some comments regarding the latest stuff cassandra asked about)

tables have huge === blocks...

pandoc is doing that, but it should be trivial to cleanup in the one time post processing script.
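A cleanup like that could be as small as the following Python sketch; the function name and regular expression are assumptions for illustration, not the actual post-processing script.

```python
import re

# A table delimiter line made of '|' followed by four or more '='.
_LONG_TABLE_DELIM = re.compile(r"^\|={4,}[ \t]*$", re.MULTILINE)


def shorten_table_delimiters(adoc: str) -> str:
    """Collapse over-long table delimiters such as '|=======' to '|==='."""
    return _LONG_TABLE_DELIM.sub("|===", adoc)
```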

Some inline code is getting spaces around backticks...

that should be easy to cleanup in the one time java code that produces the HTML.

In some cases, an anchor is being inserted between the heading indicator...

That looks like it's coming from some really weird raw HTML? probably due to some really crazy extra anchor formatting in the original confluence pages? ... i'll need to look into that more to figure out what's happening and how widespread it is.

TOC added 2x in some cases. By default it's on every page, ... I think we can just remove all Confluence TOCs, they were never very helpful in the PDF so probably won't be missed. WDYT?

Hmm, hold on a sec -- Lemme just clarify: right now the TOC is not on every page -- it's only on pages that have a :toc: header. When that's present in the adoc header, then the jekyll generated HTML file gets a Table of Contents, but the corresponding PDF section does not get any sort of section specific TOC.

that's what you are seeing, correct?

The 2x TOC is coming from a misunderstanding i had, i thought that with the :toc: header you would get a TOC at the top of the page, and that if you wanted to override where the TOC was, you could use the toc::[] macro to indicate where the TOC would exist -- but either way you had to have the :toc: header. Apparently I was incorrect, and using both the :toc: header and toc::[] macro results in two copies of the TOC.

What i don't understand is why the TOC generated by the toc::[] is showing up at the top of the jekyll HTML pages instead of the place where the macro exists in the body of the page? in uploading-data-with-index-handlers.adoc the macro is after several intro paragraphs, and yet in uploading-data-with-index-handlers.html it appears at the top of the doc -- even if I remove the :toc: header so only one TOC appears on the page, it's still at the top of the HTML file.

In any case, it may not matter if we want to remove the inline TOCs anyway...

I agree with you that the section TOCs don't really make much sense in the PDF (and i think most of them are currently suppressed in the confluence generated PDF due to CSS?). Personally I kind of liked having the "inline TOC" come after the intro paragraphs when browsing the confluence HTML -- but I have no objection to removing them as part of the one time conversion if you think that's better in the long run. One plus of removing them completely is that it keeps the HTML and the PDF consistent: no risk of text in the PDF that makes no sense w/o the TOC being there (ex: "Parameters covered in this section" in https://cwiki.apache.org/confluence/display/solr/IndexConfig+in+SolrConfig which the one time html conversion code already throws away when looking for the Confluence TOC macro today).

I'll go ahead and remove all the toc::[] macros and just leave the page header :toc: macros

...Replace all instances with [source,json]...

yup, yup ... that'll be a trivial fix in the one time java code that produces the HTML.

anchors with parentheses () in them ... it might be a general issue with other types of special characters.

Yeah ... this might take a while. I'm going to need to bone up on what the asciidoc spec says about legal characters in anchors -- my initial skim of the user manual was that all utf-8 chars were legal, but apparently not

Ordered lists (numbered), appear to be converting incorrectly. ...

This appears to be a bug in pandoc's adoc output formatting. Off the top of my head I can't think of a particularly easy fix/workaround we can do in either the HTML cleanup or adoc post processing to fix this.

Our best bet is probably to check for nested lists in the HTML cleanup code, when the docs still retain as much of the original structure as possible, and have it log/record anytime it sees a doc with a nested list so we have a checklist of stuff we know needs manual audit/cleanup.

ctargett commented 7 years ago

The 2x TOC is coming from a misunderstanding i had, i thought that with the :toc: header you would get a TOC at the top of the page, and that if you wanted to override where the TOC was, you could use the toc::[] macro to indicate where the TOC would exist -- but either way you had to have the :toc: header. Apparently I was incorrect, and using both the :toc: header and toc::[] macro results in two copies of the TOC.

The Jekyll layout is putting the first TOC on the page (see https://github.com/ctargett/refguide-asciidoc-poc/blob/master/jekylltest/_layouts/page.html#L37). Then when there is one already in the Confluence page, you're adding the :toc: and toc::[] macros. It doesn't process both of those macros, but it's processing the layout and the page instructions, that's how there are 2 of them.

Because the Jekyll layout is inserting the TOC, every page has it from its template (unless it has :page-toc: false - see line 4 in one of my sample pages, https://github.com/ctargett/refguide-asciidoc-poc/blob/master/jekylltest/refguide/CharFilterFactories.adoc). Then the pages where it existed in Confluence are getting it added again. Sort of an accident of layers of processing.

In this case, what I'm suggesting is to drop all the :toc: and toc::[] entries added from the conversion entirely and only use the TOC that comes from the template. We can later decide on a case-by-case basis to override that (with :page-toc: false) and decide where to put the TOC on the page where/if it makes sense to do so.

hossman commented 7 years ago

Latest update to the PR branch addresses all of the following issues...

Stop including :toc: header and toc::[] macros in pages, trust the jekyll presentation to do the best thing
Use '= TITLE' syntax instead of excessively verbose 'TITLE\n======....'
Use '|===' syntax for tables instead of excessively verbose '|======....'
use 'json' instead of 'js' in source tags, and override with 'bash' if code starts with a 'curl' command
- cassandra mentioned that 'json' should look fine for the curl commands, but when I tried that the syntax highlighter was actually putting red boxes around anything that wasn't legal json (ie: unquoted string literals) so I went ahead and added the extra check for 'curl' and treated those as 'bash'
- the [source,bash] highlighter doesn't do any nice highlighting of the JSON payload to the curl arguments, but it's better than the plain text you get from [source] w/o any language declaration
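The 'js' to 'json'/'bash' rule from the list above can be sketched in Python as below. The function names are hypothetical, and the sketch assumes the 'curl' override applies only to blocks that were originally tagged 'js', which the comment above does not spell out.

```python
def source_language(declared: str, code: str) -> str:
    """Pick the highlighter language for a converted code block."""
    if declared != "js":
        return declared  # assumption: other declared languages are left alone
    # Blocks whose first token is 'curl' are shell commands, not JSON.
    if code.split()[:1] == ["curl"]:
        return "bash"
    return "json"


def source_attribute(declared: str, code: str) -> str:
    """Build the AsciiDoc block attribute line, e.g. '[source,json]'."""
    return f"[source,{source_language(declared, code)}]"
```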

hossman commented 7 years ago

In some cases, an anchor is being inserted between the heading indicator...

on one hand, this is the confluence UI being stupid. In some cases, when people have used the confluence "anchor" macro (i think that's what it's called) to add an anchor with a specific name (like "api2" on the "Collections API" page) confluence has decided to put it inside the header, instead of around/before the header.

a fun side effect of this is that confluence is including these anchor names in the anchor names it generates automatically for the headers themselves, so confluence is taking stuff like <h1><anchor-macro id="api2">Reload a Collection</h1> and outputting <h1 id="CollectionsAPI-api2ReloadaCollection"><span id="api2"></span>Reload a Collection</h1>

on the other hand, according to the asciidoc spec, this is allowed: http://asciidoctor.org/docs/user-manual/#anchordef

if you look at the HTML jekyll generated for the == header, it does what it seems like it should do: defines both anchors (which is what we had in confluence)

the problem seems to just be some naive parsing of the header in the javascript that builds the TOC in our theme

I wonder if this is also the root of the problem you mentioned with parens and ampersands?

I experimented with a workaround for this that would move the extra anchor out of the header, but I'm not really happy with the results...

current output:

[[CollectionsAPI-api2ReloadaCollection]]
== [[CollectionsAPI-api2]] Reload a Collection

experimental output:

[[CollectionsAPI-api2]]
[[CollectionsAPI-api2ReloadaCollection]]
== Reload a Collection

The problem is if we do this, asciidoctor just plain ignores the first anchor in its html output. Evidently since there's no "content" associated with that anchor. I tried including a // adoc comment after it and that didn't change anything -- only when some significant text (even just an X) is in between the 2 anchors will it keep both of them. The existing format (with the anchor inside the header) actually works and preserves both anchors -- the first because it's followed by the "header", and second because it's followed by the "header text"

My concern with this is that if we don't keep both anchors, we might be breaking some existing links ... it's hard to tell at this point because so much is already out of whack with inter/intra document links.

It might be easier/better to just fix the javascript that generates the toc ... what do you think cassandra?

hossman commented 7 years ago

I wonder if this is also the root of the problem you mentioned with parens and ampersands?

That is definitely a distinct problem: asciidoctor/asciidoctor/issues/1873 I'll think about possible workarounds we can do - my best idea so far is rewriting anchor Ids during link conversion using similar rules to how we deal with confluence pageId->file-name.adoc conversion, except that opens a huge can of worms i'd rather avoid if i can think of something simpler.

hossman commented 7 years ago

Another update, addressing 2 big problems with anchors, headers, and links...

In some cases, an anchor is being inserted between the heading indicator...

... on the other hand, according to the asciidoc spec, this is allowed... ... I experimented with a workaround for this that would move the extra anchor out of the header, but I'm not really happy with the results... ... The problem is if we do this, asciidoctor just plain ignores the first anchor in its html output. Evidently since there's no "content" associated with that anchor.

I filed asciidoctor/asciidoctor/issues/1874 and asciidoctor/asciidoctor/issues/1875 regarding these issues in general, and revisited my workaround (which is included in the latest PR) ... as things stand now: if someone used a confluence macro to explicitly define a named anchor inside a header, I rewrite the HTML so the id from that macro becomes the id of the header AND I move the old (confluence assigned header id) into an anchor declaration before the header. The end result is that both are in the adoc file, but the one a user explicitly picked gets used by jekyll, and the other one is (currently) ignored, but still in the adoc files so we can grep for it if we find a broken link and aren't sure where it's supposed to be pointing.

Example of previous adoc:

[[CollectionsAPI-api11ClusterProperties]]
== [[CollectionsAPI-api11]]Cluster Properties

Example of new adoc:

[[CollectionsAPI-api11ClusterProperties]]

[[CollectionsAPI-api11]]
== Cluster Properties

That is definitely a distinct problem: asciidoctor/asciidoctor#1873 I'll think about possible workarounds we can do - my best idea so far is rewriting anchor Ids during link conversion using similar rules to how we deal with confluence pageId->file-name.adoc conversion, except that opens a huge can of worms i'd rather avoid if i can think of something simpler.

I couldn't think of any better ways of dealing with this problem, so for now I've implemented code to rewrite the anchors & URL fragments that link to them so any "special" characters are converted to underscore. Since this means we lose information about the original anchor name, I also included an adoc comment when this is done, so if there's a problem it's easy to grep for old anchor names -- and likewise it will be easy to search/remove these comments if we don't need/want them later on...

Example of previous adoc:

|<<CommonQueryParameters-Thefq(FilterQuery)Parameter,fq>> |Applies a filter query to the search results.

...

[[CommonQueryParameters-Thefq(FilterQuery)Parameter]]
== The `fq` (Filter Query) Parameter

Example of new adoc...

|<<CommonQueryParameters-Thefq_FilterQuery_Parameter,fq>> |Applies a filter query to the search results.

...

// OLD_CONFLUENCE_ID: CommonQueryParameters-Thefq(FilterQuery)Parameter

[[CommonQueryParameters-Thefq_FilterQuery_Parameter]]
== The `fq` (Filter Query) Parameter
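The rewrite shown in the examples above can be approximated with the following Python sketch. It is not the PR's Java implementation: the function names and regular expressions are assumptions, and treating everything outside `[A-Za-z0-9_-]` as a "special" character is a guess at the rule that happens to reproduce the example.

```python
import re

_UNSAFE = re.compile(r"[^A-Za-z0-9_-]")          # assumed definition of "special"
_ANCHOR_LINE = re.compile(r"^\[\[(.+)\]\]$")     # a line that is only [[some-id]]
_XREF = re.compile(r"<<([^,>]+)(,[^>]*)?>>")     # <<target,label>> or <<target>>


def sanitize_id(anchor_id: str) -> str:
    """Replace every 'special' character in an anchor id with an underscore."""
    return _UNSAFE.sub("_", anchor_id)


def rewrite_line(line: str) -> str:
    """Sanitize an anchor declaration or the xref fragments on one adoc line."""
    match = _ANCHOR_LINE.match(line)
    if match:
        old = match.group(1)
        new = sanitize_id(old)
        if new == old:
            return line
        # Keep the original id in a comment so it stays greppable.
        return f"// OLD_CONFLUENCE_ID: {old}\n\n[[{new}]]"

    def fix_xref(m):
        target, label = m.group(1), m.group(2) or ""
        page, sep, fragment = target.rpartition("#")
        if not sep and target.endswith(".adoc"):
            return m.group(0)  # page-only link: nothing to sanitize
        return f"<<{page}{sep}{sanitize_id(fragment)}{label}>>"

    return _XREF.sub(fix_xref, line)
```

Because the declaration and the links go through the same `sanitize_id`, they stay in sync; the cost is that two different original ids can collapse to the same sanitized id, which a duplicate-anchor check would need to catch.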

This should resolve the majority of the linking problems in both the jekyll and pdf builds, but it still doesn't do anything to address inter-doc links that don't use explicit anchors and how that goes to the top of the PDF when all the docs are merged together instead of the top of the appropriate section or any of the other impacts of asciidoctor/asciidoctor/issues/1865 ... i'm still thinking about potential work arounds for that.