ctargett / refguide-asciidoc-poc

Proof of concept of Solr Ref Guide converted to asciidoc format & using Asciidoctor for publishing

Should we normalize header levels as part of the conversion? #34

Open hossman opened 7 years ago

hossman commented 7 years ago

Unlike Confluence, which was happy to let us start a page with an h3, or have pages whose h2 section headings are followed directly by h4 subsections, Asciidoctor generally frowns on this and emits lots of warnings because of it.

If we want to try to clean this up, then doing it in the ScrapeConfluence.java HTML cleanup code we already run on the cwiki export would probably be the most straightforward place to do it ... otherwise I think we'd have to manually clean up the adoc files (some creative grepping would at least let us quickly scan files visually looking for discrepancies).
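For the manual route, the "creative grepping" could look roughly like the sketch below. It is only an illustration: the class and method names are made up, and it assumes that section titles in the adoc files are lines starting with one to six `=` characters followed by a space, and that a title more than one level deeper than the previous title is the kind of thing worth a visual check. It does not skip listing blocks, so a line that looks like a title inside a code sample would be reported as a false positive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the existing conversion code.
final class AdocHeadingAudit {
    // One to six '=' characters, whitespace, then title text.
    // A bare "====" block delimiter does not match because it has no title text.
    private static final Pattern TITLE = Pattern.compile("^(={1,6})\\s+\\S.*$");

    // Returns the 1-based line numbers of titles that are more than one
    // level deeper than the title before them (or than the top of the file).
    static List<Integer> skippedLevelLines(List<String> lines) {
        List<Integer> suspects = new ArrayList<>();
        int previousDepth = 0; // nothing seen yet
        for (int i = 0; i < lines.size(); i++) {
            Matcher m = TITLE.matcher(lines.get(i));
            if (!m.matches()) {
                continue; // not a section title
            }
            int depth = m.group(1).length(); // number of '=' characters
            if (depth > previousDepth + 1) {
                suspects.add(i + 1);
            }
            previousDepth = depth;
        }
        return suspects;
    }
}
```

Feeding it the lines of one file (for example via `java.nio.file.Files.readAllLines`) gives a short list of line numbers to inspect by hand.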

If we want to do this in conversion code, then what I think would work pretty easily is something like the following pseudocode...

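// Java sketch using jsoup; "doc" is assumed to be the parsed Document for one page
int lastHeaderLevelUsed = 0;
for (int i = 1; i <= 6; i++) {  // HTML defines h1 through h6 only
  Elements headers = doc.getElementsByTag("h" + i);
  if (headers.isEmpty()) continue;  // this level is unused on the page, so skip it
  lastHeaderLevelUsed++;
  for (Element h : headers) {
    h.tagName("h" + lastHeaderLevelUsed);  // rename in place to the next free level
  }
}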

(NOTE: there might be an off-by-one error there; I can't remember whether the html->adoc conversion assumes/expects that we won't use any "h1" tags in the body of pages)
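To make the renumbering idea concrete without depending on jsoup, here is a minimal sketch that works on the list of header levels found on one page. The class name, method name, and the `firstLevel` parameter are all hypothetical; `firstLevel` exists only so the open question about "h1" can be handled either way (pass 1 if body headings may start at h1, 2 if h1 must stay reserved).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the level-compaction idea; not the actual conversion code.
final class HeaderLevels {
    // Maps the header levels used on a page onto consecutive levels
    // starting at firstLevel, preserving their relative order.
    static List<Integer> normalize(List<Integer> levels, int firstLevel) {
        boolean[] used = new boolean[7]; // indexes 1..6 correspond to h1..h6
        for (int level : levels) {
            if (level < 1 || level > 6) {
                throw new IllegalArgumentException("not an HTML header level: " + level);
            }
            used[level] = true;
        }
        int[] remap = new int[7];
        int next = firstLevel;
        for (int i = 1; i <= 6; i++) {
            if (used[i]) {
                remap[i] = next++; // each level in use takes the next free slot
            }
        }
        List<Integer> result = new ArrayList<>(levels.size());
        for (int level : levels) {
            result.add(remap[level]);
        }
        return result;
    }

    public static void main(String[] args) {
        // A page that starts at h3 and jumps straight to h5.
        System.out.println(normalize(List.of(3, 5, 5, 3, 6), 1)); // prints [1, 2, 2, 1, 3]
    }
}
```

Because the mapping is computed once per page, this only closes gaps that exist across the whole page. A page that uses h2, h3, and h4 somewhere but jumps from h2 to h4 in one particular section would come out unchanged and still need a manual look.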

ctargett commented 7 years ago

I feel like we'd have to look at them all anyway to be sure we got them right. There is a short list of pages with this problem - they are output from the ant build-jekyll target as warnings. A quick count shows maybe 15 pages?

hossman commented 7 years ago

Yeah, maybe manual audit/cleanup is easiest ... I just wanted to point out there is a (fairly straightforward) automated solution to this problem we could consider.