tdammers opened this issue 5 years ago
Addressed the issues as follows, so far:

- `=`'s at the end of a header are now skipped. This isn't 100% to spec (the number of closing `=`'s is supposed to match the opening run), but it is good enough for our use case.

Additionally, I have made it so that the original Trac source is committed to the wiki alongside the converted Markdown; this doesn't seem to affect the visual presentation, but it does allow getting the raw Trac source back out by simply adding `.trac` to the URL. This is useful both for checking the conversion output and in case we need to convert failed pages manually or semi-automatically later.
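For reference, the trailing-`=` handling amounts to something like the following (a simplified sketch, not the actual parser code; the real converter works with parser combinators rather than plain `Text` munging):

```haskell
import qualified Data.Text as T

-- Drop a trailing run of '='s, plus the whitespace around it, from a header
-- line. To be fully to spec the number of closing '='s would have to match
-- the opening run; here we simply discard whatever is there.
stripTrailingEquals :: T.Text -> T.Text
stripTrailingEquals =
  T.stripEnd . T.dropWhileEnd (== '=') . T.stripEnd
```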
Sounds quite reasonable to me!
So it turns out that list parsing is completely and utterly broken; but this is actually really just a symptom of messy block separation in general.
I took a stab at fixing this, but ended up just breaking things elsewhere; I'm not sure I can come up with a better parser without changing the approach more drastically.
Part of the problem is that Trac syntax is extremely lenient and full of ambiguity. For example, list items can continue on subsequent lines, and the indentation determines where a list item ends; however, the indentation doesn't have to match exactly, it's enough to be within bounds (from the left edge of the list item marker (`-`) up to, I think, two additional spaces). Further, lists can nest, and they can contain other block-level items. This is really tricky to capture correctly in a single-pass, character-based parser combinator.
So what I was thinking is that a better approach might be a two-stage parser: the first stage just groups lines into what belongs to the same block, and the second pass then parses those blocks one by one. This would be quite the operation though, so I don't know if we should go down that path.
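To make that concrete, here is a very rough sketch of what stage one could look like (the `RawBlock` type, the `-`/`*` marker rule, and the two-space tolerance are all assumptions made up for this sketch; nested lists and blocks inside items are ignored entirely):

```haskell
import qualified Data.Text as T

data RawBlock
  = RawListItem Int [T.Text]  -- indentation of the marker, plus its lines
  | RawPara [T.Text]
  deriving (Show)

indentOf :: T.Text -> Int
indentOf = T.length . T.takeWhile (== ' ')

isListMarker :: T.Text -> Bool
isListMarker t = case T.uncons (T.stripStart t) of
  Just (c, rest) -> (c == '-' || c == '*') && T.isPrefixOf (T.pack " ") rest
  Nothing        -> False

-- Stage one: group physical lines into raw blocks. A line continues the
-- current list item if it is indented anywhere between the marker column
-- and roughly two spaces past it; the exact bounds are a guess.
groupLines :: [T.Text] -> [RawBlock]
groupLines [] = []
groupLines (l:ls)
  | isListMarker l =
      let col = indentOf l
          continues x = not (T.null (T.strip x))
                     && not (isListMarker x)
                     && indentOf x >= col
                     && indentOf x <= col + 2
          (body, rest) = span continues ls
      in RawListItem col (l : body) : groupLines rest
  | T.null (T.strip l) = groupLines ls
  | otherwise =
      let (body, rest) = break (\x -> T.null (T.strip x) || isListMarker x) ls
      in RawPara (l : body) : groupLines rest
```

Stage two would then run a much simpler parser over each `RawBlock` in isolation, recursing into list item bodies as needed.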
The other approach I thought of was to take the original Trac parser and more or less port it to Haskell verbatim. It's not going to be pretty, and it involves reading ~~PHP~~ Python, but at least it would be based on an actual authoritative source. This too would be a rather massive effort though.
Some more investigation reveals:
First attempt at approach 3 (go through Trac, extract and parse HTML).
Using Pandoc's HTML parser with default options, an extremely dumb extractor that just drops everything up to the first `<h1>` and then also drops the last 6 blocks (which happens to be the length of the Trac footer), and Pandoc's default Markdown writer, I can convert https://ghc.haskell.org/trac/ghc/wiki/WikiFormatting?version=7 into https://gist.github.com/tdammers/40ba8a27d81b9fac046a6a89209e17fd.
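The experiment boils down to roughly this (a sketch against the pandoc library API; the input file name is a placeholder for the page HTML fetched beforehand, and the block-dropping numbers are specific to Trac's page layout):

```haskell
import qualified Data.Text.IO as TIO
import Text.Pandoc

main :: IO ()
main = do
  html <- TIO.readFile "WikiFormatting.html"
  md <- runIOorExplode $ do
    doc <- readHtml def html
    let Pandoc meta blocks = doc
        isH1 (Header 1 _ _) = True
        isH1 _              = False
        dropEnd n xs        = take (length xs - n) xs
        -- keep everything from the first <h1> on, minus the 6 footer blocks
        body = dropEnd 6 (dropWhile (not . isH1) blocks)
    writeMarkdown def (Pandoc meta body)
  TIO.writeFile "WikiFormatting.md" md
```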
Some observations:
- Wiki links, ticket links, and external links all come out as plain URLs; the original Trac link markup is gone.
- The reader flattens `<div>` structure, so we lose nesting, and identifying the content `<div>` by class name is not going to work.
- Tables are not converted; the output just contains `[TABLE]` instead.

The first one is probably not going to be a problem, because we can unambiguously identify link targets as wiki, ticket, or external based on the URL alone.
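Roughly like this (a throwaway sketch; the path prefixes and type are assumptions about the Trac instance layout, not tested code):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T

data LinkTarget
  = WikiLink T.Text      -- wiki page name
  | TicketLink Int       -- ticket number
  | ExternalLink T.Text  -- anything else, kept verbatim
  deriving (Show)

-- Classify a resolved Trac URL, e.g. /trac/ghc/wiki/Foo or
-- https://ghc.haskell.org/trac/ghc/ticket/1234.
classifyLink :: T.Text -> LinkTarget
classifyLink url
  | Just page <- T.stripPrefix "/trac/ghc/wiki/" path
  = WikiLink page
  | Just num <- T.stripPrefix "/trac/ghc/ticket/" path
  , [(n, _)] <- reads (T.unpack num)
  = TicketLink n
  | otherwise
  = ExternalLink url
  where
    -- strip scheme and host if present, keeping just the path
    path = case T.breakOn "/trac/" url of
             (_, rest) | not (T.null rest) -> rest
             _                             -> url
```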
The second one is not a problem, I think; we don't really want any deep nesting in the markdown anyway.
The third one is unfortunate, and I would say it rules out the quick HTML -> Pandoc -> light processing -> Markdown route. Other options to investigate:
The scraping approach looks promising.
https://github.com/bgamari/trac-to-remarkup/blob/wip/tobias-scrape/src/Trac/Scraper.hs
This one already produces fairly reasonable output from Trac HTML. Missing bits:
- Definition lists (`<dl>`)
- `?query=...` links; since gitlab uses a completely different URL syntax for things like ticket queries, we cannot easily reuse them. Instead we will leave them unchanged, but emit some log output that will make it easy to generate a list of wiki pages that require manual attention.
- Nested tables are not supported yet, and probably won't be, but we might want to make the fallback ever-so-slightly nicer: our AST can actually represent them, but the markdown formatter makes a malformed mess if we try to render them.
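For the query links, the handling described above amounts to something like this (function name and log format are made up for illustration):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Text as T
import qualified Data.Text.IO as TIO
import System.IO (stderr)

-- Leave Trac query links untouched, but log them so we can later generate
-- a list of wiki pages that need manual attention.
convertLink :: T.Text      -- name of the wiki page being converted
            -> T.Text      -- link target taken from the Trac HTML
            -> IO T.Text   -- target to emit into the Markdown
convertLink page url
  | "?query=" `T.isInfixOf` url = do
      TIO.hPutStrLn stderr $
        T.concat ["NEEDS ATTENTION: query link on ", page, ": ", url]
      pure url               -- leave the link unchanged
  | otherwise = pure url     -- (conversion of other link kinds elided)
```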
The trac parser has some issues:

- `=`'s at the end of a header are not ignored but parsed as text.