bgamari / trac-to-remarkup

Moved to GitLab: https://gitlab.haskell.org/bgamari/trac-to-remarkup

Trac parser problems #18

Open tdammers opened 5 years ago

tdammers commented 5 years ago

The Trac parser has some issues:

tdammers commented 5 years ago

So far, I have addressed the issues as follows:

Additionally, I have arranged for the original Trac source to be committed to the wiki alongside the converted Markdown. This doesn't seem to affect the visual presentation, but it does allow getting the raw Trac source back out by simply appending .trac to the URL. This is useful both for checking the conversion output and in case we need to convert failed pages manually or semi-automatically later.
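For illustration, a minimal sketch of that idea, assuming a plain file-per-page wiki checkout; the function name and file layout here are mine, not the importer's actual code:

```haskell
import Data.Text (Text)
import qualified Data.Text.IO as T
import System.FilePath ((<.>), (</>))

-- Hypothetical helper: write the converted Markdown and the raw Trac
-- source side by side, so the original can later be retrieved by
-- appending .trac to the page URL.
writeWikiPage :: FilePath -> FilePath -> Text -> Text -> IO ()
writeWikiPage wikiDir pageName markdown tracSource = do
  T.writeFile (wikiDir </> pageName <.> "md")   markdown
  T.writeFile (wikiDir </> pageName <.> "trac") tracSource
```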

bgamari commented 5 years ago

Sounds quite reasonable to me!

tdammers commented 5 years ago

So it turns out that list parsing is completely and utterly broken; but this is really just a symptom of messy block separation in general.

I took a stab at fixing this, but ended up just breaking things elsewhere; I'm not sure I can come up with a better parser without changing the approach more drastically.

Part of the problem is that Trac syntax is extremely lenient and full of ambiguity. For example, list items can continue on subsequent lines, and the indentation determines where a list item ends; however, the indentation doesn't have to match exactly, it only has to be within bounds (anywhere from the left edge of the list item marker (-) up to, I think, two additional spaces). Further, lists can nest, and they can contain other block-level items. This is really tricky to capture correctly with a single-pass, character-based parser combinator.

So what I was thinking is that a better approach might be a two-stage parser: the first stage just groups lines into what belongs to the same block, and the second stage then parses those blocks one by one. This would be quite the operation though, so I don't know if we should go down that path.
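A rough sketch of what I mean, with a deliberately simplistic grouping rule (this is just the shape of the idea, not project code):

```haskell
import Data.Char (isSpace)

-- Stage 1: a block is a non-indented line plus every indented line that
-- immediately follows it. (Real Trac continuation rules are far more
-- lenient, e.g. the "within a couple of extra spaces" tolerance above.)
groupBlocks :: [String] -> [[String]]
groupBlocks []       = []
groupBlocks (l : ls) = (l : cont) : groupBlocks rest
  where
    (cont, rest)     = span isContinuation ls
    isContinuation x = not (null x) && isSpace (head x)

-- Stage 2: each grouped block can then be handed to a dedicated parser
-- (list item, table, paragraph, ...) without having to decide where the
-- block ends; here we just rejoin the lines as a placeholder.
parseBlock :: [String] -> String
parseBlock = unwords . map (dropWhile isSpace)
```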

The other approach I thought of was to take the original Trac parser and more or less port it to Haskell verbatim - it's not going to be pretty, and it involves reading ~PHP~ Python, but at least it would be based on an actual authoritative source. This too would be a rather massive effort though.

tdammers commented 5 years ago

Some more investigation reveals:

tdammers commented 5 years ago

First attempt at approach 3 (go through Trac, extract and parse HTML).

Using Pandoc's HTML parser with default options, an extremely dumb extractor that just drops everything up to the first `<h1>` and then also drops the last 6 blocks (which happens to be the length of the Trac footer), and Pandoc's default Markdown writer, I can convert https://ghc.haskell.org/trac/ghc/wiki/WikiFormatting?version=7 into https://gist.github.com/tdammers/40ba8a27d81b9fac046a6a89209e17fd.
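Roughly, that pipeline looks like this (a sketch; the local file handling and the exact cut-off logic are assumptions, not the code I actually ran):

```haskell
import qualified Data.Text.IO as T
import Text.Pandoc

-- Keep only the blocks between the first <h1> and the Trac footer,
-- which is assumed here to always be the last 6 blocks.
extractPayload :: Pandoc -> Pandoc
extractPayload (Pandoc meta blocks) = Pandoc meta payload
  where
    payload             = dropFooter (dropWhile (not . isH1) blocks)
    isH1 (Header 1 _ _) = True
    isH1 _              = False
    dropFooter bs       = take (length bs - 6) bs

main :: IO ()
main = do
  html <- T.readFile "WikiFormatting.html"   -- hypothetical local dump of the page
  md   <- runIOorExplode $ do
    doc <- readHtml def html
    writeMarkdown def (extractPayload doc)
  T.putStrLn md
```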

Some observations:

  1. Pandoc does not retain most class names, which means extracting the payload `<div>` by class name is not going to work.
  2. In fact, Pandoc flattens most of the HTML structure, retaining nesting only for lists.
  3. Pandoc can't handle complex table layouts (especially tables nested inside table cells), and just prints `[TABLE]` instead.

The first one is probably not going to be a problem, because we can unambiguously identify link targets as wiki, ticket, or external based on the URL alone.
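Something along these lines should be enough; the URL prefixes and constructor names below are hypothetical, not the converter's actual types:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Text (Text)
import qualified Data.Text as T

data LinkKind = WikiLink | TicketLink | ExternalLink
  deriving (Show, Eq)

-- Classify a link purely by its URL, without relying on class names.
classifyLink :: Text -> LinkKind
classifyLink url
  | "/trac/ghc/wiki/"   `T.isInfixOf` url = WikiLink
  | "/trac/ghc/ticket/" `T.isInfixOf` url = TicketLink
  | otherwise                             = ExternalLink
```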

The second one is not a problem, I think; we don't really want any deep nesting in the Markdown anyway.

The third one is unfortunate, and I would say it rules out the quick HTML -> Pandoc -> light processing -> Markdown route. Other options to investigate:

tdammers commented 5 years ago

The scraping approach looks promising.

https://github.com/bgamari/trac-to-remarkup/blob/wip/tobias-scrape/src/Trac/Scraper.hs

This one already produces fairly reasonable output from Trac HTML. Missing bits:

Nested tables are not supported yet, and probably won't be, but we might want to make the fallback ever so slightly nicer: our AST can actually represent them, but the Markdown formatter produces a malformed mess if we try to render them.
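One possible shape for a nicer fallback, using an entirely hypothetical, simplified AST rather than our real one: render any table that contains a nested table as inline HTML (which GitLab Markdown accepts) instead of pipe syntax, which cannot express nesting.

```haskell
import Data.List (intercalate)

-- Hypothetical, simplified AST: a cell is a single block, and a block is
-- either a paragraph or another table. Not the project's real types.
data Block = Para String | Table [[Block]]

-- Fall back to raw HTML for any table with a table in one of its cells.
renderTable :: [[Block]] -> String
renderTable rows
  | any (any isTable) rows = renderHtmlTable rows
  | otherwise              = renderPipeTable rows
  where
    isTable (Table _) = True
    isTable _         = False

renderPipeTable :: [[Block]] -> String
renderPipeTable []           = ""
renderPipeTable (hdr : body) =
  unlines $ row hdr
          : ("|" ++ concatMap (const " --- |") hdr)
          : map row body
  where
    row r              = "| " ++ intercalate " | " (map cellText r) ++ " |"
    cellText (Para s)  = s
    cellText (Table _) = ""   -- unreachable: tables with nesting go to HTML

renderHtmlTable :: [[Block]] -> String
renderHtmlTable rows = "<table>" ++ concatMap tr rows ++ "</table>"
  where
    tr r         = "<tr>" ++ concatMap td r ++ "</tr>"
    td (Para s)  = "<td>" ++ s ++ "</td>"
    td (Table t) = "<td>" ++ renderHtmlTable t ++ "</td>"
```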