bgamari / trac-to-remarkup

Moved to GitLab: https://gitlab.haskell.org/bgamari/trac-to-remarkup

Trac parser problems #18

Open tdammers opened 5 years ago

tdammers commented 5 years ago

The Trac parser has some issues:

tdammers commented 5 years ago

So far, I have addressed the issues as follows:

Additionally, I have arranged for the original Trac source to be committed to the wiki alongside the converted Markdown. This doesn't seem to affect the visual presentation, but it does allow getting the raw Trac source back out by simply appending .trac to the URL. This is useful both for checking the conversion output and in case we need to convert failed pages manually or semi-automatically later.
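For illustration, a minimal sketch of that idea, assuming a plain file-per-page wiki checkout; the function name and file layout here are mine, not the importer's actual code:

```haskell
import Data.Text (Text)
import qualified Data.Text.IO as T
import System.FilePath ((<.>), (</>))

-- Hypothetical helper: write the converted Markdown and the raw Trac
-- source side by side, so the original can later be retrieved by
-- appending .trac to the page URL.
writeWikiPage :: FilePath -> FilePath -> Text -> Text -> IO ()
writeWikiPage wikiDir pageName markdown tracSource = do
  T.writeFile (wikiDir </> pageName <.> "md")   markdown
  T.writeFile (wikiDir </> pageName <.> "trac") tracSource
```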

bgamari commented 5 years ago

Sounds quite reasonable to me!

tdammers commented 5 years ago

So it turns out that list parsing is completely and utterly broken; but this is really just a symptom of messy block separation in general.

I took a stab at fixing this, but ended up just breaking things elsewhere; I'm not sure I can come up with a better parser without changing the approach more drastically.

Part of the problem is that Trac syntax is extremely lenient and full of ambiguity. For example, list items can continue on subsequent lines, and the indentation determines where a list item ends; however, the indentation doesn't have to match exactly, it only has to be within bounds (anywhere from the left edge of the list item marker (-) up to, I think, two additional spaces). Further, lists can nest, and they can contain other block-level items. This is really tricky to capture correctly with a single-pass, character-based parser combinator.

So what I was thinking is that a better approach might be a two-stage parser: the first stage just groups lines into what belongs to the same block, and the second stage then parses those blocks one by one. This would be quite the operation though, so I don't know if we should go down that path.
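A rough sketch of what I mean, with a deliberately simplistic grouping rule (this is just the shape of the idea, not project code):

```haskell
import Data.Char (isSpace)

-- Stage 1: a block is a non-indented line plus every indented line that
-- immediately follows it. (Real Trac continuation rules are far more
-- lenient, e.g. the "within a couple of extra spaces" tolerance above.)
groupBlocks :: [String] -> [[String]]
groupBlocks []       = []
groupBlocks (l : ls) = (l : cont) : groupBlocks rest
  where
    (cont, rest)     = span isContinuation ls
    isContinuation x = not (null x) && isSpace (head x)

-- Stage 2: each grouped block can then be handed to a dedicated parser
-- (list item, table, paragraph, ...) without having to decide where the
-- block ends; here we just rejoin the lines as a placeholder.
parseBlock :: [String] -> String
parseBlock = unwords . map (dropWhile isSpace)
```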

The other approach I thought of was to take the original Trac parser and more or less port it to Haskell verbatim - it's not going to be pretty, and it involves reading ~PHP~ Python, but at least it would be based on an actual authoritative source. This too would be a rather massive effort though.

tdammers commented 5 years ago

Some more investigation reveals:

tdammers commented 5 years ago

First attempt at approach 3 (go through Trac, extract and parse HTML).

Using Pandoc's HTML parser with default options, an extremely dumb extractor that just drops everything up to the first `<h1>` and then also drops the last 6 blocks (which happens to be the length of the Trac footer), and Pandoc's default Markdown writer, I can convert https://ghc.haskell.org/trac/ghc/wiki/WikiFormatting?version=7 into https://gist.github.com/tdammers/40ba8a27d81b9fac046a6a89209e17fd.
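Roughly, that pipeline looks like this (a sketch; the local file handling and the exact cut-off logic are assumptions, not the code I actually ran):

```haskell
import qualified Data.Text.IO as T
import Text.Pandoc

-- Keep only the blocks between the first <h1> and the Trac footer,
-- which is assumed here to always be the last 6 blocks.
extractPayload :: Pandoc -> Pandoc
extractPayload (Pandoc meta blocks) = Pandoc meta payload
  where
    payload             = dropFooter (dropWhile (not . isH1) blocks)
    isH1 (Header 1 _ _) = True
    isH1 _              = False
    dropFooter bs       = take (length bs - 6) bs

main :: IO ()
main = do
  html <- T.readFile "WikiFormatting.html"   -- hypothetical local dump of the page
  md   <- runIOorExplode $ do
    doc <- readHtml def html
    writeMarkdown def (extractPayload doc)
  T.putStrLn md
```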

Some observations:

  1. Pandoc does not retain most class names, which means extracting the payload `<div>` by class name is not going to work.
  2. In fact, Pandoc flattens most of the HTML structure, retaining nesting only for lists.
  3. Pandoc can't handle complex table layouts (especially tables nested inside table cells), and just prints `[TABLE]` instead.

The first one is probably not going to be a problem, because we can unambiguously identify link targets as wiki, ticket, or external based on the URL alone.
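Something along these lines should be enough; the URL prefixes and constructor names below are hypothetical, not the converter's actual types:

```haskell
{-# LANGUAGE OverloadedStrings #-}
import           Data.Text (Text)
import qualified Data.Text as T

data LinkKind = WikiLink | TicketLink | ExternalLink
  deriving (Show, Eq)

-- Classify a link purely by its URL, without relying on class names.
classifyLink :: Text -> LinkKind
classifyLink url
  | "/trac/ghc/wiki/"   `T.isInfixOf` url = WikiLink
  | "/trac/ghc/ticket/" `T.isInfixOf` url = TicketLink
  | otherwise                             = ExternalLink
```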

The second one is not a problem, I think; we don't really want any deep nesting in the Markdown anyway.

The third one is unfortunate, and I would say it rules out the quick HTML -> Pandoc -> light processing -> Markdown route. Other options to investigate:

tdammers commented 5 years ago

The scraping approach looks promising.

https://github.com/bgamari/trac-to-remarkup/blob/wip/tobias-scrape/src/Trac/Scraper.hs

This one already produces fairly reasonable output from Trac HTML. Missing bits:

Nested tables are not supported yet, and probably won't be, but we might want to make the fallback ever so slightly nicer: our AST can actually represent them, but the Markdown formatter produces a malformed mess if we try to render them.
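One possible shape for a nicer fallback, using an entirely hypothetical, simplified AST rather than our real one: render any table that contains a nested table as inline HTML (which GitLab Markdown accepts) instead of pipe syntax, which cannot express nesting.

```haskell
import Data.List (intercalate)

-- Hypothetical, simplified AST: a cell is a single block, and a block is
-- either a paragraph or another table. Not the project's real types.
data Block = Para String | Table [[Block]]

-- Fall back to raw HTML for any table with a table in one of its cells.
renderTable :: [[Block]] -> String
renderTable rows
  | any (any isTable) rows = renderHtmlTable rows
  | otherwise              = renderPipeTable rows
  where
    isTable (Table _) = True
    isTable _         = False

renderPipeTable :: [[Block]] -> String
renderPipeTable []           = ""
renderPipeTable (hdr : body) =
  unlines $ row hdr
          : ("|" ++ concatMap (const " --- |") hdr)
          : map row body
  where
    row r              = "| " ++ intercalate " | " (map cellText r) ++ " |"
    cellText (Para s)  = s
    cellText (Table _) = ""   -- unreachable: tables with nesting go to HTML

renderHtmlTable :: [[Block]] -> String
renderHtmlTable rows = "<table>" ++ concatMap tr rows ++ "</table>"
  where
    tr r         = "<tr>" ++ concatMap td r ++ "</tr>"
    td (Para s)  = "<td>" ++ s ++ "</td>"
    td (Table t) = "<td>" ++ renderHtmlTable t ++ "</td>"
```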