Normalize internal document representation

GoogleCodeExporter commented 8 years ago

The current document model (Text.Pandoc.Definition) allows different instances 
that represent the same text. For instance, the following HTML inputs should all 
yield the same representation:

<i><i>foo</i>bar</i>

<i>foo</i><i>bar</i>

<i>foobar</i>

Normalization could take place at the internal document representation, so you 
get syntax normalization for free. You could start with 'data Inline' in module 
'Text.Pandoc.Definition' and sanitize nested elements etc. For instance, 
'Inline' can be 'Link [Inline] Target', so a link text can contain another link, 
which is nonsense and which most markup languages cannot express anyway.
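
A rough sketch of what such a pass might look like, just to make the idea concrete. It does not use the real 'Inline' type: the cut-down 'Inline' below (only 'Str', 'Emph' and 'Link') is an assumption made to keep the example self-contained, and 'normalize' and 'stripLinks' are hypothetical names, not existing pandoc functions.

```haskell
-- Cut-down stand-in for 'Inline' from Text.Pandoc.Definition.
-- Assumption: only three constructors; the real type has many more.
data Inline
  = Str String
  | Emph [Inline]
  | Link [Inline] Target
  deriving (Eq, Show)

type Target = (String, String)  -- (URL, title)

-- Hypothetical helper: remove every link nested in a link's text,
-- keeping the nested link's own text.
stripLinks :: [Inline] -> [Inline]
stripLinks = concatMap go
  where
    go (Link ys _) = stripLinks ys
    go (Emph ys)   = [Emph (stripLinks ys)]
    go y           = [y]

-- Hypothetical normalizer: flatten directly nested emphasis, drop empty
-- emphasis, strip links inside links, and merge adjacent siblings.
normalize :: [Inline] -> [Inline]
normalize = merge . concatMap step
  where
    step (Emph xs)   = wrap (normalize (concatMap unEmph xs))
    step (Link xs t) = [Link (normalize (stripLinks xs)) t]
    step x           = [x]

    -- An emphasis with no content disappears entirely.
    wrap [] = []
    wrap xs = [Emph xs]

    -- <i><i>foo</i>bar</i> becomes <i>foobar</i>.
    unEmph (Emph ys) = concatMap unEmph ys
    unEmph y         = [y]

    -- <i>foo</i><i>bar</i> becomes <i>foobar</i>.
    merge (Str a : Str b : rest)   = merge (Str (a ++ b) : rest)
    merge (Emph a : Emph b : rest) = merge (Emph (normalize (a ++ b)) : rest)
    merge (x : rest)               = x : merge rest
    merge []                       = []
```

With these definitions, the three HTML inputs above, written as '[Emph [Emph [Str "foo"], Str "bar"]]', '[Emph [Str "foo"], Emph [Str "bar"]]' and '[Emph [Str "foobar"]]', all normalize to the same value. A real implementation would have to cover every 'Inline' and 'Block' constructor.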

Original issue reported on code.google.com by siehea...@googlemail.com on 23 Jul 2010 at 10:19

GoogleCodeExporter commented 8 years ago

It would be easy to add normalization between the reader and writer.  I 
experimented a bit with this, though, and I'm worried about the performance 
implications.  I guess it's a tradeoff between performance and the advantages, 
whatever they may be, of normalization. I will experiment some more....

Original comment by fiddloso...@gmail.com on 8 Dec 2010 at 8:00

GoogleCodeExporter commented 8 years ago

A --normalize option has been added. Because of the performance penalty, I'm 
not going to make it the default.

% pandoc --normalize
*hi**there*
<p
><em
  >hithere</em
  ></p
>

Original comment by fiddloso...@gmail.com on 27 Jan 2011 at 6:24

Changed state: Fixed

Normalize internal document representation #250