Shinmera / plump

Practically Lenient and Unimpressive Markup Parser for Common Lisp
https://shinmera.github.io/plump
zlib License

Newlines and indentation in XML files get turned into text nodes #20

frejanordsiek closed this issue 6 years ago

frejanordsiek commented 6 years ago

Not sure if this is deliberate design or a bug.

Take an XML file that has newlines and indentation for readability, such as the following test file

<something>
  <subel/>
</something>

and call it test.xml. If I then parse it and look at the second child with

(elt (plump:children (plump:parse #p"test.xml")) 1)

I get a text node like

#<PLUMP-DOM:TEXT-NODE {1004CE6373}>

If I then look at its text with

(plump:text (elt (plump:children (plump:parse #p"test.xml")) 1))

I get

"
"

So there is a text node containing just the newline. Similarly, the first child of the first child node of root is also a text node, whose text can be retrieved with

(plump:text (elt (plump:children (elt (plump:children (plump:parse #p"test.xml")) 0)) 0))

and is

"
  "

which has the newline and the indentation.

The whole tree is a bit easier to see if it is serialized with plump-sexp, using

(plump-sexp:serialize (plump:parse #p"test.xml"))

which gives

(:!ROOT
 (:SOMETHING "
  "
  (:SUBEL) "
")
 "
")

Shinmera commented 6 years ago

I'm not sure I understand what you see as the problem here. Plump parses things in a preserving way. This may not be very important for some XML formats, but it most certainly is for HTML and XML+HTML, or for similar markup formats based on XML.

frejanordsiek commented 6 years ago

OK, so whitespace is preserved so that parse and serialize are exact inverses of each other, which is needed for HTML. That makes sense. So it is by design and not a bug.

Shinmera commented 6 years ago

For what it's worth, you can strip whitespace text from the dom with plump:strip.
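
A minimal sketch of that suggestion, not taken from the thread itself. It assumes Plump and plump-sexp are loaded through Quicklisp, and that plump:strip modifies the tree it is given in place and drops whitespace-only text nodes, as the comment above implies. The document is parsed from an inline string instead of test.xml so the snippet is self-contained, and `*root*` is just an illustrative name.

```lisp
;; Assumes Quicklisp is available; loads Plump and the plump-sexp add-on.
(ql:quickload '(:plump :plump-sexp))

;; Same document as test.xml in the report, built inline (~% emits a newline).
(defparameter *root*
  (plump:parse (format nil "<something>~%  <subel/>~%</something>~%")))

;; Assumption: strip works destructively on the DOM it is handed.
(plump:strip *root*)

;; Inspect the result; the whitespace-only strings should no longer appear.
(plump-sexp:serialize *root*)
```

If the original whitespace is still needed for round-tripping, it is probably safer to parse the file a second time and strip only that copy.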