Shinmera / plump

Practically Lenient and Unimpressive Markup Parser for Common Lisp
https://shinmera.github.io/plump
zlib License
119 stars 21 forks source link

Quoting in strings #40

Closed jw-rav closed 1 year ago

jw-rav commented 2 years ago

Control chracters (tab, CR, NL, and probably others, too) in strings are not properly quoted:

(let ((plump:*tag-dispatchers* plump:*xml-tags*)) (plump:serialize (plump:parse "<foo bar=\"a&#x9;b&#xD;c&#xA;d\" />")))

results in tab, CR and LF emitted literally instead of encoded in the way there were parsed.

jw-rav commented 2 years ago

Attached patch fixes it for me. But I am not confident whether this is the proper way to fix it.

encode.txt

(ugh? I can't attch files with .patch extension?)

Shinmera commented 2 years ago

Plump makes no promises about round-tripping documents. As far as I understand the XML spec, it is perfectly valid to emit literal tabs and newlines into attribute values, too.

Your fix has multiple issues, such as using #\Return instead of #\Linefeed and also leading to anything in the document, even plaintext bodies, now being encoded instead of remaining plain.

jw-rav commented 2 years ago

Thanks for your comment, @Shinmera !

Round-tripping is not the goal here. But I'd assume not changing semantics would be great.

Although I am not an XML expert, it seems to me that current behaviour changes semantics of the content. If I understand this W3C recommendadion correctly, literal whitespace in attribute values should be passed as #\Space to the application. In Contrast, encoded attribute values should pass the encoded value to the application. IMHO, this is a significant change in semantics. Even when there's no (syntactic) requirement to encode such characters, it seems to be at least "good practice" to preserve semantics. Having at least an option to encode them would be great.

Addressing your comments about the patch:

Shinmera commented 2 years ago

Err, sorry, in my initial comment I meant "using #\Newline instead of #\Linefeed". The newline character is implementation dependent and may not necessarily equal LF.

I'll have a think about this, but can't promise to patch it soon

jw-rav commented 2 years ago

I'll be happy to provide a PR once we have a common understanding about the issue and whether/how to fix it.