RazrFalcon / xmlparser

A low-level, pull-based, zero-allocation XML 1.0 parser.
Apache License 2.0

XML names starting with xml (including xml) do not produce an error #23

Closed. MoSal closed 1 year ago

MoSal commented 1 year ago

Hello.

Names starting with xml (regardless of casing) do not produce a parsing error.

I don't know if this belongs here or in roxmltree, but I'm reporting here since XmlCharExt is a part of this crate.


I was writing an XML name checker for a custom derive crate to catch invalid names at compile time. But then I stumbled upon this issue while testing.

RazrFalcon commented 1 year ago

Can you provide an example?

MoSal commented 1 year ago

I mean, just:

<xml></xml>

or

<xml1></xml1>

The standard states that:

Names beginning with the string "xml", or with any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.

I thought this means such names shouldn't be allowed.

W3Schools' own tutorial mentions:

  • Element names are case-sensitive
  • Element names must start with a letter or underscore
  • Element names cannot start with the letters xml (or XML, or Xml, etc)
  • Element names can contain letters, digits, hyphens, underscores, and periods
  • Element names cannot contain spaces

But I tried Python's stdlib implementation and Firefox's DOM parser, and neither of them cares. So maybe I misunderstood the standard, or this part of it is just generally ignored!

Not sure what's right for xmlparser/roxmltree here, but I will just disallow this from my side to be extra strict.
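A minimal sketch of what such a strict check could look like, assuming the only rule enforced is the reserved-prefix sentence quoted above. The function name `has_reserved_xml_prefix` is hypothetical; it is not part of xmlparser or roxmltree, and it does not validate anything else about the name.

```rust
/// Returns `true` if `name` begins with "xml" in any ASCII casing,
/// i.e. it matches (('X'|'x') ('M'|'m') ('L'|'l')) at the start.
///
/// Hypothetical helper for illustration only; not an xmlparser API.
fn has_reserved_xml_prefix(name: &str) -> bool {
    // Work on bytes so that a multi-byte first character can never
    // cause a slicing panic on a char boundary.
    name.as_bytes()
        .get(..3)
        .map_or(false, |prefix| prefix.eq_ignore_ascii_case(b"xml"))
}
```

Note that a check this strict would also flag names that the XML-related specifications define themselves, such as `xml:lang` or `xmlns`, so a real checker would probably need an allow-list for those.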

RazrFalcon commented 1 year ago

lxml parses it just fine, so I guess this is not a bug. If unsure, try using roxmltree/testing-tools/lxml-ast.py. I'm following its logic.

The XML spec is a convoluted mess which no one follows, so figuring out what is right or wrong is mostly impossible. roxmltree/xmlparser simply mimics lxml/libxml2 behaviour.