Open klonos opened 6 months ago
Helpful comments, thank you. Especially:
Anyway, the regex used by Xss::attributes() in modern Drupal seems to be more accurate, and seems to allow the most commonly-used cases. I trust that our Drupal brethren have done their research and think that we should backport any fixes/improvements
An attribute name like -
or --
is still allowed by /^[-a-zA-Z][-a-zA-Z0-9]*/
, but https://www.w3.org/TR/REC-xml/#NT-NameChar does not allow those attribute names, since it does not allow an hyphen as first character.
If Drupal uses preg_match('/^([-a-zA-Z][-a-zA-Z0-9]*)/', $attr, $match)
, it still allows attribute names that should not be used.
@kiamlaluno indeed, but with a caveat: the limitations you are linking to apply to the XML specs - for HTML specs, these apply instead: https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#syntax-attributes ...as stated there:
Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, """, "'", ">", "/", "=", the control characters, and any characters that are not defined by Unicode.
...and as stated right below that:
XML-compatible attribute names are those that match the Name production defined in the XML specification [XML] and that contain no ":" characters, and whose first three characters are not a case-insensitive match for the string "xml".
If we are going for XML-compatible, then the regex should be something like /^[a-zA-Z][-a-zA-Z0-9]*/
instead. According to the specs, :
and underscores are also allowed as the NameStartChar
(the first character of an attribute), but these must be edge cases. If we were to allow these as well, then the regex should be /^[a-zA-Z:_][-a-zA-Z0-9:_]*/
. Then, it is also stated that NameChar
(the second part of the regex - after the first character of the attribute name) can also include dots. Which would make the final regex /^[a-zA-Z:_][-a-zA-Z0-9:_\.]*/
...however I still see that as an edge case as well.
The point is: do we want to make _filter_xss_attributes()
fully XML-compatible, or simply allow a limited set of things that will only extend the current allowances to whatever is required for SVG support?
...I should also note that our current regex ('/^([-a-zA-Z]+)/'
) also allows for one or many dashes as attribute names. So you can currently have -
. --
, ---
etc. 🤷🏼 ...the main point I am trying to solve in order to unblock my work in #6487 is to allow the x1
/x2
/y1
/y2
attributes, and that is achieved by the addition of [-a-zA-Z0-9]*
. The main thing is the addition of 0-9
, which allows for digits in the attribute names, but since attribute names cannot start with a digit, we need to ensure that the first character does not allow that. Hence the ^[-a-zA-Z]
before the [-a-zA-Z0-9]*
.
I hope that this clarifies things a bit more.
the limitations you are linking to apply to the XML specs
That is the same link given in the previous comment.
I then came across this article, which explained things a bit clearer/simpler and pointed to this: https://www.w3.org/TR/REC-xml/#NT-NameChar
If attribute names must consist of one or more characters other than the space characters, U+0000 NULL, """, "'", ">", "/", "=", the control characters, then also the regular expression used by Drupal is too restrictive.
Given the excluded characters, there is no risk of XSS, by allowing all the characters HTML allows for attribute names.
I understand what this issue is trying to achieve, but if the regular expression is going to be changed, it should be changed to allow what HTML allows, so there will not be any further issue we could discover later.
@kiamlaluno Thanks for the links. The first is more simple, is it enough?
The regular expression, therefore, for parsing an HTML attribute is as follows:
[a-zA-Z_:][-a-zA-Z0-9_:.]
This, obviously, leaves the CombiningChar and Extender characters out of the mix.And, a final note: When reading the W3C’s specifications, does anyone else have difficulty finding what you’re looking for? I swear, when I’m looking for something simple, the mountains of documents I have to wade through makes it impossible to find anything with any accuracy.
... if the regular expression is going to be changed, it should be changed to allow what HTML allows, so there will not be any further issue we could discover later.
That would need to be a decision by the @backdrop/core-committers and the Security Team I guess. On one hand, I understand the value in allowing what the specs allow - on the other hand though, restricting things to the minimum that we need in order to get things done sounds less risky and more "future-proof" ...which I guess is the reason behind it being that way for so long even in Drupal - only allow what is really needed - nothing more. Well, now with the introduction of SVG support in core, we actually have a need for more, but one could very well argue not a need for everything possible.
In fact, I think that perhaps we should either introduce a new parameter in the existing _filter_xss_attributes()
function, or introduce new function(s), specific to SVG/XML. That should minimise any attack/risk vectors, because we'd use these functions/parameters specifically and explicitly where needed, without being more permissive/risky with existing functions. For example, we already have filter_xss()
and we've added filter_xss_admin()
on top of that, which allows a wider variety of HTML tags. Perhaps we should introduce a filter_xss_svg()
or filter_xss_xml()
function?
@izmeez I apologize for the confusion: The links I gave are the ones gave from @klonos. I was just trying to understand which attribute names are allowed, to make a better comment on the proposed regular expressions.
I totally agree that not allowing attributes like y1
or x12
is a way too restrictive. I was just looking for a way to allow other attributes as well, considering that _filter_xss_attributes()
should avoid XSS attacks. It should not be more restrictive than necessary.
I also agree that using the same function in two different contexts (HTML and XML) makes writing code that is correct in both the cases difficult.
OK, here's an initial PR: https://github.com/backdrop/backdrop/pull/4805
<line>
and the x1
/x2
/y1
/y2
attributes, which should help catch any regression explicitly around that in the future.PS: I've added a comment in https://www.drupal.org/project/drupal/issues/2844627#comment-15651494 to point out that the standard allows underscores as the first character, but NOT characters after that. Also pointed out that the current regex implementation allows dashes as the first letter, which is against the standard.
I left a comment in the PR: The CSpell directives do not include a period.
Thanks @kiamlaluno for pointing that out and @herbdool for removing the periods. I have my doubts about that, but in order to not derail this issue here, I've started a discussion in Zulip: https://backdrop.zulipchat.com/#narrow/stream/218635-Backdrop/topic/phpcs.20is.20in.20a.20war.20with.20cspell/near/446336928
PR still needs review/testing.
- allows underscore as the first character (but not elsewhere, also as per https://www.w3.org/TR/REC-xml/#NT-Name), which is being discussed in https://www.drupal.org/project/drupal/issues/2844627
In https://www.drupal.org/project/drupal/issues/2844627 someone mentioned the XML vs XHTML/HTML standards, so perhaps we should allow underscores in the second part of the regex.
Either way, I would like to get this change committed, in order to unblock #6487 for the next minor release. The main change that I'm interested in is allowing numbers after the first character of attributes, which is the current bug. So I am thinking that (as I also mentioned in the d.org issue) since the purpose of _filter_xss_attributes()
is to filter XSS attempts (as the name of the function implies) rather than making sure that we adhere to the standards/specs strictly 100%, then we shouldn't focus on trying to make it do that (at least not consider it a primary goal - keep it in mind and make best efforts, but not be pedantic about it).
Thoughts?
So I am thinking that (as I also mentioned in the d.org issue) since the purpose of _filter_xss_attributes() is to filter XSS attempts (as the name of the function implies) rather than making sure that we adhere to the standards/specs strictly 100%, then we shouldn't focus on trying to make it do that (at least not consider it a primary goal - keep it in mind and make best efforts, but not be pedantic about it).
_filter_xss_attributes()
should not strip out perfectly fine attributes we are going to use. I would not go into accepting every attribute the HTML/XML recommendations allow, especially the edge cases.
@avpaderno yes, not edge cases. The standard allows for colons (:
), dots (.
), and mid-dots (·
), but as also stated in the inline comment in the PR, we should be treating those as edge cases and not supporting them. What about underscores though? Those seem like a common use case. Should we allow them, despite not adhering 100% to the spec?
An underscore is allowed at the beginning of the attribute name. I would make the code permissive about underscores after the first character.
I am not sure about considering a colon an edge case: It is perfectly fine to have a SVG like the following.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="391" height="391" viewBox="-70.5 -70.5 391 391" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<rect fill="#fff" stroke="#000" x="-70" y="-70" width="390" height="390"/>
<g opacity="0.8">
<rect x="25" y="25" width="200" height="200" fill="lime" stroke-width="4" stroke="pink" />
<circle cx="125" cy="125" r="75" fill="orange" />
<polyline points="50,150 50,200 200,200 200,100" stroke="red" stroke-width="4" fill="none" />
<line x1="50" y1="50" x2="200" y2="200" stroke="blue" stroke-width="4" />
</g>
</svg>
xmlns:xlink
is not an attack vector, and the given link is not supposed to be used to render the SVG file. (It is just an XML namespace.)
Good catch re xmlns:xlink
@avpaderno 👍🏼 ...I just read through https://stackoverflow.com/questions/18467982/are-svg-parameters-such-as-xmlns-and-version-needed and https://developer.mozilla.org/en-US/docs/Web/SVG/Namespaces_Crash_Course as well and now have a better understanding of XML namespaces and prefixes. So yeah, a colon may not be such an edge case as I thought initially after all.
Having said that, looking at the list of attributes in the sidebar of the page in the last link above, I do see a series of xlink:*
and xml:*
attributes, but they have all been deprecated (and oddly, no mention of xmlns:xlink
specifically). That doesn't mean that they aren't in use though; nor that other namespaces won't be used anywhere else.
Before I update the PR accordingly, I'll wait for some feedback from others - especially our @backdrop/core-committers and Sec team. In the meantime, I'd also like a decision re allowing underscores in general.
Summary to help things:
NameStartChar
) - practical examples show it being used in the middle of the attribute name though (like in xmlns:xlink
)._filter_xss_attributes()
_filter_xss_attributes()
backport to D7
)Other notable things:
_filter_xss_attributes()
, or just cater for security and be otherwise "permissive" about spec non-conformities?_-
or _--
or _---
or test---123
etc. as attribute names or ignore those as edge cases? (will increase the complexity of the regex, so perhaps follow-up issue as well)...I've added the needs feedback
label to this, and also tentatively needs change record
(which may mean that we should only add this in 1.29.0).
@avpaderno sorry to be going back and forth about this, but I now think that we might have been interpreting/reading the spec wrong. Ugh!... can you please also do a sanity check for me?:
The spec says:
NameStartChar
(1st character of the attribute) may be :
| [A-Z]
| _
| [a-z]
etc.NameChar
(any character of the attribute after the 1st one) may be -
| .
| [0-9]
BUT in addition to the characters allowed by NameStartChar
Name
may then be NameStartChar
(NameChar
)*Specifically point 2 above. Am I reading it right?
Yes, NameChar
includes the characters listed for NameStartChar
, which means that for example :a:attribute-name
would be valid.
While working on #6487, I stumbled upon a few problems:
theme_icon()
was not allowing the use of<line>
elements in SVGs (they were being stripped out), while it was for example allowing<path>
,<circle>
,<rect>
etc. - we'll deal with that in #6576.x1
/x2
/y1
/y2
attributes used by<line>
were being stripped out. So something like this:<line class="footer" x1="0" y1="143" x2="100" y2="143" stroke-width="15" />
...was ending up like this:<line class="footer" stroke-width="15" />
This issue here is to fix the second problem.
While doing some research, I found the following issues related to
_filter_xss_attributes()
in the Drupal core queue:The second issue was filed back in August 2011 against D7 core, and there are very few comments, with the last one in May 2015 pointing to
Xss::attributes()
.I reviewed the code in our
_filter_xss_attributes()
function and compared it to modern Drupal'sXss::attributes()
, and by only changing 2 lines, the problem of stripping out thex1
/x2
/y1
/y2
attributes from the SVG was solved. I changed the following:What's changed in the above diff is basically the regex used to match the attributes.
^([-a-zA-Z]+)
regex is looking for attribute names that have lower- and upper-case latin letters, and dashes in them. The+
means that it is looking for one or more of these letters and/or dashes, which means that is treats a single dash as a valid attribute (-="something"
) 👎🏼([-a-zA-Z][-a-zA-Z0-9]*)
regex is looking for attribute names that start with a single letter of the latin alphabet (can be either lower- or upper-case), which is then followed by either zero or more of a series of:x1
/x2
/y1
/y2
attributes)I did some further research on what is allowed and what not in both HTML as well as XML attribute names specifically, but https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#attribute-name is difficult to navigate and read through (the links seem to be looping through the document and it gets confusing). I then came across this article, which explained things a bit clearer/simpler and pointed to this: https://www.w3.org/TR/REC-xml/#NT-NameChar
Name
is defined asNameStartChar
(NameChar
)*NameStartChar
can be":" | [A-Z] | "_" | [a-z]
...plus some extended additions that might be edge cases. But it clearly cannot be a dash in any case (like the current regex)NameChar
can be whateverNameStartChar
allows, plus| "-" | "." | [0-9]
...plus some extended additions again, that might be edge cases.Anyway, the regex used by
Xss::attributes()
in modern Drupal seems to be more accurate, and seems to allow the most commonly-used cases. I trust that our Drupal brethren have done their research and think that we should backport any fixes/improvements