[D8] Backport the cleanup and fixes from `Xss::attributes()` to our own `_filter_xss_attributes()` function

klonos commented 4 months ago

While working on #6487, I stumbled upon a few problems:

theme_icon() was not allowing the use of <line> elements in SVGs (they were being stripped out), while it was for example allowing <path>, <circle>, <rect> etc. - we'll deal with that in #6576.
Once I got that figured out, the next problem was that the x1/x2/y1/y2 attributes used by <line> were being stripped out. So something like this: <line class="footer" x1="0" y1="143" x2="100" y2="143" stroke-width="15" /> ...was ending up like this: <line class="footer" stroke-width="15" />

This issue here is to fix the second problem.

While doing some research, I found the following issues related to _filter_xss_attributes() in the Drupal core queue:

https://www.drupal.org/project/drupal/issues/2844627 (marked for backport to D7)
https://www.drupal.org/project/drupal/issues/1256514

The second issue was filed back in August 2011 against D7 core, and there are very few comments, with the last one in May 2015 pointing to Xss::attributes().

I reviewed the code in our _filter_xss_attributes() function and compared it to modern Drupal's Xss::attributes(), and by only changing 2 lines, the problem of stripping out the x1/x2/y1/y2 attributes from the SVG was solved. I changed the following:

    switch ($mode) {
      case 0:
        // Attribute name, href for instance.
-       if (preg_match('/^([-a-zA-Z]+)/', $attr, $match)) {
+       if (preg_match('/^([-a-zA-Z][-a-zA-Z0-9]*)/', $attr, $match)) {
          $attrname = strtolower($match[1]);
          $skip = (
            $attrname == 'style' ||
            substr($attrname, 0, 2) == 'on' ||
            substr($attrname, 0, 1) == '-' ||
            // Ignore long attributes to avoid unnecessary processing overhead.
            strlen($attrname) > 96
          );
          $working = $mode = 1;
-         $attr = preg_replace('/^[-a-zA-Z]+/', '', $attr)
+         $attr = preg_replace('/^[-a-zA-Z][-a-zA-Z0-9]*/', '', $attr);
        }
        break;

What's changed in the above diff is basically the regex used to match the attributes.

Before: the ^([-a-zA-Z]+) regex is looking for attribute names that have lower- and upper-case latin letters, and dashes in them. The + means that it is looking for one or more of these letters and/or dashes, which means that is treats a single dash as a valid attribute (-="something") 👎🏼
After: the ([-a-zA-Z][-a-zA-Z0-9]*) regex is looking for attribute names that start with a single letter of the latin alphabet (can be either lower- or upper-case), which is then followed by either zero or more of a series of:
- dashes
- lower- and upper-case latin letters
- numbers (this is the bit that fixed the stripping of the x1/x2/y1/y2 attributes)

I did some further research on what is allowed and what not in both HTML as well as XML attribute names specifically, but https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#attribute-name is difficult to navigate and read through (the links seem to be looping through the document and it gets confusing). I then came across this article, which explained things a bit clearer/simpler and pointed to this: https://www.w3.org/TR/REC-xml/#NT-NameChar

Name is defined as NameStartChar (NameChar)*
NameStartChar can be ":" | [A-Z] | "_" | [a-z] ...plus some extended additions that might be edge cases. But it clearly cannot be a dash in any case (like the current regex)
then the zero or more NameChar can be whatever NameStartChar allows, plus | "-" | "." | [0-9] ...plus some extended additions again, that might be edge cases.

Anyway, the regex used by Xss::attributes() in modern Drupal seems to be more accurate, and seems to allow the most commonly-used cases. I trust that our Drupal brethren have done their research and think that we should backport any fixes/improvements

izmeez commented 4 months ago

Helpful comments, thank you. Especially:

Anyway, the regex used by Xss::attributes() in modern Drupal seems to be more accurate, and seems to allow the most commonly-used cases. I trust that our Drupal brethren have done their research and think that we should backport any fixes/improvements

avpaderno commented 4 months ago

An attribute name like - or -- is still allowed by /^[-a-zA-Z][-a-zA-Z0-9]*/, but https://www.w3.org/TR/REC-xml/#NT-NameChar does not allow those attribute names, since it does not allow an hyphen as first character.
If Drupal uses preg_match('/^([-a-zA-Z][-a-zA-Z0-9]*)/', $attr, $match), it still allows attribute names that should not be used.

klonos commented 4 months ago

@kiamlaluno indeed, but with a caveat: the limitations you are linking to apply to the XML specs - for HTML specs, these apply instead: https://www.w3.org/TR/2012/WD-html-markup-20120329/syntax.html#syntax-attributes ...as stated there:

Attribute names must consist of one or more characters other than the space characters, U+0000 NULL, """, "'", ">", "/", "=", the control characters, and any characters that are not defined by Unicode.

...and as stated right below that:

XML-compatible attribute names are those that match the Name production defined in the XML specification [XML] and that contain no ":" characters, and whose first three characters are not a case-insensitive match for the string "xml".

If we are going for XML-compatible, then the regex should be something like /^[a-zA-Z][-a-zA-Z0-9]*/ instead. According to the specs, : and underscores are also allowed as the NameStartChar (the first character of an attribute), but these must be edge cases. If we were to allow these as well, then the regex should be /^[a-zA-Z:_][-a-zA-Z0-9:_]*/. Then, it is also stated that NameChar (the second part of the regex - after the first character of the attribute name) can also include dots. Which would make the final regex /^[a-zA-Z:_][-a-zA-Z0-9:_\.]*/ ...however I still see that as an edge case as well.

The point is: do we want to make _filter_xss_attributes() fully XML-compatible, or simply allow a limited set of things that will only extend the current allowances to whatever is required for SVG support?

klonos commented 4 months ago

...I should also note that our current regex ('/^([-a-zA-Z]+)/') also allows for one or many dashes as attribute names. So you can currently have -. --, --- etc. 🤷🏼 ...the main point I am trying to solve in order to unblock my work in #6487 is to allow the x1/x2/y1/y2 attributes, and that is achieved by the addition of [-a-zA-Z0-9]*. The main thing is the addition of 0-9, which allows for digits in the attribute names, but since attribute names cannot start with a digit, we need to ensure that the first character does not allow that. Hence the ^[-a-zA-Z] before the [-a-zA-Z0-9]*.

I hope that this clarifies things a bit more.

avpaderno commented 4 months ago

the limitations you are linking to apply to the XML specs

That is the same link given in the previous comment.

I then came across this article, which explained things a bit clearer/simpler and pointed to this: https://www.w3.org/TR/REC-xml/#NT-NameChar

avpaderno commented 4 months ago

If attribute names must consist of one or more characters other than the space characters, U+0000 NULL, """, "'", ">", "/", "=", the control characters, then also the regular expression used by Drupal is too restrictive.
Given the excluded characters, there is no risk of XSS, by allowing all the characters HTML allows for attribute names.

avpaderno commented 4 months ago

I understand what this issue is trying to achieve, but if the regular expression is going to be changed, it should be changed to allow what HTML allows, so there will not be any further issue we could discover later.

izmeez commented 4 months ago

@kiamlaluno Thanks for the links. The first is more simple, is it enough?

The regular expression, therefore, for parsing an HTML attribute is as follows: [a-zA-Z_:][-a-zA-Z0-9_:.] This, obviously, leaves the CombiningChar and Extender characters out of the mix.

And, a final note: When reading the W3C’s specifications, does anyone else have difficulty finding what you’re looking for? I swear, when I’m looking for something simple, the mountains of documents I have to wade through makes it impossible to find anything with any accuracy.

klonos commented 4 months ago

... if the regular expression is going to be changed, it should be changed to allow what HTML allows, so there will not be any further issue we could discover later.

That would need to be a decision by the @backdrop/core-committers and the Security Team I guess. On one hand, I understand the value in allowing what the specs allow - on the other hand though, restricting things to the minimum that we need in order to get things done sounds less risky and more "future-proof" ...which I guess is the reason behind it being that way for so long even in Drupal - only allow what is really needed - nothing more. Well, now with the introduction of SVG support in core, we actually have a need for more, but one could very well argue not a need for everything possible.

In fact, I think that perhaps we should either introduce a new parameter in the existing _filter_xss_attributes() function, or introduce new function(s), specific to SVG/XML. That should minimise any attack/risk vectors, because we'd use these functions/parameters specifically and explicitly where needed, without being more permissive/risky with existing functions. For example, we already have filter_xss() and we've added filter_xss_admin() on top of that, which allows a wider variety of HTML tags. Perhaps we should introduce a filter_xss_svg() or filter_xss_xml() function?

avpaderno commented 4 months ago

@izmeez I apologize for the confusion: The links I gave are the ones gave from @klonos. I was just trying to understand which attribute names are allowed, to make a better comment on the proposed regular expressions.

avpaderno commented 4 months ago

I totally agree that not allowing attributes like y1 or x12 is a way too restrictive. I was just looking for a way to allow other attributes as well, considering that _filter_xss_attributes() should avoid XSS attacks. It should not be more restrictive than necessary.

I also agree that using the same function in two different contexts (HTML and XML) makes writing code that is correct in both the cases difficult.

klonos commented 3 months ago

OK, here's an initial PR: https://github.com/backdrop/backdrop/pull/4805

crossports the D7 https://www.drupal.org/project/drupal/issues/2847553 / https://git.drupalcode.org/project/drupal/-/commit/6b7e2b1ae48bfbad052085bd7a335751b8310cdb (which in turn is a backport of the D8 https://www.drupal.org/project/drupal/issues/2609928 / https://www.drupal.org/commitlog/commit/2/3e28af5d77c901f39ce6568d7f0ccfa90b63b94c)
removes the dash from being allowed as the first character of attribute names (as per https://www.w3.org/TR/REC-xml/#NT-Name)
allows underscore as the first character (but not elsewhere, also as per https://www.w3.org/TR/REC-xml/#NT-Name), which is being discussed in https://www.drupal.org/project/drupal/issues/2844627
adds an extra test, specific to svg <line> and the x1/x2/y1/y2 attributes, which should help catch any regression explicitly around that in the future.

PS: I've added a comment in https://www.drupal.org/project/drupal/issues/2844627#comment-15651494 to point out that the standard allows underscores as the first character, but NOT characters after that. Also pointed out that the current regex implementation allows dashes as the first letter, which is against the standard.

avpaderno commented 3 months ago

I left a comment in the PR: The CSpell directives do not include a period.

klonos commented 3 months ago

Thanks @kiamlaluno for pointing that out and @herbdool for removing the periods. I have my doubts about that, but in order to not derail this issue here, I've started a discussion in Zulip: https://backdrop.zulipchat.com/#narrow/stream/218635-Backdrop/topic/phpcs.20is.20in.20a.20war.20with.20cspell/near/446336928

PR still needs review/testing.

klonos commented 2 months ago

allows underscore as the first character (but not elsewhere, also as per https://www.w3.org/TR/REC-xml/#NT-Name), which is being discussed in https://www.drupal.org/project/drupal/issues/2844627

In https://www.drupal.org/project/drupal/issues/2844627 someone mentioned the XML vs XHTML/HTML standards, so perhaps we should allow underscores in the second part of the regex.

Either way, I would like to get this change committed, in order to unblock #6487 for the next minor release. The main change that I'm interested in is allowing numbers after the first character of attributes, which is the current bug. So I am thinking that (as I also mentioned in the d.org issue) since the purpose of _filter_xss_attributes() is to filter XSS attempts (as the name of the function implies) rather than making sure that we adhere to the standards/specs strictly 100%, then we shouldn't focus on trying to make it do that (at least not consider it a primary goal - keep it in mind and make best efforts, but not be pedantic about it).

Thoughts?

avpaderno commented 2 months ago

So I am thinking that (as I also mentioned in the d.org issue) since the purpose of _filter_xss_attributes() is to filter XSS attempts (as the name of the function implies) rather than making sure that we adhere to the standards/specs strictly 100%, then we shouldn't focus on trying to make it do that (at least not consider it a primary goal - keep it in mind and make best efforts, but not be pedantic about it).

_filter_xss_attributes() should not strip out perfectly fine attributes we are going to use. I would not go into accepting every attribute the HTML/XML recommendations allow, especially the edge cases.

klonos commented 2 months ago

@avpaderno yes, not edge cases. The standard allows for colons (:), dots (.), and mid-dots (·), but as also stated in the inline comment in the PR, we should be treating those as edge cases and not supporting them. What about underscores though? Those seem like a common use case. Should we allow them, despite not adhering 100% to the spec?

avpaderno commented 2 months ago

An underscore is allowed at the beginning of the attribute name. I would make the code permissive about underscores after the first character.

avpaderno commented 2 months ago

I am not sure about considering a colon an edge case: It is perfectly fine to have a SVG like the following.

<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE svg PUBLIC "-//W3C//DTD SVG 1.1//EN" "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd">
<svg width="391" height="391" viewBox="-70.5 -70.5 391 391" xmlns="http://www.w3.org/2000/svg" xmlns:xlink="http://www.w3.org/1999/xlink">
<rect fill="#fff" stroke="#000" x="-70" y="-70" width="390" height="390"/>
<g opacity="0.8">
    <rect x="25" y="25" width="200" height="200" fill="lime" stroke-width="4" stroke="pink" />
    <circle cx="125" cy="125" r="75" fill="orange" />
    <polyline points="50,150 50,200 200,200 200,100" stroke="red" stroke-width="4" fill="none" />
    <line x1="50" y1="50" x2="200" y2="200" stroke="blue" stroke-width="4" />
</g>
</svg>

xmlns:xlink is not an attack vector, and the given link is not supposed to be used to render the SVG file. (It is just an XML namespace.)

klonos commented 2 months ago

Good catch re xmlns:xlink @avpaderno 👍🏼 ...I just read through https://stackoverflow.com/questions/18467982/are-svg-parameters-such-as-xmlns-and-version-needed and https://developer.mozilla.org/en-US/docs/Web/SVG/Namespaces_Crash_Course as well and now have a better understanding of XML namespaces and prefixes. So yeah, a colon may not be such an edge case as I thought initially after all.

Having said that, looking at the list of attributes in the sidebar of the page in the last link above, I do see a series of xlink:* and xml:* attributes, but they have all been deprecated (and oddly, no mention of xmlns:xlink specifically). That doesn't mean that they aren't in use though; nor that other namespaces won't be used anywhere else.

Before I update the PR accordingly, I'll wait for some feedback from others - especially our @backdrop/core-committers and Sec team. In the meantime, I'd also like a decision re allowing underscores in general.

Summary to help things:

colon:
- allowed by the spec, HOWEVER: it should only be allowed as the first character though (in what the spec names NameStartChar) - practical examples show it being used in the middle of the attribute name though (like in xmlns:xlink).
- not such an edge case after all
- not allowed by the current regex in Backdrop's _filter_xss_attributes()
- not allowed by the regex in modern Drupal
underscore:
- allowed by the spec, but only as the first character of attributes in XML
- XHTML/HTML seems to allow it in general, even after the first character
- seems to be common(?)
- not allowed at all by the current regex in Backdrop's _filter_xss_attributes()
- not allowed by the regex in modern Drupal, but being discussed in https://www.drupal.org/project/drupal/issues/2844627 (which is also marked as backport to D7)

Other notable things:

should we in general be trying to strictly conform 100% to the specs in the regex in _filter_xss_attributes(), or just cater for security and be otherwise "permissive" about spec non-conformities?
if we are strict about it in core, do we allow contrib/custom to alter the regex (separate issue), or how do we help them in these scenarios?
if we are strict in about it in core, then do we cater for XML, or XHTML/HTML, or both/all?
should we cater for things like not allowing _- or _-- or _--- or test---123 etc. as attribute names or ignore those as edge cases? (will increase the complexity of the regex, so perhaps follow-up issue as well)

klonos commented 2 months ago

...I've added the needs feedback label to this, and also tentatively needs change record (which may mean that we should only add this in 1.29.0).

klonos commented 2 months ago

@avpaderno sorry to be going back and forth about this, but I now think that we might have been interpreting/reading the spec wrong. Ugh!... can you please also do a sanity check for me?:

The spec says:

NameStartChar (1st character of the attribute) may be : | [A-Z] | _ | [a-z] etc.
NameChar (any character of the attribute after the 1st one) may be - | . | [0-9] BUT in addition to the characters allowed by NameStartChar
the attribute Name may then be NameStartChar (NameChar)*

Specifically point 2 above. Am I reading it right?

avpaderno commented 2 months ago

Yes, NameChar includes the characters listed for NameStartChar, which means that for example :a:attribute-name would be valid.

backdrop / backdrop-issues

[D8] Backport the cleanup and fixes from `Xss::attributes()` to our own `_filter_xss_attributes()` function #6575