char regex .*{0,1} is incorrect (antisamy.xml)

dqw / owaspantisamy

Automatically exported from code.google.com/p/owaspantisamy

Looking at antisamy.xml, SVN revision 137:

<attribute name="char">
  <regexp-list>
    <regexp value=".*{0,1}"/>
  </regexp-list>
</attribute>

I think the intent is to allow zero or one character, as described at http://www.w3.org/TR/html401/types.html#type-character. If that's the intent, the regex should be ".{0,1}".

To be 100% correct, however, the regex should also allow character references, including numeric character references such as &#229; (å) or &#x3072; (ひ) (see http://www.w3.org/TR/html401/charset.html#h-5.3.1) and character entity references such as &lt; or &quot; (see http://www.w3.org/TR/html401/charset.html#h-5.3.2 and http://www.w3.org/TR/html401/charset.html#entities).
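
The stricter rule described above (empty, one character, or one character reference) could be sketched in Java roughly as follows. This is only an illustration: the pattern itself, the class name CharAttributeRegexSketch, and the helper isValidChar are assumptions made for this example and are not taken from antisamy.xml or the AntiSamy API. It also does not check named references against the HTML 4.01 entity list; any well-formed name is accepted.

```java
import java.util.regex.Pattern;

// Hypothetical sketch, not part of AntiSamy.
class CharAttributeRegexSketch {

    // Assumed pattern: an optional group that is either one decimal reference,
    // one hexadecimal reference, one named reference, or one plain character.
    static final Pattern CHAR_VALUE = Pattern.compile(
            "(?:&#[0-9]+;"                 // decimal reference, e.g. &#229;
            + "|&#[xX][0-9a-fA-F]+;"       // hexadecimal reference, e.g. &#x3072;
            + "|&[A-Za-z][A-Za-z0-9]*;"    // named reference, e.g. &quot;
            + "|.)?");                     // any single character (no line terminators)

    // matches() requires the whole value to match, so extra characters are rejected.
    static boolean isValidChar(String value) {
        return CHAR_VALUE.matcher(value).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"", ".", "..", "&quot;", "&#229;", "&#x3072;", "&quot;a"};
        for (String sample : samples) {
            System.out.println("'" + sample + "' -> " + isValidChar(sample));
        }
    }
}
```

Whether a pattern like this is needed depends on whether the attribute value is matched before or after character references are decoded; if the value is already decoded, ".{0,1}" alone would cover the same cases.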

At runtime this will be enforced correctly. The "&quot;" will be treated as a single
character. I confirmed it with the following test case:

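// A single character is allowed, so the attribute is kept.
String s = "<td char='.'>test</td>";
CleanResults cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") > -1 );

// Two characters are rejected, so the attribute is removed.
s = "<td char='..'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") == -1 );

// A single entity reference counts as one character.
s = "<td char='&quot;'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") > -1 );

// An entity reference followed by another character is rejected.
s = "<td char='&quot;a'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") == -1 );

// Two entity references are rejected.
s = "<td char='&quot;&amp;'>test</td>";
cr = as.scan(s, policy);
assertTrue(cr.getCleanHTML().indexOf("char") == -1 );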

Original comment by arshan.d...@gmail.com on 8 Mar 2010 at 5:54

Changed state: Invalid

char regex .*{0,1} is incorrect (antisamy.xml) #69